- Updated: January 30, 2026
- 6 min read
Simulating Complex Multi‑Turn Tool Calling Interactions in Stateless Execution Environments
Direct Answer
The paper introduces DiGiT‑TC (Digital Twin Generation for Tool‑Calling), a framework that automatically creates realistic, multi‑turn tool‑calling dialogues in stateless execution environments. By synthesizing high‑fidelity interaction data, DiGiT‑TC enables developers to train, evaluate, and benchmark AI language models on complex tool‑use scenarios without the need for costly human annotation.
Background: Why This Problem Is Hard
Modern AI assistants increasingly rely on tool calling—the ability to invoke external APIs, databases, or software utilities to fulfill user requests. In real‑world deployments, these calls often occur over several conversational turns, requiring the model to maintain context, handle errors, and adapt to dynamic tool responses. The challenges are threefold:
- Stateless execution constraints: Many production platforms (e.g., serverless functions, containerized microservices) treat each request as independent, discarding prior state. Simulating multi‑turn interactions under such constraints demands a mechanism to reconstruct context on the fly.
- Data scarcity: Collecting authentic multi‑turn tool‑calling logs is expensive, privacy‑sensitive, and often proprietary. Existing benchmark suites either simplify interactions to single turns or rely on manually curated scripts that fail to capture the richness of real usage.
- Evaluation fidelity: Without realistic test data, it is difficult to assess whether a language model can reliably orchestrate tools, recover from failures, or respect usage policies in production‑grade settings.
Current approaches attempt to address these gaps by either (a) hand‑crafting synthetic dialogues, which quickly become brittle, or (b) replaying logged sessions, which are limited in diversity and cannot be generated on demand. Neither solution scales to the breadth of tools and interaction patterns emerging in today’s AI‑driven products.
What the Researchers Propose
DiGiT‑TC reframes the data‑generation problem as a digital twin construction task. The framework consists of three core components:
- Tool Specification Engine: A declarative schema that describes each tool’s inputs, output format, error codes, and side effects. This engine abstracts away the actual implementation, allowing the twin to simulate any tool that conforms to the schema.
- Interaction Synthesizer: A language‑model‑driven generator that produces multi‑turn conversational flows. It leverages the tool specifications to decide when to invoke a tool, how to handle its response, and how to formulate follow‑up queries.
- Stateless Execution Emulator: A runtime that mimics the stateless nature of production services. It receives a single request (the current user utterance plus a compact context token) and reconstructs the full dialogue state by querying the synthetic history stored in a lightweight datastore.
By decoupling tool semantics from execution, DiGiT‑TC can generate unlimited, diverse dialogues that faithfully respect the constraints of stateless environments. The framework also produces paired data: the raw conversation and the exact sequence of tool calls, enabling supervised fine‑tuning and rigorous evaluation.
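The paper does not publish a concrete specification format, but a minimal sketch of what a conforming tool spec and its validation might look like is below. All field names (`inputs`, `errors`, `side_effects`, and the spec contents themselves) are illustrative assumptions, not the paper's actual schema:

```python
# Hypothetical tool specification in the spirit of DiGiT-TC's
# Tool Specification Engine; field names are illustrative.
weather_spec = {
    "name": "get_weather",
    "inputs": {
        "city": {"type": "string", "required": True},
        "units": {"type": "string", "required": False, "default": "metric"},
    },
    "output": {"temperature": "number", "conditions": "string"},
    "errors": {"CITY_NOT_FOUND": "No forecast available for that city"},
    "side_effects": [],  # a read-only tool
}

def validate_call(spec: dict, args: dict) -> list[str]:
    """Return a list of problems with a proposed tool call (empty = valid)."""
    problems = []
    for field, rules in spec["inputs"].items():
        if rules.get("required") and field not in args:
            problems.append(f"missing required field: {field}")
    for field in args:
        if field not in spec["inputs"]:
            problems.append(f"unknown field: {field}")
    return problems
```

Because the spec is pure data, the twin can validate and mock any tool that conforms to it without touching the tool's real implementation.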
How It Works in Practice
The end‑to‑end workflow can be visualized as a loop:
- Define the toolset: Engineers write JSON‑like specifications for each API (e.g., a weather service, a calendar scheduler, a code executor). The spec includes required fields, optional parameters, and possible error responses.
- Seed the conversation: The Interaction Synthesizer receives a high‑level task description (e.g., “Plan a weekend trip with flight bookings and restaurant reservations”). It then generates the first user utterance.
- Stateless emulation: The emulator receives the utterance and a short context token (e.g., a hash of prior turns). It queries the synthetic history to reconstruct the full state, determines whether a tool call is needed, and invokes the Tool Specification Engine to produce a mock response.
- Iterate: The synthesized tool response is fed back to the language model, which decides the next system utterance or user clarification. This cycle repeats until a termination condition (e.g., task completion) is met.
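The loop above can be sketched end to end. The decision policy and tool responses here are trivial stubs (the real components are LM-driven, which the paper does not reduce to code), but the token-keyed state reconstruction mirrors the described emulator:

```python
import hashlib
import json

# Toy stand-ins for DiGiT-TC's components, showing control flow only.
STORE = {}  # lightweight datastore: context token -> synthetic history

def hash_context(history):
    """Derive a compact context token and persist the history under it."""
    token = hashlib.sha256(json.dumps(history).encode()).hexdigest()[:16]
    STORE[token] = list(history)
    return token

def reconstruct_state(token):
    """Stateless emulation: rebuild the full dialogue state from the store."""
    return STORE[token]

def generate_dialogue(task, max_turns=4):
    history = [{"role": "user", "content": f"Help me with: {task}"}]
    for turn in range(max_turns):
        token = hash_context(history)     # only the token crosses the wire
        state = reconstruct_state(token)  # full state rebuilt per request
        needs_tool = turn % 2 == 0        # stub decision policy, not the LM's
        if needs_tool:
            history.append({"role": "tool", "content": "mock response"})
        else:
            history.append({"role": "assistant", "content": "next utterance"})
    return history
```

Each iteration behaves like one stateless request: nothing survives between turns except the token and the datastore entry it points to.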
The data flow forms a cycle: user utterance → Stateless Execution Emulator (state reconstruction from the context token) → Tool Specification Engine (mock tool response) → Interaction Synthesizer (next utterance), repeating until the task terminates.
What sets DiGiT‑TC apart from prior synthetic data pipelines is its dynamic grounding in tool specifications and its ability to simulate error handling, retries, and context reconstruction—all within a single request/response cycle that mirrors real serverless deployments.
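Error paths come from the same specification as success paths. A hedged sketch of how a twin might simulate a flaky tool with retries follows; the failure rate, seeding, and retry policy are illustrative assumptions, not details from the paper:

```python
import random

def mock_tool_response(spec, args, fail_rate=0.3, rng=None):
    """Return either a well-formed output or an error code drawn from the spec."""
    rng = rng or random.Random(0)  # seeded for reproducible dialogues
    if spec["errors"] and rng.random() < fail_rate:
        code = rng.choice(sorted(spec["errors"]))
        return {"error": code, "message": spec["errors"][code]}
    return {field: "<synthetic>" for field in spec["output"]}

def call_with_retries(spec, args, max_attempts=3, rng=None):
    """Retry on simulated errors, recording every attempt for the dialogue."""
    rng = rng or random.Random(0)
    attempts = []
    for _ in range(max_attempts):
        resp = mock_tool_response(spec, args, rng=rng)
        attempts.append(resp)
        if "error" not in resp:
            break
    return attempts
```

Recording every attempt, not just the final outcome, is what lets the generated dialogues teach a model to recover from failures rather than only observe clean runs.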
Evaluation & Results
To validate the framework, the authors conducted three complementary experiments:
- Coverage Benchmark: They measured the diversity of generated dialogues across 20 heterogeneous tools (ranging from arithmetic calculators to third‑party booking APIs). DiGiT‑TC produced over 10,000 unique multi‑turn sessions, covering 95% of the combinatorial space of tool‑call sequences defined in the specifications.
- Model Fine‑Tuning: A base GPT‑3.5 model was fine‑tuned on 5,000 DiGiT‑TC dialogues. When evaluated on a held‑out set of real‑world tool‑calling logs, the fine‑tuned model achieved a 23% absolute improvement in task success rate compared to the untuned baseline.
- Stateless Robustness Test: The authors deployed the emulator behind a serverless endpoint and measured latency and memory footprint. The stateless reconstruction added an average overhead of only 12 ms per turn, confirming that the approach is practical for production‑scale services.
Collectively, these results demonstrate that synthetic data generated by DiGiT‑TC is not only plentiful but also effective at bridging the gap between research prototypes and real‑world deployments. The improvement in task success underscores the value of training on data that faithfully mirrors the constraints of stateless tool calling.
Why This Matters for AI Systems and Agents
For practitioners building AI‑driven agents, DiGiT‑TC offers a turnkey solution to three pressing pain points:
- Rapid prototyping: Engineers can spin up a synthetic dataset for any new tool without waiting for user data, accelerating the iteration cycle for feature rollouts.
- Robust evaluation: By generating edge‑case scenarios (e.g., malformed inputs, intermittent API failures), teams can stress‑test their agents under conditions that are hard to capture in live traffic.
- Compliance and safety: Synthetic dialogues can be audited for policy violations before any real user interaction, reducing the risk of harmful outputs.
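The edge-case generation mentioned above can be sketched as a simple mutation pass over a valid tool call. DiGiT‑TC's actual edge-case synthesis is LM-driven; this mutation set (dropped fields, wrong types, unknown fields) is an illustrative assumption:

```python
def mutate_call(args, spec):
    """Yield (label, malformed_args) variants of a valid call for stress testing.
    The mutation set is illustrative, not DiGiT-TC's actual generator."""
    for field in spec["inputs"]:
        # drop the field entirely (catches missing-required handling)
        yield ("missing:" + field, {k: v for k, v in args.items() if k != field})
        # replace the value with the wrong type
        if field in args:
            broken = dict(args)
            broken[field] = 12345
            yield ("wrong_type:" + field, broken)
    # add a field the spec does not define
    yield ("unknown_field", {**args, "bogus": True})
```

Feeding each variant through the emulator exercises the agent's error handling under conditions that rarely appear in live traffic.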
These capabilities align with the emerging need for agent orchestration platforms that support modular tool integration while preserving stateless scalability. Organizations that adopt DiGiT‑TC can expect smoother deployment pipelines, lower data‑collection costs, and higher confidence in their agents’ real‑world performance.
What Comes Next
While DiGiT‑TC marks a significant step forward, several avenues remain open for exploration:
- Cross‑domain tool composition: Extending the specification language to capture dependencies between tools (e.g., chaining a translation API with a sentiment analyzer) could unlock richer multi‑modal workflows.
- Human‑in‑the‑loop validation: Incorporating lightweight human review of generated dialogues would help surface subtle semantic errors that automated checks might miss.
- Adaptive context compression: Research into more efficient context tokenization could further reduce the latency overhead in stateless environments.
- Open‑source ecosystem: Publishing a community‑driven repository of tool specifications would accelerate adoption across industries.
Future research may also investigate how DiGiT‑TC interacts with emerging AI platform services that provide built‑in tool‑calling primitives, potentially allowing seamless integration of synthetic data pipelines into end‑to‑end MLOps workflows.
For readers interested in the full technical details, the original pre‑print is available on arXiv: Simulating Complex Multi‑Turn Tool Calling Interactions in Stateless Execution Environments.