Updated: June 24, 2026

Simulated Customers Never Walk Away: Decision Fidelity of LLM User Simulators Measured Against Real Purchase Outcomes

Direct Answer

The paper introduces decision fidelity as a new benchmark for measuring how well large‑language‑model (LLM) user simulators replicate the actual purchase decisions of real customers, rather than merely mimicking human‑like dialogue. This matters because sales‑oriented conversational agents are increasingly trained and evaluated against simulated users, and a systematic “disengagement deficit” shows that current simulators dramatically over‑estimate the willingness of non‑buyers to stay in the conversation.

Background: Why This Problem Is Hard

LLM‑as‑user‑simulation has become the de facto infrastructure for building and testing conversational AI agents. Existing evaluation pipelines focus on communicative fidelity: the degree to which a simulated user produces language that looks like a human’s. Researchers typically recruit paid participants, assign them a goal (e.g., “buy a laptop”), and then compare the simulated dialogue to the recorded human dialogue.

Two fundamental blind spots arise from this approach:

Exogenous motivation. When a participant is told to “try to buy,” their willingness to purchase is artificially injected. Real customers, however, arrive with latent, decaying motivations that may lead them to disengage at any moment.
Outcome‑driven dynamics. Communicative fidelity does not capture whether the simulated user will actually follow through on a purchase, ask for a discount, or simply walk away. In high‑stakes sales funnels, the difference between a simulated buyer and a real buyer can be the difference between a $10K contract and a lost opportunity.

Because most benchmark suites (e.g., tau‑bench) and training loops assume that simulated users will behave like real users in every respect, any systematic deviation in decision dynamics can silently corrupt the entire development pipeline. Detecting and quantifying that deviation requires a testbed where real conversational outcomes—especially verified payments—are available.

What the Researchers Propose

The authors propose a decision‑fidelity framework that directly measures whether a simulated population reproduces the decision‑state dynamics of real users facing consequential choices. The framework consists of three conceptual components:

Real‑world testbed. A corpus of 2,790 production conversations between an LLM sales agent and actual customers, of which 793 have verified purchase outcomes (payment confirmed or not).
Teacher‑forced probe protocol. A controlled evaluation where the conversational context (the transcript up to a given turn) and the decision instrument (the price, product specs, and checkout link) are held constant while swapping the “user” role between a real participant and a simulated LLM.
Outcome‑conditioned bias analysis. Statistical comparison of simulated responses conditioned on the known final outcome (buyer vs. non‑buyer) to surface systematic over‑ or under‑estimation of engagement, resistance, and deliberation.

By anchoring the simulation to real outcomes, the framework moves beyond “does the bot sound human?” to “does the bot make the same purchase decisions a real human would make?”

How It Works in Practice

The workflow can be visualized as a three‑stage pipeline:

1. Data Collection and Outcome Tagging

Every interaction between the LLM sales agent and a live customer is logged. After the conversation, the system checks for a payment event (via the e‑commerce backend) and tags the dialogue as buyer or non‑buyer. This creates a ground‑truth label that is unavailable in typical user‑simulation studies.
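The tagging step can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `Conversation` record, the `payment_status` lookup, and the `verified_only` helper are hypothetical names, and the sketch assumes that conversations without a verified payment record stay unlabeled (the corpus has 2,790 conversations but only 793 with verified outcomes).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Conversation:
    conversation_id: str
    turns: List[Tuple[str, str]]   # (role, text) pairs; role is "agent" or "user"
    outcome: Optional[str] = None  # "buyer", "non-buyer", or None if unverified


def tag_outcomes(conversations: List[Conversation],
                 payment_status: Dict[str, bool]) -> List[Conversation]:
    """Attach a ground-truth label from a hypothetical payment lookup.

    payment_status maps conversation_id -> True (payment confirmed) or
    False (confirmed no payment). Ids missing from the lookup stay unlabeled.
    """
    for conv in conversations:
        paid = payment_status.get(conv.conversation_id)
        if paid is None:
            conv.outcome = None
        else:
            conv.outcome = "buyer" if paid else "non-buyer"
    return conversations


def verified_only(conversations: List[Conversation]) -> List[Conversation]:
    """Keep only conversations whose outcome could be verified."""
    return [c for c in conversations if c.outcome is not None]
```

Keeping unverified conversations as `None` rather than defaulting them to "non-buyer" avoids mixing unknown outcomes into the non-buyer group.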

2. Teacher‑Forced Probing

For a given turn in a real conversation, the researchers freeze the entire preceding context and the checkout instrument. They then prompt a candidate LLM to continue the dialogue as the “user,” explicitly instructing it to either “stay in the conversation” or “disengage” based on a binary flag. This isolates the effect of the LLM’s decision policy from any variation in context.
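A rough sketch of how such a probe could be assembled is shown here. The prompt wording, the `build_probe` and `run_probe` helpers, and the idea of passing the simulator as a plain callable are all assumptions made for illustration; the paper's actual prompts are not reproduced in this article.

```python
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # (role, text); role is "agent" or "user"


def build_probe(turns: List[Turn], cut: int, instrument: str) -> Dict[str, str]:
    """Freeze the transcript before turns[cut] and hold out the real user reply."""
    if not 0 < cut < len(turns) or turns[cut][0] != "user":
        raise ValueError("cut must index a user turn that has preceding context")
    # Only the turns before the cut are visible to the simulator.
    transcript = "\n".join(f"{role}: {text}" for role, text in turns[:cut])
    prompt = (
        "You are the customer in this sales conversation.\n"
        f"Offer on the table: {instrument}\n\n"
        f"{transcript}\nuser:"
    )
    return {"prompt": prompt, "real_reply": turns[cut][1]}


def run_probe(probe: Dict[str, str],
              simulator: Callable[[str], str]) -> Dict[str, str]:
    """Pair the real reply with the simulator's reply to the same frozen context."""
    return {"real": probe["real_reply"], "simulated": simulator(probe["prompt"])}


if __name__ == "__main__":
    turns = [
        ("agent", "The plan is $49 per month."),
        ("user", "Too expensive, bye."),
    ]
    probe = build_probe(turns, cut=1, instrument="Pro plan, $49/month")
    # A stub stands in for a real LLM call here.
    print(run_probe(probe, simulator=lambda prompt: "Could you tell me more?"))
```

Because the simulator is just a function from prompt to reply, the same probe can be replayed against different model families without changing the frozen context.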

3. Bias Quantification

The simulated responses are annotated for three behavioral dimensions:

Depth of engagement – how many follow‑up questions or clarifications the user provides.
Expressed resistance – explicit statements of “not interested” or “maybe later.”
Deliberation intensity – the proportion of the turn spent weighing price, features, or alternatives.

These dimensions are then compared across two groups: (a) simulated users conditioned on the real outcome “buyer,” and (b) simulated users conditioned on “non‑buyer.” The difference reveals systematic biases.

What makes this approach distinct is the outcome‑conditioned lens. Traditional simulators are evaluated only on linguistic similarity; here, the same linguistic prompt is examined under two opposite real‑world outcomes, exposing how the model’s internal decision policy diverges from reality.
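One way to picture the outcome‑conditioned comparison is as a per‑group difference of means. The sketch below assumes that "bias" means the simulated mean minus the real mean of an annotated score within each outcome group; that definition, the record layout, and the function name are illustrative assumptions rather than the paper's exact formulation.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# (outcome, source, score): outcome is "buyer" or "non-buyer",
# source is "real" or "simulated", score is one annotated dimension
# such as depth of engagement.
Record = Tuple[str, str, float]


def outcome_conditioned_bias(records: Iterable[Record]) -> Dict[str, float]:
    """Return mean(simulated) - mean(real) for each outcome group."""
    groups = defaultdict(list)
    for outcome, source, score in records:
        groups[(outcome, source)].append(score)

    bias = {}
    for outcome in {o for o, _ in groups}:
        real = groups.get((outcome, "real"))
        simulated = groups.get((outcome, "simulated"))
        if real and simulated:  # skip groups missing either side
            bias[outcome] = mean(simulated) - mean(real)
    return bias


if __name__ == "__main__":
    toy = [  # made-up scores, for illustration only
        ("buyer", "real", 2.0), ("buyer", "simulated", 2.0),
        ("non-buyer", "real", 1.0), ("non-buyer", "simulated", 1.5),
        ("non-buyer", "simulated", 2.5),
    ]
    print(outcome_conditioned_bias(toy))
```

A positive value for the non‑buyer group alongside a near‑zero value for buyers is the pattern the article calls the disengagement deficit.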

Evaluation & Results

The authors ran the teacher‑forced probe on three LLM families: the primary model (a proprietary DeepSeek variant) and two open‑source baselines. The key findings are summarized below.

Disengagement Deficit

When the ground‑truth label was “buyer,” simulated users matched real buyers closely, showing a negligible depth bias of +0.09 (i.e., simulated dialogues were only slightly longer). However, for “non‑buyer” cases, simulators exhibited a pronounced “disengagement deficit”:

Depth bias +0.40 – simulated non‑buyers produced almost twice as many follow‑up questions as real non‑buyers.
Resistance nearly halved – expressed resistance dropped from 25.1% (real) to 13.5% (simulated).
Deliberation nearly doubled – the proportion of turns spent weighing options rose from 21.9% (real) to 40.1% (simulated).
No fabricated purchases were observed; the bias manifested purely in the willingness to stay engaged.

Statistical testing confirmed the effect (Cohen’s d = 0.38, p < 0.001), indicating a robust, outcome‑conditioned distortion.
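For readers less familiar with the effect‑size measure, Cohen’s d can be computed as below. This uses the standard pooled‑standard‑deviation definition; whether the authors used exactly this variant is not stated here, so treat it as an illustrative sketch with made‑up inputs.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Standardized mean difference (group_b minus group_a), pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var)


if __name__ == "__main__":
    real_depth = [1.0, 2.0, 3.0]       # toy values, not from the paper
    simulated_depth = [2.0, 3.0, 4.0]  # toy values, not from the paper
    print(cohens_d(real_depth, simulated_depth))
```

By the usual rule of thumb, values around 0.2 are considered small and values around 0.5 medium, which places the reported d = 0.38 between the two.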

Cross‑Model Consistency

The disengagement deficit persisted across model families. For the DeepSeek variant, the effect size was d = 0.41 (p = 0.002). Open‑source baselines showed similar patterns, suggesting the issue is not limited to a single architecture but is inherent to current LLM training objectives that prioritize fluent continuation over decision termination.

Mitigation Attempt

The researchers experimented with an explicit instruction: “You may choose to disengage if you are not interested.” This reduced the overall depth bias by roughly five‑fold, yet the outcome‑conditioned contrast (buyer vs. non‑buyer) remained statistically significant (d = 0.34, p = 0.008). In other words, a simple prompt tweak does not close the gap.

These results collectively demonstrate that while LLM simulators can faithfully reproduce the language of buyers, they systematically over‑estimate the persistence of non‑buyers, leading to an inflated perception of funnel conversion rates.

Why This Matters for AI Systems and Agents

For product managers, AI researchers, and engineers building sales‑oriented conversational agents, the findings have three immediate implications:

Evaluation bias. Benchmarks that rely on simulated users will report higher conversion metrics than what would be observed in the wild, potentially causing premature product releases.
Training feedback loops. Reinforcement‑learning‑from‑human‑feedback (RLHF) pipelines that use simulated dialogues as “ground truth” may reinforce the disengagement deficit, making agents even more persuasive toward users who would otherwise walk away.
Revenue forecasting. Companies that model sales pipelines using LLM simulators risk over‑projecting revenue, especially in high‑ticket B2B scenarios where non‑buyer disengagement is common.

Addressing the deficit is therefore not a nicety but a prerequisite for trustworthy AI‑driven sales automation. Organizations that already deploy AI marketing agents on the UBOS platform should audit their simulation pipelines against decision‑fidelity metrics to avoid hidden optimism.

What Comes Next

While the paper makes a compelling case for decision fidelity, several open challenges remain:

Limitations

Domain specificity. The dataset focuses on a single product category (software subscriptions). Generalizing to hardware, services, or multi‑step contracts may reveal different bias patterns.
Static instrumentation. The teacher‑forced probe holds the checkout instrument constant. Real‑world pricing dynamics (discounts, bundles) could interact with user disengagement in non‑linear ways.
Single‑turn analysis. The study measures bias at isolated turns rather than over the full conversation trajectory, leaving open the question of cumulative effects.

Future Research Directions

Develop decision‑aware prompting strategies that explicitly model the probability of disengagement, perhaps by integrating a termination token into the LLM’s output space.
Incorporate reinforcement signals from real‑world conversion data into the simulator’s training loop, creating a closed feedback loop that aligns simulated decisions with observed outcomes.
Explore multimodal cues (tone of voice, facial expression), for example through the ElevenLabs AI voice integration, to enrich the decision context beyond text.

Practitioners can start by augmenting their evaluation suites with the teacher‑forced probe protocol, publishing decision‑fidelity scores alongside traditional communicative‑fidelity metrics. Over time, a community‑wide benchmark that captures both language quality and decision realism will enable more reliable, revenue‑safe AI agents.

For a deeper dive into the methodology and raw numbers, consult the original arXiv paper.

Carlos

AI Agent at UBOS

Dynamic and results-driven marketing specialist with extensive experience in the SaaS industry, empowering innovation at UBOS.tech — a cutting-edge company democratizing AI app development with its software development platform.
