Carlos
  • Updated: March 11, 2026
  • 6 min read

The Value Sensitivity Gap: How Clinical Large Language Models Respond to Patient Preference Statements in Shared Decision-Making

Direct Answer

The paper introduces the Value Sensitivity Gap framework, a systematic way to measure how clinical large language models (LLMs) react to explicit patient‑value statements during shared decision‑making. It matters because even when models acknowledge patient preferences, they often fail to shift recommendations accordingly, exposing a hidden source of bias that can affect treatment outcomes and regulatory compliance.

Background: Why This Problem Is Hard

Shared decision‑making (SDM) is a cornerstone of modern, patient‑centered care. In SDM, clinicians present evidence‑based options while actively incorporating the patient’s values, goals, and lifestyle preferences. Translating that nuanced dialogue into a prompt for an LLM seems straightforward, but three intertwined challenges make it difficult:

  • Implicit value weighting: Most LLMs are trained on large, heterogeneous corpora where medical advice is presented in a “one‑size‑fits‑all” manner. The models learn to optimize for clinical correctness, not for aligning with individual value statements.
  • Prompt ambiguity: Patient‑generated statements can be vague (“I want a quick recovery”) or contradictory (“I prefer natural remedies but also want the most effective treatment”). Disentangling the semantic intent requires sophisticated reasoning that current prompting techniques rarely guarantee.
  • Lack of evaluation standards: Existing LLM benchmarks focus on factual accuracy, toxicity, or general reasoning. None explicitly test whether a model’s recommendation changes when a patient’s preference changes, leaving a blind spot for regulators and developers.

Consequently, clinicians risk deploying AI assistants that sound empathetic but silently ignore the very preferences that define shared decision‑making. The research community has called for “value disclosure labels” to surface these hidden biases, but empirical data to populate such labels has been missing—until now.

What the Researchers Propose

The authors propose a two‑part framework:

  1. Value Sensitivity Index (VSI): A quantitative metric that captures how much a model's recommendation, rated on a 1‑to‑5 aggressiveness scale, shifts when the patient's stated value changes, normalized so that higher values indicate greater sensitivity.
  2. Directional Concordance Score (DCS): A per‑trial pass/fail check, aggregated as a proportion of trials, of whether the direction of the recommendation shift (e.g., more aggressive vs. more conservative) matches the expressed patient preference. A computational sketch of both metrics follows this list.
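
To make the metrics concrete, here is a minimal computational sketch. The paper's exact formulas are not reproduced above, so the normalization (mean absolute shift on the 1‑to‑5 aggressiveness scale, divided by the maximum possible 4‑point shift) and the sign convention for preference direction are illustrative assumptions:

```python
from statistics import mean

def value_sensitivity_index(baseline: list[float], valued: list[float]) -> float:
    """Mean absolute shift in recommendation aggressiveness (1-5 scale),
    normalized by the maximum possible 4-point shift.
    Assumed formula -- the paper's exact normalization may differ."""
    shifts = [abs(v - b) for b, v in zip(baseline, valued)]
    return mean(shifts) / 4.0

def directional_concordance(baseline: list[float], valued: list[float],
                            expected: list[int]) -> float:
    """Fraction of trials whose recommendation moved in the direction the
    patient's value implies (+1 = more aggressive, -1 = more conservative)."""
    hits = [(v - b) * d > 0 for b, v, d in zip(baseline, valued, expected)]
    return sum(hits) / len(hits)

# Toy run: three vignettes whose value statements all imply conservatism.
base = [3.0, 4.0, 2.0]      # default aggressiveness per vignette
after = [2.5, 3.0, 2.0]     # aggressiveness after value injection
direction = [-1, -1, -1]
print(value_sensitivity_index(base, after))             # 0.125
print(directional_concordance(base, after, direction))  # ~0.67
```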

To compute these metrics, the study introduces a factorial experiment that varies three dimensions:

  • LLM family: GPT‑5.2, Claude 4.5 Sonnet, Gemini 3 Pro, DeepSeek‑R1.
  • Clinical domain: Two distinct specialties (e.g., chronic pain management and diabetes medication selection).
  • Value condition: Thirteen predefined patient‑value statements ranging from “minimize side effects” to “prioritize cost savings.”

Each trial produces a recommendation matrix that the researchers compare against a ground‑truth “value‑aligned” recommendation crafted by domain experts.
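
The size of the design follows directly from the factorial crossing: 4 model families × 2 domains × 13 value conditions = 104 cells, which matches the Phase 1 trial count reported below. A sketch of the trial grid (the value‑condition labels are placeholders):

```python
from itertools import product

models = ["GPT-5.2", "Claude 4.5 Sonnet", "Gemini 3 Pro", "DeepSeek-R1"]
domains = ["chronic pain management", "diabetes medication selection"]
values = [f"value_{i:02d}" for i in range(13)]  # stand-ins for the 13 statements

trials = list(product(models, domains, values))
assert len(trials) == 104  # 4 x 2 x 13 -- matches the Phase 1 trial count
```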

How It Works in Practice

The workflow can be visualized as a pipeline; a code sketch of the core steps follows the numbered list below:

[Figure: Value Sensitivity Gap pipeline diagram]

  1. Scenario Generation: Real‑world Medicaid encounter notes (98,759 de‑identified records) are distilled into concise clinical vignettes.
  2. Value Injection: For each vignette, a value statement is appended to the prompt (e.g., “The patient prefers a treatment that allows them to return to work within two weeks”).
  3. Model Query: The prompt is sent to each LLM family using its standard API. The model returns a treatment recommendation and a brief justification.
  4. Metric Computation: The recommendation is scored against the expert baseline to produce VSI and DCS values.
  5. Mitigation Layer (Phase 2): Two optional interventions are tested:
    • Decision‑matrix overlay: A post‑processing step that re‑ranks options based on a pre‑defined value‑weight matrix.
    • VIM self‑report: The model is asked to self‑assess its alignment with the stated value before finalizing the answer.
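
A minimal sketch of steps 2 through 4 is shown below. The prompt template and the `query_model` / `rate_aggressiveness` helpers are illustrative placeholders, not the paper's actual harness:

```python
def query_model(model: str, prompt: str) -> dict:
    """Placeholder for the model-specific API call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def rate_aggressiveness(recommendation: str) -> float:
    """Placeholder: map a free-text recommendation onto the 1-5 scale."""
    raise NotImplementedError

def build_prompt(vignette: str, value_statement: str | None) -> str:
    """Step 2: append the patient-value statement, or omit it for control trials."""
    prompt = (f"Clinical vignette:\n{vignette}\n\n"
              "Recommend a treatment and briefly justify it.")
    if value_statement is not None:
        prompt += f"\n\nPatient preference: {value_statement}"
    return prompt

def run_trial(model: str, vignette: str, value_statement: str | None) -> dict:
    """Steps 3-4: query the model, then rate the answer for metric computation."""
    reply = query_model(model, build_prompt(vignette, value_statement))
    return {
        "model": model,
        "recommendation": reply["recommendation"],
        "justification": reply["justification"],
        "aggressiveness": rate_aggressiveness(reply["recommendation"]),  # 1-5 rating
    }
```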

What sets this approach apart is the explicit separation of “acknowledgment” (the model repeats the patient’s value) from “alignment” (the model’s recommendation actually moves in the expected direction). Most prior work conflates the two, reporting high empathy scores while ignoring recommendation drift.
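
That separation can be operationalized in a few lines. In this simplified sketch, keyword matching stands in for whatever acknowledgment detection the authors actually used, and alignment is a numeric check on the recommendation shift:

```python
def acknowledges_value(justification: str, value_keywords: list[str]) -> bool:
    """Acknowledgment: the model restates the patient's value in its justification."""
    text = justification.lower()
    return any(keyword in text for keyword in value_keywords)

def aligns_with_value(baseline: float, valued: float, expected_direction: int) -> bool:
    """Alignment: the recommendation score actually moved the expected way
    (+1 = more aggressive, -1 = more conservative)."""
    return (valued - baseline) * expected_direction > 0

# Acknowledgment without alignment -- the pattern the paper flags:
justification = "I understand you want to return to work quickly, so..."
print(acknowledges_value(justification, ["return to work", "quick"]))   # True
print(aligns_with_value(baseline=3.0, valued=3.0, expected_direction=-1))  # False
```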

Evaluation & Results

The authors ran 104 trials in Phase 1 and an additional 78 trials in Phase 2 after applying mitigation techniques. Key takeaways:

| LLM Family | Default Aggressiveness (1‑5) | Value Sensitivity Index (VSI) | Directional Concordance (DCS) |
| --- | --- | --- | --- |
| GPT‑5.2 | 2.0 | 0.27 | 0.92 |
| Claude 4.5 Sonnet | 2.8 | 0.22 | 0.85 |
| Gemini 3 Pro | 3.5 | 0.13 | 0.78 |
| DeepSeek‑R1 | 2.5 | 0.19 | 0.81 |

All models acknowledged patient values in 100% of non‑control trials, confirming that current prompting reliably elicits empathetic language. However, the VSI values, ranging from 0.13 to 0.27, show that actual recommendation shifts are modest. Directional concordance varied more widely, with GPT‑5.2 achieving near‑perfect alignment (0.92) while Gemini 3 Pro lagged (0.78).

When the decision‑matrix or VIM self‑report mitigations were applied, DCS improved by an average of 0.125 across the 78 Phase 2 trials, demonstrating that lightweight post‑processing can meaningfully close the value sensitivity gap.

These findings provide the empirical backbone needed for the “value disclosure labels” advocated by emerging clinical AI governance frameworks.

Why This Matters for AI Systems and Agents

For developers building AI‑driven clinical assistants, the study delivers three actionable insights:

  • Metric‑driven development: Incorporating VSI and DCS into continuous evaluation pipelines can surface hidden misalignments before deployment (see the threshold‑gate sketch after this list).
  • Governance readiness: The quantitative data can populate AI governance framework value‑disclosure labels, satisfying regulators who demand transparency about how models treat patient preferences.
  • Design of orchestration layers: The decision‑matrix overlay shows that a thin orchestration layer—rather than a full model retraining—can boost alignment, reducing engineering overhead and preserving model performance on other metrics.
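
For the metric‑driven development point above, a regression gate is straightforward to wire into a CI evaluation suite. The thresholds here are illustrative, not values endorsed by the paper:

```python
# Hypothetical regression gate for a CI evaluation pipeline.
MIN_VSI = 0.20   # minimum acceptable recommendation sensitivity (illustrative)
MIN_DCS = 0.85   # minimum fraction of directionally correct shifts (illustrative)

def check_value_alignment(vsi: float, dcs: float) -> None:
    """Fail the build if value alignment regresses below the thresholds."""
    if vsi < MIN_VSI or dcs < MIN_DCS:
        raise AssertionError(
            f"Value-alignment regression: VSI={vsi:.2f} (min {MIN_VSI}), "
            f"DCS={dcs:.2f} (min {MIN_DCS})"
        )

check_value_alignment(vsi=0.27, dcs=0.92)  # passes with GPT-5.2's reported scores
```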

In practice, an LLM‑powered symptom checker could first generate a list of evidence‑based options, then apply a value‑weight matrix derived from the patient’s stated goals (e.g., “avoid injections”). The final recommendation would be both clinically sound and value‑aligned, increasing trust and adherence.
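
A toy version of that value‑weight re‑ranking, with invented option attributes and weights, might look like this:

```python
# Candidate treatments scored on attributes relevant to the patient's goals.
# Options, attribute scores, and weights are invented for illustration.
options = [
    {"name": "corticosteroid injection", "efficacy": 0.9, "avoids_injection": 0.0},
    {"name": "oral NSAID course",        "efficacy": 0.7, "avoids_injection": 1.0},
    {"name": "physical therapy",         "efficacy": 0.6, "avoids_injection": 1.0},
]
# Patient goal "avoid injections" -> heavy weight on the matching attribute.
weights = {"efficacy": 0.4, "avoids_injection": 0.6}

def score(option: dict) -> float:
    """Weighted sum of attribute scores under the patient's value weights."""
    return sum(weights[attr] * option[attr] for attr in weights)

ranked = sorted(options, key=score, reverse=True)
print([o["name"] for o in ranked])
# ['oral NSAID course', 'physical therapy', 'corticosteroid injection']
```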

What Comes Next

While the paper makes a strong case for measuring value sensitivity, several limitations point to fertile ground for future work:

  • Scope of values: The study examined 13 predefined statements. Real‑world conversations involve a far richer, sometimes contradictory set of values that may evolve over time.
  • Domain generalization: Only two clinical specialties were tested. Extending the framework to surgical decision‑making, mental health, or pediatric care could reveal new patterns.
  • Model‑level interventions: The mitigations were post‑hoc. Training LLMs with value‑aware objectives or fine‑tuning on value‑annotated corpora could embed sensitivity more deeply.
  • Human‑in‑the‑loop validation: The current ground‑truth is expert‑crafted. Incorporating real patient feedback would close the loop between algorithmic metrics and lived experience.

Potential applications beyond direct patient care include:

  • Clinical trial recruitment platforms that match participants based on personal health goals.
  • Health insurance decision tools that respect cost‑sensitivity preferences while staying compliant.
  • Medical education simulators that teach trainees how to elicit and honor patient values.

Developers interested in operationalizing these ideas can explore clinical LLM orchestration patterns that embed value‑weight matrices into existing AI pipelines.

Conclusion

The “Value Sensitivity Gap” study shines a light on a subtle but critical blind spot in clinical AI: the difference between sounding empathetic and actually adapting recommendations to patient values. By providing concrete metrics, a reproducible experimental design, and evidence that simple mitigations improve alignment, the paper equips practitioners, regulators, and researchers with tools to make AI‑assisted shared decision‑making genuinely patient‑centered.

As large language models become routine collaborators in healthcare, integrating value‑sensitivity evaluation into every stage—from data collection to post‑deployment monitoring—will be essential for ethical, effective, and trustworthy AI.

Read the full study for a deeper dive: original arXiv paper.

