Updated: June 29, 2026

EHR-Complex: Benchmarking Medical Agents for Complex Clinical Reasoning

Direct Answer

The paper introduces EHR‑Complex, a large‑scale, interactive benchmark that evaluates medical AI agents on realistic, multi‑table clinical reasoning tasks built from the MIMIC‑IV database. It matters because it surfaces the hidden difficulty of longitudinal EHR analysis, revealing that even state‑of‑the‑art language models struggle to achieve reliable, exact‑match performance on real‑world queries.

Background: Why This Problem Is Hard

Electronic health records (EHRs) are the backbone of modern hospital operations, containing millions of time‑stamped observations, lab results, medication orders, and narrative notes. Turning that raw data into actionable insights requires:

Understanding complex relational schemas that span dozens of tables.
Aggregating longitudinal information across hundreds of thousands of rows per patient.
Mapping clinical codes (ICD‑10, LOINC, RxNorm) to human‑readable concepts.
Handling missing, noisy, or contradictory entries that are inevitable in real practice.

Existing benchmarks for clinical AI—such as MedQA, PubMedQA, or static SQL generation tasks—typically present clean, single‑patient snapshots or pre‑crafted queries that avoid the need for iterative interaction. Those simplifications create a research bottleneck: models that look impressive on paper often crumble when deployed against a live EHR sandbox where they must issue a sequence of SQL or Python commands, interpret error messages, and adapt their strategy on the fly.

Consequently, developers lack a rigorous yardstick for measuring an agent’s ability to navigate the full breadth of EHR complexity, and hospitals remain hesitant to trust autonomous agents with critical decision‑support workloads.

What the Researchers Propose

The authors propose EHR‑Complex, a benchmark that reframes clinical reasoning as an interactive database‑execution problem. Its core ideas are:

Real‑world substrate: The benchmark is built on the publicly available MIMIC‑IV dataset, encompassing 365K patients, 31 relational tables, and over 500M records.
Task taxonomy: Approximately 52K tasks are organized into six clinical intents (e.g., cohort identification, risk stratification, medication safety, outcome prediction, longitudinal trend analysis, and population‑level statistics).
Interactive sandbox: Agents do not receive a pre‑written query. Instead, they must iteratively submit SQL statements or Python code to a sandboxed execution engine, receive results or error feedback, and refine their approach.
Complexity metrics: Queries average 31.93 structural components (joins, sub‑queries, aggregations, window functions), mirroring the depth of real clinical analytics.

In essence, EHR‑Complex treats the agent as a database analyst that must understand the schema, locate the right clinical codes, and compose multi‑step queries—exactly the workflow a data‑science team would follow.

How It Works in Practice

The benchmark’s workflow can be visualized as a loop between three components:

Task Generator: Supplies a natural‑language clinical question (e.g., “What is the 30‑day readmission rate for patients with sepsis who received vasopressors within the first 24 hours?”).
Agent Core: Interprets the question, decides whether to use SQL or Python, and emits a code snippet. The agent may be a pure LLM, a retrieval‑augmented model, or a hybrid system that calls external knowledge bases for clinical‑code mappings.
Execution Sandbox: Runs the submitted code against a read‑only copy of MIMIC‑IV, returns the result set or an error message, and logs the interaction for later scoring.

Agents can query the sandbox multiple times, allowing them to:

Validate assumptions (e.g., check column existence).
Iteratively refine joins to reduce duplicate rows.
Perform post‑processing in Python when SQL alone is insufficient.

The following diagram illustrates the loop:

[Figure: EHR-Complex benchmark architecture diagram]

What sets this approach apart from prior static benchmarks is the requirement for compositional reasoning under execution feedback. Agents must not only generate syntactically correct SQL but also understand clinical semantics, such as the proper use of ICD‑10 codes for “sepsis” or the temporal relationship implied by “within the first 24 hours.”
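
To make the temporal point concrete, here is a small sketch contrasting two readings of "within the first 24 hours." The timestamps are invented, and treating the window as starting at admission with an exclusive upper bound is an assumption for illustration; the benchmark's own convention may differ.

```python
from datetime import datetime, timedelta


def within_first_hours(admit: datetime, event: datetime, hours: int = 24) -> bool:
    """True if the event falls in the rolling window [admit, admit + hours)."""
    return admit <= event < admit + timedelta(hours=hours)


def same_calendar_day(admit: datetime, event: datetime) -> bool:
    """The mistaken reading: compare calendar dates only."""
    return admit.date() == event.date()


# Toy timestamps (hypothetical patient admitted late in the evening).
admit = datetime(2150, 1, 1, 22, 0)
vasopressor_start = datetime(2150, 1, 2, 3, 0)  # 5 hours after admission, next date
pre_admit_event = datetime(2150, 1, 1, 9, 0)    # same date, but before admission

print(within_first_hours(admit, vasopressor_start), same_calendar_day(admit, vasopressor_start))
print(within_first_hours(admit, pre_admit_event), same_calendar_day(admit, pre_admit_event))
```

The two readings disagree in both directions: the calendar-day version misses an event five hours after a late-evening admission and wrongly includes one that happened before the patient arrived.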

Evaluation & Results

Using the benchmark, the authors evaluated several leading LLM families (including GPT‑4‑style, Claude‑like, and open‑source alternatives) under zero‑shot prompting, few‑shot exemplars, and retrieval‑augmented variants. Evaluation metrics focused on:

Exact‑match accuracy: Whether the final result set exactly matched the gold‑standard answer.
Pass@k consistency: The proportion of tasks where at least one of the top‑k generated trajectories succeeded.
Failure mode breakdown: Categorizing errors into SQL logic faults, medical‑code lookup failures, and semantic misunderstandings.
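
The first two metrics can be sketched as follows. Both functions are assumptions made for illustration: the summary does not say whether row order counts in the exact-match check, nor which estimator the authors use for Pass@k. The sketch compares result sets ignoring order and uses the common combinatorial estimator for "at least one of k sampled trajectories succeeds."

```python
from collections import Counter
from math import comb


def exact_match(predicted_rows, gold_rows) -> bool:
    """Order-insensitive comparison of two result sets; duplicate rows still count."""
    return Counter(map(tuple, predicted_rows)) == Counter(map(tuple, gold_rows))


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trajectories, drawn without replacement
    from n recorded trajectories of which c succeeded, is a success."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


print(exact_match([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))
print(pass_at_k(n=4, c=1, k=1))
print(pass_at_k(n=10, c=3, k=4))
```

Using a multiset comparison rather than a plain set means a query that returns duplicated rows is not scored as correct, which matters for the join errors discussed later.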

Key findings include:

The best‑performing model achieved only 62.3% exact‑match accuracy, far below the >90% scores reported on simpler benchmarks.
Pass@4 fell under 50% for almost every model, indicating high stochastic fragility across repeated query attempts.
Analysis of 3,800 failed trajectories revealed three dominant error clusters:
- SQL logic errors (e.g., missing GROUP BY, incorrect join conditions) accounted for ~45%.
- Medical‑code lookup failures (e.g., using the wrong ICD‑10 identifier) made up ~30%.
- Semantic misunderstandings (e.g., misinterpreting “first 24 hours” as a calendar day) comprised the remaining ~25%.

These results demonstrate that current LLMs, even with sophisticated prompting, lack robust, end‑to‑end reasoning capabilities for longitudinal EHR analytics. The benchmark surfaces gaps that were invisible in prior static evaluations.
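
As one illustration of the SQL logic category, the sketch below shows how a one-to-many join can silently inflate a count. The tables and rows are toy data invented for this example, not a failure case taken from the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE admissions (subject_id INTEGER, hadm_id INTEGER);
    CREATE TABLE prescriptions (hadm_id INTEGER, drug TEXT);
    INSERT INTO admissions VALUES (1, 100), (2, 200);
    INSERT INTO prescriptions VALUES
        (100, 'norepinephrine'),
        (100, 'vasopressin'),
        (200, 'norepinephrine');
    """
)

# Naive count: each admission appears once per matching prescription row.
naive = conn.execute(
    """
    SELECT COUNT(*)
    FROM admissions a
    JOIN prescriptions p ON p.hadm_id = a.hadm_id
    """
).fetchone()[0]

# Corrected count: collapse the fan-out by counting distinct admissions.
fixed = conn.execute(
    """
    SELECT COUNT(DISTINCT a.hadm_id)
    FROM admissions a
    JOIN prescriptions p ON p.hadm_id = a.hadm_id
    """
).fetchone()[0]

print(naive, fixed)
```

Both queries run without error, so execution feedback alone does not expose the mistake. An agent has to check row counts against its expectations to catch it.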

Why This Matters for AI Systems and Agents

For practitioners building medical AI agents, EHR‑Complex provides a realistic stress test that mirrors production workloads. The low pass rates signal that:

Agents must incorporate domain‑specific retrieval (e.g., code dictionaries) rather than relying solely on generic language modeling.
Robust error‑handling loops are essential; agents should treat sandbox feedback as a first‑class signal for query refinement.
Evaluation pipelines need to move beyond single‑shot accuracy toward trajectory‑level metrics that capture consistency across multiple attempts.
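
The first recommendation, domain-specific retrieval, can be sketched as a lookup against a code dictionary before any cohort query is written. The `code_dictionary` table, its three rows, and the `lookup_codes` helper are hypothetical and exist only for this example. A real agent would query whatever dictionary tables the target database provides.

```python
import sqlite3


def build_code_dictionary() -> sqlite3.Connection:
    """Toy ICD dictionary with a few illustrative entries (codes stored without dots)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE code_dictionary (icd_code TEXT, icd_version INTEGER, long_title TEXT)"
    )
    conn.executemany(
        "INSERT INTO code_dictionary VALUES (?, ?, ?)",
        [
            ("A419", 10, "Sepsis, unspecified organism"),
            ("R6521", 10, "Severe sepsis with septic shock"),
            ("I10", 10, "Essential (primary) hypertension"),
        ],
    )
    return conn


def lookup_codes(conn: sqlite3.Connection, term: str, version: int = 10) -> list:
    """Return codes whose title contains the term (SQLite LIKE ignores ASCII case)."""
    rows = conn.execute(
        "SELECT icd_code FROM code_dictionary "
        "WHERE icd_version = ? AND long_title LIKE ? "
        "ORDER BY icd_code",
        (version, f"%{term}%"),
    ).fetchall()
    return [code for (code,) in rows]


codes = lookup_codes(build_code_dictionary(), "sepsis")
print(codes)
```

Grounding the code list in a lookup, rather than letting the model recall identifiers from memory, targets the medical-code failure cluster directly. Substring matching is a deliberately crude retrieval step, and a production agent would likely need synonym handling as well.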

These insights directly influence the design of next‑generation clinical decision‑support platforms. For example, integrating the benchmark into a development workflow can help teams iterate on prompt engineering, tool‑use policies, and retrieval mechanisms before deploying to live EHR systems.

Developers can also leverage existing UBOS capabilities to accelerate agent prototyping:

Explore the UBOS platform overview for a modular environment that supports custom tool integration.
Consider AI marketing agents as a template for orchestrating multi‑step workflows with built‑in error handling.
Use the Workflow automation studio to design repeatable query‑generation pipelines that can be tested against EHR‑Complex.

What Comes Next

While EHR‑Complex marks a significant step forward, the authors acknowledge several limitations:

The benchmark currently relies on a single data source (MIMIC‑IV), which, although extensive, may not capture institution‑specific schema variations.
Only SQL and Python are supported as execution languages; other analytics tools (R, SAS, Spark) remain unexplored.
Ground‑truth answers are derived from deterministic queries, which may not reflect the ambiguity present in real clinical decision‑making.

Future research directions include:

Expanding to multi‑institution datasets to test cross‑schema generalization.
Incorporating probabilistic reasoning tasks, such as risk‑score estimation with uncertainty quantification.
Developing benchmark extensions that evaluate privacy‑preserving query generation (e.g., differential privacy constraints).

Practitioners interested in contributing to the next iteration of the benchmark can start by experimenting with UBOS tools that simplify data‑pipeline construction:

Startups can prototype quickly using the UBOS for startups offering, which includes pre‑configured connectors for common health‑data warehouses.
Enterprises seeking a scalable, secure deployment can explore the Enterprise AI platform by UBOS, which supports role‑based access and audit logging—critical for compliance in healthcare.
Open‑source enthusiasts may experiment with the Openclaw (Clawdbot, MoltBot) suite to build custom agents that can be benchmarked against EHR‑Complex.

Finally, the community is encouraged to read the full study for detailed methodology and to download the benchmark artifacts. The original paper is available at EHR‑Complex paper.

Conclusion

EHR‑Complex shines a light on the hidden difficulty of interactive clinical reasoning over massive, longitudinal EHR datasets. By demanding multi‑step SQL/Python interaction, it reveals that even the most advanced language models fall short of reliable performance, exposing three primary failure modes that developers must address. The benchmark therefore serves as a crucial yardstick for the next generation of medical AI agents, guiding research toward more robust retrieval, error‑aware execution loops, and domain‑specific knowledge integration. As the healthcare industry pushes toward AI‑augmented decision support, tools like UBOS can help bridge the gap between academic evaluation and production‑grade deployment, ensuring that future agents are both technically sound and clinically trustworthy.


Andrii Bidochko

CTO UBOS

Andrii Bidochko is an AI entrepreneur and researcher focused on AI agents, reinforcement learning, and autonomous systems. He writes about the technologies shaping the future of machine intelligence, from frontier models and agent architectures to real-world AI applications.
