- Updated: March 11, 2026
CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework
Direct Answer
The paper introduces CARE (Clinical Accountability in multi‑modal medical Reasoning with an Evidence‑grounded agentic framework), a modular system that separates visual grounding, entity extraction, and reasoning into dedicated agents and then orchestrates them with a planner‑reviewer loop. By grounding answers in pixel‑level evidence, CARE improves diagnostic accuracy and offers a transparent audit trail that aligns with clinicians’ evidence‑based workflows.
Background: Why This Problem Is Hard
Medical imaging AI has made rapid strides, yet most large vision-language models (VLMs) operate as monolithic black boxes. In practice, radiologists follow a staged process: they first identify regions of interest (ROIs), then interpret those findings against clinical knowledge, and finally document a diagnosis with supporting evidence. End-to-end VLMs skip these intermediate steps, which creates two critical issues:
- Shortcut learning: Models may latch onto spurious correlations in the training data, producing plausible‑sounding answers that lack a genuine visual basis.
- Hallucination and lack of accountability: Without explicit evidence, it is impossible to verify whether a model’s answer truly reflects the image content, undermining trust in high‑stakes clinical settings.
Existing attempts to inject interpretability—such as attention maps or post-hoc explanations—are often unreliable because they are generated by the same model that produces the answer. Moreover, medical VQA benchmarks reveal a persistent gap between raw accuracy and clinically acceptable performance, highlighting the need for a framework that mirrors the clinician's workflow.
What the Researchers Propose
CARE reframes multi‑modal medical reasoning as an agentic pipeline rather than a single neural pass. The architecture consists of three specialized components:
- Entity Proposer (compact VLM): Scans the full image and proposes a concise list of medically relevant entities (e.g., “pulmonary nodule,” “edema”). This step reduces the search space for downstream modules.
- Evidence Generator (entity‑referring segmentation model): Takes each proposed entity and produces a pixel‑accurate mask that localizes the ROI. Because this model is trained on segmentation datasets, its masks serve as verifiable evidence.
- Grounded Reasoner (augmented VLM): Receives the original image together with the ROI masks and generates a diagnostic answer, explicitly conditioned on the visual evidence.
A fourth layer, the Coordinator, plans which tools to invoke, monitors evidence‑answer consistency, and performs a final verification before releasing the response. Reinforcement learning with verifiable rewards aligns the entire pipeline toward answers that can be directly traced back to the generated masks.
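The division of labor above can be sketched as three stubbed components with narrow interfaces. The class and method names here are illustrative assumptions, not the paper's actual API, and the "mask" is a bounding-box stand-in for a real pixel-level segmentation:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    """A verifiable artifact: an entity plus its localization.
    A real system would carry a pixel mask; a box stands in here."""
    entity: str
    roi: tuple  # (x0, y0, x1, y1) placeholder for a pixel-level mask

class EntityProposer:
    """Compact-VLM stub: scans the image, proposes relevant entities."""
    def propose(self, image, question: str) -> list:
        return ["femoral shaft", "fracture line"]  # hard-coded stub output

class EvidenceGenerator:
    """Entity-referring segmentation stub: one mask per proposed entity."""
    def segment(self, image, entity: str) -> Evidence:
        return Evidence(entity=entity, roi=(0, 0, 64, 64))  # stub ROI

class GroundedReasoner:
    """Augmented-VLM stub: answers conditioned on image plus evidence."""
    def answer(self, image, question: str, evidence: list) -> str:
        cited = ", ".join(e.entity for e in evidence)
        return f"Finding grounded in: {cited}"
```

Because each capability sits behind its own interface, any one component can be swapped (say, a stronger segmenter) without retraining the others.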
How It Works in Practice
The CARE workflow can be visualized as a step‑by‑step dialogue between agents:
- Question intake: A clinician submits a VQA query (e.g., “Is there a fracture in the left femur?”) along with the radiograph.
- Planning: The Coordinator decides to call the Entity Proposer first, based on the question type.
- Entity proposal: The compact VLM returns a shortlist such as “femoral shaft,” “fracture line.”
- Evidence generation: For each entity, the segmentation model produces a mask highlighting the suspected fracture region.
- Reasoning: The Grounded Reasoner receives the original image plus the masks, and formulates an answer like “A non‑displaced transverse fracture is present in the mid‑shaft of the left femur.”
- Review: The Coordinator cross‑checks that the answer references the provided masks; if a mismatch is detected, it can request a re‑run or flag uncertainty.
- Delivery: The final answer, together with the visual evidence (mask overlay), is returned to the clinician.
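The dialogue above amounts to a control loop. The sketch below is a minimal illustration under assumed interfaces (the injected callables, retry policy, and return shape are not the paper's actual design):

```python
def care_answer(image, question, propose, segment, reason, review, max_retries=2):
    """Illustrative CARE-style loop: propose entities, ground them as
    masks, reason over the evidence, then review before release."""
    for _ in range(max_retries + 1):
        entities = propose(image, question)               # entity proposal
        evidence = [segment(image, e) for e in entities]  # evidence generation
        answer = reason(image, question, evidence)        # grounded reasoning
        if review(answer, evidence):                      # consistency review
            return {"answer": answer, "evidence": evidence}
    # Repeated mismatch: flag uncertainty rather than guess.
    return {"answer": answer, "evidence": evidence, "flag": "uncertain"}

# Toy stand-ins for the three agents:
result = care_answer(
    image=None,
    question="Is there a fracture?",
    propose=lambda img, q: ["fracture line"],
    segment=lambda img, e: {"entity": e, "roi": (10, 10, 40, 40)},
    reason=lambda img, q, ev: f"Fracture suspected near {ev[0]['entity']}",
    review=lambda ans, ev: ev[0]["entity"] in ans,
)
```

The key design choice is that the review gate sits between reasoning and delivery: an answer that does not reference its evidence never reaches the clinician unflagged.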
This decomposition yields two practical advantages:
- Reduced hallucination: Because the reasoner must attend to explicit masks, it is far less likely to fabricate findings that have no visual basis.
- Auditability: Every answer is accompanied by a concrete ROI, enabling clinicians to verify and, if needed, contest the AI’s conclusion.
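Auditability is easiest when the answer and its evidence travel together as one structured record. A hypothetical serialization (the field names are assumptions, not a format from the paper):

```python
import json

def audit_record(question, answer, evidence):
    """Hypothetical audit-trail entry pairing a diagnostic answer with
    its ROI evidence, so each claim traces back to an image region."""
    return json.dumps({
        "question": question,
        "answer": answer,
        "evidence": [{"entity": e["entity"], "roi": e["roi"]} for e in evidence],
    }, indent=2)
```

A record like this can be stored alongside the mask overlay, giving clinicians a concrete artifact to verify or contest.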

Evaluation & Results
To validate CARE, the authors benchmarked the system on two widely used medical VQA datasets: VQA-RAD and PathVQA. They compared three configurations:
- Baseline VLM: A single 10‑billion‑parameter model trained end‑to‑end.
- CARE‑Flow: The modular pipeline without the Coordinator (i.e., static tool ordering).
- CARE‑Coord: The full agentic system with dynamic planning and answer review.
Key findings include:
- CARE‑Flow achieved a 10.9 % absolute improvement in average accuracy over the baseline VLM of the same size, demonstrating that decoupling grounding from reasoning yields measurable gains.
- CARE‑Coord added another 5.2 % boost, surpassing even heavily pre‑trained state‑of‑the‑art models that rely on massive data scaling.
- Qualitative analysis showed a dramatic reduction in hallucinated answers: over 80 % of incorrect baseline responses were corrected when evidence masks were enforced.
- Human radiologists reported higher trust scores for CARE outputs because the visual evidence matched their own inspection patterns.
These results indicate that an evidence‑grounded, agentic architecture not only raises raw performance metrics but also aligns AI behavior with clinical expectations of transparency and accountability.
Why This Matters for AI Systems and Agents
From a systems‑engineering perspective, CARE illustrates a blueprint for building trustworthy AI assistants in regulated domains:
- Modular orchestration: By treating each capability (entity detection, segmentation, reasoning) as a distinct service, developers can swap or upgrade components without retraining the entire pipeline.
- Reinforcement‑learning alignment: Verifiable rewards tied to evidence consistency provide a concrete objective for fine‑tuning, a practice that can be generalized to other safety‑critical tasks.
- Agentic control loops: The Coordinator’s planning and review stages resemble the “chain‑of‑thought” prompting used in large language models, but with explicit tool calls, making the process more deterministic and auditable.
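A verifiable reward tied to evidence consistency could, for instance, combine an answer-correctness term with a mask-overlap term. This is a sketch of one plausible design, not the paper's exact formulation; the weighting and the IoU choice are assumptions:

```python
def verifiable_reward(pred_answer, gold_answer, pred_mask, gold_mask, alpha=0.5):
    """Illustrative composite reward: exact-match answer term plus a
    mask-IoU evidence term. Masks are sets of (row, col) pixel coords."""
    answer_term = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    inter = len(pred_mask & gold_mask)
    union = len(pred_mask | gold_mask)
    iou = inter / union if union else 0.0  # empty masks score zero overlap
    return alpha * answer_term + (1 - alpha) * iou
```

Because both terms are computed from checkable artifacts (the answer string and the mask), the reward itself is auditable, which is the property that makes RL alignment of this kind transferable to other safety-critical tasks.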
Practitioners building multi‑modal agents can adopt CARE’s pattern to:
- Reduce reliance on monolithic models that are difficult to interpret.
- Introduce evidence‑generation modules that produce verifiable artifacts (e.g., masks, heatmaps, structured reports).
- Implement a supervisory agent that enforces consistency checks before exposing results to end‑users.
These design principles are directly applicable to emerging platforms for AI‑driven diagnostics, triage bots, and decision‑support systems. For teams looking to prototype such pipelines, the ubos.tech agents framework offers reusable building blocks for tool orchestration.
What Comes Next
While CARE marks a significant step toward accountable medical AI, several open challenges remain:
- Scalability of segmentation models: High‑resolution CT or MRI scans demand memory‑efficient segmentation, prompting research into lightweight, patch‑based evidence generators.
- Generalization across modalities: Extending the framework to histopathology, ultrasound, or multimodal reports will require modality‑specific entity proposers and evidence types.
- Human‑in‑the‑loop refinement: Integrating real‑time clinician feedback to correct or refine masks could further reduce error rates and improve model calibration.
- Regulatory validation: Formal verification of the evidence‑answer chain will be essential for FDA or CE approval pathways.
Future research may explore:
- Learning to propose entities directly from textual queries using cross‑modal attention, reducing the need for a separate proposer.
- Hierarchical planning where the Coordinator can invoke multiple evidence generators (e.g., lab values, prior reports) in addition to imaging.
- Open‑source tool registries that standardize evidence formats, enabling plug‑and‑play interoperability across institutions.
Developers interested in building such orchestrated pipelines can experiment with the ubos.tech orchestration layer, which supports dynamic tool selection, stateful planning, and result verification out of the box.
For a deeper dive into the methodology and to access the full technical details, see the original arXiv paper.
As AI continues to permeate clinical decision‑making, frameworks like CARE provide a pragmatic path toward systems that are not only accurate but also explainable, auditable, and aligned with the rigorous standards of medical practice.