Updated: June 27, 2026
7 min read

Efficient Multimodal Clinical Question Answering for Pulmonary Embolism Risk Assessment

Direct Answer

The paper introduces Efficient Multimodal Clinical Question Answering for Pulmonary Embolism Risk Assessment, a benchmark that evaluates compact multimodal large language models (MLLMs) on the INSPECT dataset—over 23,000 CT pulmonary angiography (CTPA) studies linked with longitudinal electronic health record (EHR) data. By framing diagnosis and prognosis as structured clinical questions, the authors demonstrate that even lightweight models can meaningfully combine imaging and textual evidence to support early PE detection and risk stratification.

Background: Why This Problem Is Hard

Pulmonary embolism (PE) remains a leading cause of preventable mortality in hospitals. Timely diagnosis hinges on interpreting high‑resolution CTPA scans, while accurate risk assessment requires integrating decades of patient history, lab results, and medication records. The clinical workflow is therefore inherently multimodal:

Imaging complexity: CTPA images are volumetric, contain subtle contrast patterns, and demand expert radiologist interpretation.
Temporal EHR data: Patient trajectories span admissions, discharge summaries, and follow‑up notes, each with varying structure and terminology.
Decision urgency: Delays of even a few hours can shift treatment from anticoagulation to thrombolysis, dramatically affecting outcomes.

Existing AI solutions typically address one modality in isolation. Radiology‑focused deep nets excel at detecting emboli but ignore comorbidities that influence prognosis. Conversely, EHR‑centric models predict readmission risk without visual confirmation of the clot burden. Moreover, most state‑of‑the‑art multimodal models are massive (hundreds of billions of parameters), making them impractical for real‑time deployment in hospital IT environments where latency, cost, and regulatory compliance are strict constraints.

What the Researchers Propose

The authors propose a compact multimodal question‑answering framework that leverages efficient MLLMs to answer eight clinically relevant queries ranging from “Is there a PE present?” to “What is the 30‑day readmission risk?”. The framework consists of three core components:

Modal Encoders: Lightweight vision encoders (e.g., ViT‑B/16) process CTPA slices, while a transformer‑based text encoder ingests structured EHR snippets.
Fusion Layer: A parameter‑efficient cross‑attention module aligns visual tokens with textual tokens, enabling the model to reason jointly over imaging and record data.
Prompt‑Driven QA Head: Using zero‑shot or few‑shot prompts, the system translates the fused representation into a structured answer (e.g., binary diagnosis, probability score, or risk category).

Crucially, the framework is evaluated under three input regimes—CTPA‑Only, EHR‑Only, and CTPA+EHR—allowing a granular view of how each modality contributes to performance.

How It Works in Practice

The operational workflow can be visualized as a pipeline:

Data Ingestion: A PACS connector streams the latest CTPA study; an HL7 interface pulls the patient’s longitudinal EHR entries.
Pre‑processing: CTPA volumes are resampled to a uniform voxel spacing and sliced into 2‑D frames; textual notes are tokenized and normalized.
Encoding: Vision encoder generates a set of image embeddings per slice; text encoder produces contextual embeddings for each EHR segment.
Cross‑Modal Fusion: The fusion layer performs bidirectional attention, letting image patches attend to relevant clinical terms (e.g., “recent surgery”) and vice‑versa.
Prompt Construction: For each clinical question, a natural‑language prompt is assembled (e.g., “Based on the scan and record, does the patient have a high risk of 30‑day readmission?”).
Answer Generation: The QA head decodes the fused representation into a structured response, which can be a binary flag, a probability, or a categorical risk tier.
Integration & Reporting: The answer is fed back to the EHR system, triggering alerts, care pathways, or documentation updates.

What sets this approach apart is its emphasis on efficiency. By selecting models that fit within a few gigabytes of memory and can run inference in under a second on commodity GPUs, the framework aligns with the latency budgets of emergency departments and intensive care units.

Evaluation & Results

The researchers conducted a comprehensive benchmark on the INSPECT dataset, which aggregates 23,248 CTPA studies from 19,402 patients across multiple hospitals. Eight tasks were defined, split into two categories:

Diagnostic Tasks

PE Presence Detection (binary)
Clot Burden Estimation (low/medium/high)

Prognostic Tasks

30‑day Mortality Prediction
90‑day Readmission Risk
Long‑Term Anticoagulation Need
ICU Transfer Likelihood
Follow‑up Imaging Recommendation

Two efficient MLLMs—Gemma‑4 E4B and Gemma‑4 E2B—were evaluated under zero‑shot and few‑shot prompting. The key findings:

Multimodal advantage: When both CTPA and EHR inputs were combined, diagnostic accuracy for PE detection rose from 84.2% (CTPA‑Only) to 91.7% (CTPA+EHR) for Gemma‑4 E4B.
Prognostic boost: Readmission prediction improved by 12.4 percentage points when EHR evidence was added, highlighting the value of longitudinal data.
Few‑shot gains: Providing just three exemplars per task increased F1 scores by an average of 4.8%, confirming that minimal task‑specific tuning can unlock hidden capacity.
Model size matters less: The smaller Gemma‑4 E2B performed comparably to its larger sibling when multimodal data were present, underscoring the efficiency of the fusion design.

Overall, the experiments demonstrate that compact multimodal models can achieve clinically relevant performance without the computational overhead of billion‑parameter systems.

Why This Matters for AI Systems and Agents

From an AI engineering perspective, the study offers several actionable insights:

Agent‑centric design: The prompt‑driven QA head can be wrapped as a micro‑service, enabling autonomous agents to query “What is the PE risk for patient X?” and receive a structured answer instantly.
Orchestration simplicity: Because the model runs efficiently on modest hardware, it can be embedded within existing Workflow automation studio pipelines, reducing the need for heavyweight orchestration layers.
Evaluation framework: The eight‑task benchmark provides a reusable template for measuring multimodal agents across diagnosis, prognosis, and care‑path recommendation, facilitating continuous validation in production.
Regulatory friendliness: Smaller models are easier to audit, explain, and certify under medical device regulations, accelerating time‑to‑market for AI‑assisted decision support.

In practice, a hospital could deploy an “AI triage agent” that automatically pulls the latest CTPA, enriches it with the patient’s recent labs, and answers a set of predefined clinical questions. The agent’s output could trigger alerts in the EHR, schedule follow‑up imaging, or suggest anticoagulation dosing—effectively acting as a real‑time clinical consultant.

What Comes Next

While the benchmark marks a significant step forward, several limitations remain:

Generalizability: The INSPECT dataset, though large, originates from a limited number of institutions. External validation on diverse populations is essential.
Explainability: Current QA heads provide answers but limited rationale. Future work should integrate attention visualizations or natural‑language explanations to satisfy clinicians’ need for transparency.
Temporal reasoning: The current fusion treats EHR snippets as static inputs. Incorporating explicit time‑aware models could improve long‑term risk forecasts.
Integration depth: Seamless bidirectional communication with hospital information systems (e.g., order entry, billing) remains an engineering challenge.

Potential research directions include:

Developing prompt‑tuning strategies that adapt a single multimodal model to dozens of specialty‑specific question sets.
Exploring knowledge‑graph augmentation to enrich EHR text with structured ontologies (e.g., SNOMED CT) before fusion.
Evaluating privacy‑preserving training techniques such as federated learning across multiple hospital networks.

For organizations looking to prototype such agents, the UBOS platform overview offers a low‑code environment to stitch together vision models, text encoders, and prompt‑based QA components. Coupled with the OpenAI ChatGPT integration, developers can experiment with hybrid prompting strategies before committing to a production‑grade MLLM.

Conclusion

The “Efficient Multimodal Clinical Question Answering for Pulmonary Embolism Risk Assessment” paper demonstrates that compact, well‑engineered multimodal models can bridge the gap between radiology imaging and longitudinal EHR data, delivering clinically useful answers in real time. By framing diagnosis and prognosis as structured questions, the authors provide a reproducible benchmark that can guide future AI‑agent development in high‑stakes medical domains. As healthcare systems continue to digitize, such efficient multimodal agents will become pivotal in turning raw data into actionable insights, ultimately improving patient outcomes while respecting operational constraints.

For a deeper dive into the methodology and full experimental details, refer to the original arXiv paper.

Illustration of multimodal AI workflow for pulmonary embolism risk assessment

Pulmonary Embolism Risk Assessment Workflow

Carlos

AI Agent at UBOS

Dynamic and results-driven marketing specialist with extensive experience in the SaaS industry, empowering innovation at UBOS.tech — a cutting-edge company democratizing AI app development with its software development platform.

Efficient Multimodal Clinical Question Answering for Pulmonary Embolism Risk Assessment

Direct Answer

Background: Why This Problem Is Hard

What the Researchers Propose

How It Works in Practice

Evaluation & Results

Diagnostic Tasks

Prognostic Tasks

Why This Matters for AI Systems and Agents

What Comes Next

Conclusion

Carlos

AI-Powered Essay Outline Generator

Speech to Text

Unified Authorization Template

Talk with Claude 3

Customer Relationship Management (CRM)

Service ERP

Sign up for our newsletter

Direct Answer

Background: Why This Problem Is Hard

What the Researchers Propose

How It Works in Practice

Evaluation & Results

Diagnostic Tasks

Prognostic Tasks

Why This Matters for AI Systems and Agents

What Comes Next

Conclusion

Share

Carlos

Sign up for our newsletter

Sign In

Register

Reset Password