- Updated: March 11, 2026
- 2 min read
Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring
Abstract: Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring (RTM) time series, but their ability to faithfully capture clinically significant events remains uncertain. In this article we present an event‑based evaluation framework for multimodal clinical time‑series summarization, built on the Technology‑Integrated Health Management (TIHM‑1.5) dementia monitoring dataset. Our study benchmarks three approaches – zero‑shot prompting, statistical prompting, and a vision‑based pipeline – and reveals a striking gap between conventional semantic‑similarity metrics and true clinical event fidelity.
Read more about our methodology and results on the UBOS blog and explore related resources at UBOS resources.
Why Event‑Based Evaluation Matters
Traditional evaluation of LLM‑generated summaries focuses on semantic similarity and linguistic quality, overlooking whether the summary correctly reports events such as sustained abnormalities. Our framework extracts daily clinical events using rule‑based abnormal thresholds and temporal persistence criteria, then aligns model‑generated narratives with these structured facts.
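The extraction step can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the threshold values, the two-day persistence criterion, and the `Event` fields are assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    measure: str     # e.g. "heart_rate"
    start_day: int   # index of the first abnormal day
    duration: int    # number of consecutive abnormal days

def extract_events(values, low, high, min_persist=2, measure="heart_rate"):
    """Flag runs of consecutive out-of-range daily values that persist
    for at least `min_persist` days (the temporal persistence criterion)."""
    events, run_start = [], None
    for day, v in enumerate(values):
        abnormal = v < low or v > high
        if abnormal and run_start is None:
            run_start = day                      # a new abnormal run begins
        elif not abnormal and run_start is not None:
            if day - run_start >= min_persist:   # run ended; keep if long enough
                events.append(Event(measure, run_start, day - run_start))
            run_start = None
    # close a run that extends to the end of the window
    if run_start is not None and len(values) - run_start >= min_persist:
        events.append(Event(measure, run_start, len(values) - run_start))
    return events
```

For a week of heart-rate readings `[70, 70, 120, 125, 130, 72, 72]` with a normal range of 60-100, this yields a single event starting on day 2 with a duration of 3 days; the model-generated narrative is then checked against such structured facts.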
Key Evaluation Metrics
- Abnormality Recall – proportion of true abnormal events captured.
- Duration Recall – accuracy of reported event durations.
- Measurement Coverage – completeness of reported vital signs.
- Hallucinated Event Mentions – false events introduced by the model.
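Three of these metrics can be computed once gold and model-mentioned events are represented as `(measure, start_day, duration)` tuples. The matching rule below (events align on measure and start day, with duration checked separately) is an assumption for illustration; the paper's exact alignment procedure may differ.

```python
def score_summary(gold_events, mentioned_events):
    """Score a summary's event fidelity against gold events.
    Both arguments are iterables of (measure, start_day, duration) tuples."""
    gold_keys = {(m, s) for m, s, _ in gold_events}
    ment_keys = {(m, s) for m, s, _ in mentioned_events}
    matched = gold_keys & ment_keys

    # Abnormality recall: fraction of true abnormal events the summary mentions.
    abnormality_recall = len(matched) / len(gold_keys) if gold_keys else 1.0

    # Duration recall: among matched events, fraction reported with the right duration.
    gold_dur = {(m, s): d for m, s, d in gold_events}
    ment_dur = {(m, s): d for m, s, d in mentioned_events}
    correct = sum(1 for k in matched if gold_dur[k] == ment_dur[k])
    duration_recall = correct / len(matched) if matched else 0.0

    # Hallucinated event mentions: events in the summary with no gold counterpart.
    hallucinated = len(ment_keys - gold_keys)
    return abnormality_recall, duration_recall, hallucinated
```

Measurement coverage would be computed analogously over the set of vital signs the summary reports versus those present in the monitoring data.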
Results Overview
Our experiments show that models achieving high semantic similarity scores often have near‑zero abnormality recall. In contrast, the vision‑based pipeline, which renders time‑series visualizations for the LLM, attains the strongest event alignment with 45.7 % abnormality recall and 100 % duration recall.
Implications for Clinical AI
The findings underscore the necessity of event‑aware metrics to ensure reliable clinical summarization. Deploying LLMs in remote monitoring without such safeguards could lead to missed or misrepresented clinical events, compromising patient safety.
For a visual representation of the event‑based evaluation framework, see the diagram below:

Future Directions
We plan to extend the framework to other clinical domains, integrate real‑time alerting, and explore hybrid models that combine textual and visual reasoning.
Stay tuned for more updates on our UBOS blog.