- Updated: January 24, 2026
- 5 min read
Uncovering Latent Bias in LLM-Based Emergency Department Triage Through Proxy Variables
Direct Answer
The paper introduces a systematic framework for detecting and quantifying demographic bias in large language model (LLM)-driven emergency department (ED) triage systems, showing that subtle token‑level influences can skew severity assessments across age, gender, and ethnicity groups. This matters because biased triage decisions can directly affect patient outcomes, resource allocation, and legal compliance in high‑stakes clinical environments.
Background: Why This Problem Is Hard
Emergency departments rely on rapid, accurate triage to prioritize care. Recent deployments of LLM‑based decision‑support tools promise to automate note‑taking, symptom extraction, and severity scoring, but they inherit the opaque statistical patterns of their training data. The core challenges are:
- Hidden proxy variables: Demographic signals often appear indirectly (e.g., “retired” implying age, “Spanish‑speaking” implying ethnicity), allowing the model to infer protected attributes without explicit mention.
- Distribution shift: Training corpora are typically sourced from general‑purpose internet text, not from the nuanced, high‑acuity language of ED encounters.
- Evaluation scarcity: Standard performance metrics (accuracy, F1) do not capture systematic over‑ or under‑triage for specific groups, making bias invisible in routine validation.
Existing bias‑detection methods—such as counterfactual data augmentation or fairness‑aware loss functions—assume access to ground‑truth protected attributes and often require re‑training, which is impractical for proprietary hospital LLM deployments that treat models as black boxes.
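The evaluation-scarcity point is easy to see with a toy numeric sketch (invented numbers, not from the paper): two groups can have identical aggregate accuracy while one is systematically over‑triaged, because errors that cancel in one group all point the same way in the other.

```python
# Toy, invented numbers: two demographic groups with identical overall
# accuracy but very different directional error (over-triage).
true_acuity = {"group_a": [3, 3, 2, 4], "group_b": [3, 3, 2, 4]}
pred_acuity = {"group_a": [2, 4, 2, 4],   # errors cancel out
               "group_b": [4, 4, 2, 4]}   # errors all push acuity upward

for g in true_acuity:
    pairs = list(zip(true_acuity[g], pred_acuity[g]))
    acc = sum(t == p for t, p in pairs) / len(pairs)
    signed = sum(p - t for t, p in pairs) / len(pairs)
    print(f"{g}: accuracy={acc:.2f}, mean signed error={signed:+.2f}")
```

Both groups score 0.50 accuracy, yet group_b's mean signed error of +0.50 acuity levels reveals systematic over‑triage that the accuracy metric never surfaces.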
What the Researchers Propose
The authors present Bias‑Aware Triage Auditing (BATA), a three‑stage, model‑agnostic pipeline that isolates and measures the impact of demographic proxies on triage severity scores:
- Proxy Identification: Using a curated lexicon of demographic‑related tokens (e.g., “elderly”, “pregnant”, “non‑English”), the system scans triage narratives to flag potential proxy occurrences.
- Counterfactual Perturbation: For each flagged token, the pipeline generates a minimally edited version of the clinical note where the proxy is swapped with a neutral alternative (e.g., “elderly” → “patient”).
- Impact Quantification: The original and perturbed notes are fed to the LLM triage model; the difference in predicted acuity levels is recorded as the proxy’s bias contribution.
Key components include a Token‑Scanner (lexicon‑based), a Text‑Perturber (rule‑based or neural paraphraser), and a Bias‑Metric Engine that aggregates per‑token effects into group‑level disparity scores.
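The three stages can be sketched roughly as follows. The lexicon, the swap map, and the `triage_model` callable are hypothetical stand‑ins: the paper's actual lexicon is curated and far larger, and its perturber may be a neural paraphraser rather than a regex substitution.

```python
import re

# Hypothetical proxy lexicon with neutral swaps; illustrative only.
PROXY_SWAPS = {"elderly": "patient", "pregnant": "patient"}

def scan_proxies(note):
    """Stage 1 (Token-Scanner): flag demographic-proxy tokens in the note."""
    return [p for p in PROXY_SWAPS
            if re.search(rf"\b{re.escape(p)}\b", note, re.IGNORECASE)]

def perturb(note, proxy):
    """Stage 2 (Text-Perturber): minimally edit the note, swapping the
    proxy for a neutral alternative."""
    return re.sub(rf"\b{re.escape(proxy)}\b", PROXY_SWAPS[proxy],
                  note, flags=re.IGNORECASE)

def bias_contribution(note, triage_model):
    """Stage 3 (Bias-Metric Engine, per-note): difference in predicted
    acuity (original minus neutralized) attributable to each proxy.
    triage_model maps a note string to a numeric acuity score."""
    base = triage_model(note)
    return {p: base - triage_model(perturb(note, p)) for p in scan_proxies(note)}
```

With a stub model that scores 4.0 when "elderly" appears and 3.5 otherwise, `bias_contribution("Elderly male with chest pain", stub)` returns `{"elderly": 0.5}`, i.e. the proxy alone lifts the acuity by half a level.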
How It Works in Practice
Figure 1 depicts a typical workflow for integrating BATA into an ED’s AI stack:
- Incoming Triage Note: A nurse dictation is transcribed by an automatic speech recognizer and passed to the LLM for severity prediction.
- Real‑time Scanning: The Token‑Scanner parses the note, flags any demographic proxies, and logs their positions.
- On‑Demand Perturbation: Before the LLM finalizes its prediction, the Text‑Perturber creates a parallel “neutralized” note.
- Dual Inference: Both the original and neutralized notes are evaluated by the same LLM instance, producing two acuity scores.
- Bias Reporting: The Bias‑Metric Engine computes the delta, updates a dashboard, and, if the delta exceeds a pre‑set threshold, triggers an alert for clinical review.
What sets this approach apart is its post‑hoc nature: it does not require model retraining, nor does it need explicit demographic labels in the patient record. Instead, it leverages linguistic cues that are already present in everyday triage documentation.
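The dual-inference and alerting steps above can be condensed into a minimal sketch. Here `llm_score`, `neutralize`, the alert callback, and the threshold value are all assumptions standing in for the deployed model, the Text‑Perturber, the dashboard integration, and a site‑specific policy.

```python
ALERT_THRESHOLD = 0.5  # assumed value; in practice set by site policy

def dual_inference(note, llm_score, neutralize, alert_fn):
    """Score the original and neutralized notes with the same model,
    compute the acuity delta, and escalate for clinical review when the
    delta crosses the threshold."""
    original = llm_score(note)
    neutral = llm_score(neutralize(note))
    delta = original - neutral
    if abs(delta) > ALERT_THRESHOLD:
        alert_fn(note, delta)  # e.g. push to the bias dashboard for review
    return original, delta
```

Because both passes hit the same black-box endpoint, this wrapper needs no access to model weights, which is the property that makes the audit viable for proprietary hospital deployments.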

Evaluation & Results
The researchers evaluated BATA on two real‑world datasets:
- Dataset A: 12,000 de‑identified triage notes from a metropolitan hospital, annotated with age, gender, and self‑reported ethnicity.
- Dataset B: 8,500 synthetic notes generated to reflect under‑represented demographic scenarios (e.g., non‑binary gender, limited English proficiency).
Key findings include:
- Replacing age‑related proxies (e.g., “elderly”) with neutral terms reduced the average predicted acuity by 0.42 levels for patients over 65, indicating systematic over‑triage.
- Gendered language (“he/she”) introduced a 0.18‑level shift favoring male patients in pain‑assessment scores.
- Ethnicity proxies (“Spanish‑speaking”) caused a 0.31‑level increase in urgency classification, suggesting a bias toward higher resource allocation for certain language groups.
- Overall, BATA identified statistically significant disparities (p < 0.01) across all protected groups, while traditional accuracy metrics remained unchanged.
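One way to test significance on per-note acuity deltas (original minus neutralized) is a sign-flip permutation test; the paper's exact statistical procedure is not detailed here, so this is an illustrative choice, stdlib-only.

```python
import random
import statistics

def paired_permutation_test(deltas, n_iter=10_000, seed=0):
    """Two-sided sign-flip permutation test: under the null hypothesis that
    the proxy has no effect, each per-note delta is equally likely to be
    positive or negative, so a mean far from zero is evidence of bias."""
    rng = random.Random(seed)
    observed = statistics.mean(deltas)
    extreme = 0
    for _ in range(n_iter):
        flipped = [d * rng.choice((-1, 1)) for d in deltas]
        if abs(statistics.mean(flipped)) >= abs(observed):
            extreme += 1
    return extreme / n_iter  # estimated p-value
```

A batch of consistently positive deltas (systematic over‑triage) yields a small p‑value, while deltas scattered around zero do not, mirroring how BATA can flag disparities even when aggregate accuracy is unchanged.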
These results demonstrate that even high‑performing LLM triage models can embed hidden biases that only surface when examined through targeted counterfactual analysis.
Why This Matters for AI Systems and Agents
For practitioners building AI‑augmented clinical workflows, the implications are threefold:
- Risk mitigation: Early detection of proxy‑driven bias enables hospitals to intervene before adverse patient outcomes occur, aligning with regulatory expectations such as the U.S. FDA’s Good Machine Learning Practice guidelines.
- Design feedback: By quantifying which tokens exert the strongest bias, developers can refine prompt engineering, adjust token embeddings, or introduce fairness‑aware fine‑tuning without overhauling the entire model.
- Operational transparency: Integrating BATA’s dashboards into existing agent orchestration platforms provides clinicians with real‑time explanations of why a particular acuity score may be inflated, fostering trust in AI‑assisted decision making.
In essence, the framework bridges the gap between black‑box LLM predictions and the accountability standards required for life‑critical applications.
What Comes Next
While BATA marks a significant step forward, several limitations remain:
- Lexicon coverage: The current proxy list captures common demographic cues but may miss nuanced or culturally specific expressions.
- Scalability of perturbations: Generating high‑quality counterfactuals for long, complex notes adds latency, a serious concern in fast‑moving ED settings.
- Generalization: The framework has been tested on English‑language notes; extending it to multilingual environments will require language‑specific token libraries.
Future research directions include:
- Automated discovery of proxy tokens using unsupervised clustering of embedding spaces.
- Integration of lightweight, on‑device perturbation models to keep inference time sub‑second.
- Coupling BATA with continuous bias monitoring tools that learn from post‑deployment feedback loops.
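The first of these directions might look like the following toy sketch: cluster token embeddings and inspect clusters whose members are demographically loaded, surfacing proxy candidates the hand-built lexicon missed. The embeddings here are invented 2‑D points; a real system would pull vectors from the LLM's embedding layer or a sentence encoder, and would cluster many thousands of tokens.

```python
import math
import random

# Invented toy embeddings: demographic terms vs. clinical terms occupy
# different regions of the (here 2-D) embedding space.
EMB = {
    "elderly": (0.90, 0.10), "retired": (0.85, 0.15), "senior": (0.92, 0.08),
    "chest": (0.10, 0.90), "pain": (0.15, 0.85), "fever": (0.08, 0.92),
}

def kmeans_tokens(emb, k, n_iter=25, seed=0):
    """Toy k-means over token embeddings; returns token -> cluster index.
    Clusters dominated by demographic terms are candidate proxy groups."""
    rng = random.Random(seed)
    tokens = list(emb)
    centers = [emb[t] for t in rng.sample(tokens, k)]
    assign = {}
    for _ in range(n_iter):
        assign = {t: min(range(k), key=lambda c: math.dist(emb[t], centers[c]))
                  for t in tokens}
        for c in range(k):
            members = [emb[t] for t in tokens if assign[t] == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return assign
```

On these toy points, `kmeans_tokens(EMB, 2)` groups "elderly"/"retired"/"senior" apart from the clinical terms; a human reviewer (or a labeled seed set) would then decide which clusters feed back into the proxy lexicon.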
By addressing these challenges, healthcare organizations can move toward truly equitable AI‑driven triage, ensuring that every patient receives care calibrated to clinical need rather than inadvertent demographic signals.
For a complete technical description, see the original arXiv paper.