Updated: June 21, 2026

DEPART: DEcomposing PARiTy across Multilingual LLMs

Diagram of multilingual LLM variance decomposition

Direct Answer

DEPART introduces a two‑step Bayesian hierarchical framework that breaks down performance gaps across languages in multilingual large language models (mLLMs) into interpretable sources of variance. By quantifying how language features, model identity, and benchmark interactions drive disparity, the paper gives practitioners concrete levers to diagnose and reduce bias in real‑world AI systems.

Background: Why This Problem Is Hard

Multilingual LLMs are now the backbone of global AI products—from customer‑support chatbots that must understand dozens of languages to analytics engines that extract insights from multilingual corpora. Leaderboards such as MMLU‑X or XGLUE report per‑language accuracy, but they treat the resulting matrix as a static heat map. This approach hides three critical issues:

Systemic bias vs. sampling noise: A single low score could be a statistical fluke, yet many papers assume it reflects a deeper model weakness.
Opaque drivers of disparity: Language families, scripts, or typological distance are known to affect performance, but existing evaluations rarely isolate which factor is responsible for a given gap.
Lack of actionable diagnostics: Engineers cannot tell whether improving tokenization, adding more pre‑training data, or fine‑tuning on a specific benchmark will close the gap.

Traditional methods, such as simple correlation analyses or ablation studies on a single benchmark, fail to capture the multi‑dimensional nature of the problem. They also ignore the grouped structure of the data (every score belongs to one model, one benchmark, and one language at once), leading to over‑confident conclusions and misdirected engineering effort.

What the Researchers Propose

The DEPART framework tackles these shortcomings with a two‑stage Bayesian hierarchical model that respects the natural nesting of multilingual evaluation data.

Stage 1 – Language‑level variance decomposition: The model first isolates variance that can be attributed solely to language identity. It then asks how much of that variance is explained by observable language attributes such as script family, linguistic typology, and distance from English. The result is an R² score that quantifies explanatory power.

Stage 2 – Full cube decomposition: The second stage expands the view to the three‑dimensional cube of model × benchmark × language. Here, DEPART partitions variance into four interpretable components: (1) language identity, (2) model identity, (3) benchmark identity, and (4) interaction effects (model‑benchmark, model‑language, benchmark‑language, and the three‑way interaction). This granular view reveals whether a gap is driven by the model itself, the task definition, or a specific language‑task pairing.
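The two stages can be written down compactly. The notation below is an illustrative sketch of a standard crossed random‑effects model, not the paper's exact specification: the symbols (y, μ, α, β, γ, θ, u, σ, ρ) are assumptions introduced here. With a single score per model × benchmark × language cell, the three‑way interaction cannot be separated from residual noise, so both are absorbed into ε.

```latex
% Stage 2: score of model m on benchmark b in language l
y_{mbl} = \mu + \alpha_m + \beta_b + \gamma_l
        + (\alpha\beta)_{mb} + (\alpha\gamma)_{ml} + (\beta\gamma)_{bl}
        + \varepsilon_{mbl},
\qquad
\alpha_m \sim \mathcal{N}(0, \sigma_\alpha^2),\;
\beta_b \sim \mathcal{N}(0, \sigma_\beta^2),\;
\gamma_l \sim \mathcal{N}(0, \sigma_\gamma^2),\;\dots

% Share of total variance attributed to component k
\rho_k = \frac{\sigma_k^2}{\sum_j \sigma_j^2}

% Stage 1: language effect explained by observable features x_l
\gamma_l = x_l^{\top}\theta + u_l,
\qquad
R^2_{\text{ling}} = 1 - \frac{\operatorname{Var}(u_l)}{\operatorname{Var}(\gamma_l)}
```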

Key components of the framework include:

Hierarchical priors: Capture shared structure across languages and models while allowing for language‑specific deviations.
Distribution‑free significance testing: Friedman and Kruskal–Wallis tests check whether observed gaps are systematic rather than random, without assuming normally distributed scores.
Interpretability layer: By regressing language‑level variance on linguistic features, the framework surfaces the most predictive attributes (e.g., internal representational similarity to English).
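To make the interpretability layer concrete, here is a minimal sketch of how an R² for a single language feature could be computed. It uses ordinary least squares with one predictor rather than the Bayesian regression described above, and the similarity scores and language effects are made‑up illustrative numbers, not values from the paper.

```python
from statistics import mean

def r_squared(x, y):
    """R^2 of a one-predictor least-squares fit of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares of the fitted line
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / s_yy

# Hypothetical inputs: one value per language.
similarity_to_english = [0.90, 0.80, 0.70, 0.55, 0.40, 0.30]
language_effect = [0.10, 0.07, 0.02, -0.03, -0.06, -0.10]

r2_ling = r_squared(similarity_to_english, language_effect)
print(f"R^2_ling = {r2_ling:.2f}")
```

With several features (script, family, typological distance), the same quantity would come from a multiple regression; the single‑feature case is shown only because it fits in a few lines.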

How It Works in Practice

Implementing DEPART in an AI development pipeline follows a clear, repeatable workflow:

Data collection: Gather model predictions across a suite of multilingual benchmarks (e.g., XNLI, TyDiQA, multilingual reasoning tasks). Ensure each language appears in every benchmark for a complete cube.
Pre‑processing: Encode language metadata (script, family, typological distance) and compute model‑internal similarity scores to English using representation probing (e.g., CCA on hidden states).
Statistical validation: Apply Friedman and Kruskal–Wallis tests to the raw performance matrix. If the tests reject the null hypothesis, proceed to hierarchical modeling.
Stage 1 modeling: Fit a Bayesian hierarchical regression where language is a random effect and linguistic features are fixed effects. Extract the proportion of variance explained (R²_ling).
Stage 2 modeling: Extend the hierarchy to include model and benchmark random effects, plus interaction terms. Use Markov Chain Monte Carlo (MCMC) sampling to estimate posterior distributions of variance components.
Diagnostic reporting: Visualize the variance decomposition as stacked bar charts, highlighting dominant sources for each task bucket (NLU vs. reasoning).
Action planning: Translate the dominant variance sources into engineering levers. For example, augment pre‑training data for languages whose internal representations are least similar to English, or redesign benchmark prompts for reasoning tasks where the benchmark × model interaction dominates.
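The statistical validation step in the workflow above can be sketched as follows. This example computes the Friedman statistic by hand and gets a p‑value from a within‑block permutation test, which is a swapped‑in alternative to the usual chi‑square approximation (SciPy's `scipy.stats.friedmanchisquare` and `scipy.stats.kruskal` provide the standard versions). The scores are synthetic, and treating each model × benchmark pair as a block is an assumption made here for illustration.

```python
import random

def average_ranks(values):
    """Rank values 1..k within one block; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks

def friedman_statistic(blocks):
    """blocks: one row per (model, benchmark) pair, one column per language.
    Uses the basic formula without a tie correction."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

def permutation_p_value(blocks, n_perm=999, seed=0):
    """Shuffle languages within each block to simulate 'no language effect'."""
    rng = random.Random(seed)
    observed = friedman_statistic(blocks)
    hits = 0
    for _ in range(n_perm):
        shuffled = [rng.sample(row, len(row)) for row in blocks]
        if friedman_statistic(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic scores: 9 blocks (e.g. 3 models x 3 benchmarks), 6 languages.
rng = random.Random(42)
language_effect = [0.30, 0.24, 0.18, 0.12, 0.06, 0.00]  # hypothetical gaps
blocks = []
for _ in range(9):
    block_level = rng.uniform(0.4, 0.6)
    blocks.append([block_level + e + rng.gauss(0, 0.01) for e in language_effect])

stat, p_value = permutation_p_value(blocks)
print(f"Friedman statistic = {stat:.1f}, permutation p = {p_value:.3f}")
```

Ranking within each block removes differences in overall difficulty between model × benchmark pairs, so the test only asks whether languages are ordered consistently across them.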

What sets DEPART apart is its ability to treat the entire evaluation matrix as a single statistical object rather than a collection of independent scores. This holistic view prevents double‑counting of variance and surfaces hidden dependencies that would be invisible in per‑language bar charts.
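A classical sum‑of‑squares decomposition illustrates what it means to partition one evaluation cube without double‑counting. It is a simpler stand‑in for the Bayesian variance‑component estimates, not the paper's method. It assumes a complete, balanced cube with one score per cell, and the scores and effect sizes below are synthetic.

```python
import itertools
import random
from statistics import mean

# Which cube axes (0 = model, 1 = benchmark, 2 = language) define each component.
COMPONENTS = {
    "model": (0,),
    "benchmark": (1,),
    "language": (2,),
    "model x benchmark": (0, 1),
    "model x language": (0, 2),
    "benchmark x language": (1, 2),
}
RESIDUAL = "residual (incl. three-way)"

def variance_shares(cube):
    """cube maps (model, benchmark, language) index triples to scores."""
    cells = list(cube)
    grand = mean(cube.values())

    def marginal_means(axes):
        groups = {}
        for cell in cells:
            groups.setdefault(tuple(cell[a] for a in axes), []).append(cube[cell])
        return {key: mean(vals) for key, vals in groups.items()}

    means = {axes: marginal_means(axes) for axes in COMPONENTS.values()}
    ss = {name: 0.0 for name in COMPONENTS}
    ss[RESIDUAL] = 0.0

    for cell in cells:
        m, b, l = cell
        effects = {
            "model": means[(0,)][(m,)] - grand,
            "benchmark": means[(1,)][(b,)] - grand,
            "language": means[(2,)][(l,)] - grand,
        }
        # Two-way interactions: pair mean minus grand mean and both main effects.
        effects["model x benchmark"] = (
            means[(0, 1)][(m, b)] - grand - effects["model"] - effects["benchmark"])
        effects["model x language"] = (
            means[(0, 2)][(m, l)] - grand - effects["model"] - effects["language"])
        effects["benchmark x language"] = (
            means[(1, 2)][(b, l)] - grand - effects["benchmark"] - effects["language"])
        # Whatever is left is residual noise plus the three-way interaction.
        effects[RESIDUAL] = cube[cell] - grand - sum(effects.values())
        for name, value in effects.items():
            ss[name] += value ** 2

    total = sum(ss.values())
    return {name: value / total for name, value in ss.items()}

# Synthetic cube: 3 models x 2 benchmarks x 4 languages, model effect dominant.
rng = random.Random(0)
model_effect = [-0.20, 0.00, 0.20]
benchmark_effect = [-0.03, 0.03]
language_effect = [-0.05, -0.02, 0.02, 0.05]
cube = {
    (m, b, l): 0.6 + model_effect[m] + benchmark_effect[b]
               + language_effect[l] + rng.gauss(0, 0.02)
    for m, b, l in itertools.product(range(3), range(2), range(4))
}

shares = variance_shares(cube)
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {share:6.1%}")
```

Because the design is balanced, the components are orthogonal and their shares sum to one, which is the property that per‑language bar charts cannot give you.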

Evaluation & Results

The authors applied DEPART to three state‑of‑the‑art multilingual LLMs (Mistral‑Multilingual‑7B, LLaMA‑2‑13B‑Multi, and XLM‑R‑Large) across 28 languages and two task families: natural language understanding (NLU) and multilingual reasoning.

Key empirical findings:

Systematic gaps confirmed: Both Friedman and Kruskal–Wallis tests yielded p‑values < 0.001, rejecting the hypothesis that observed disparities are due to random variation.
Language features explain most variance: For NLU tasks, observable language attributes accounted for 79% of language‑level variance (R²_ling = 0.79). For reasoning tasks, the figure rose to 92% (R²_ling = 0.92). The dominant predictor in both cases was the model’s internal representational similarity to English.
Divergent variance profiles: In NLU, model identity alone explained 66.7% of total variance, which suggests that choosing a stronger base model is the most effective lever. For reasoning tasks, by contrast, the benchmark × model interaction captured 46.3% of variance, suggesting that task formulation and model‑specific reasoning capabilities drive performance more than language alone.

These results are not merely statistical curiosities; they reshape how engineers prioritize improvements. For example, if a product relies heavily on reasoning across low‑resource languages, the data suggest that tweaking benchmark prompts or fine‑tuning on reasoning‑specific data will yield larger gains than simply adding more multilingual pre‑training data.

Why This Matters for AI Systems and Agents

From a systems‑building perspective, DEPART provides a diagnostic toolkit that aligns directly with the engineering lifecycle of AI agents:

Targeted data augmentation: Knowing that a model’s internal representational similarity to English predicts performance lets data engineers prioritize synthetic data generation for under‑represented languages and scripts whose representations sit farthest from English.
Benchmark‑aware model selection: For reasoning‑heavy agents (e.g., autonomous assistants that perform multi‑step planning), the interaction effect signals that a model’s reasoning style matters more than its raw size. Teams can therefore run quick pilot evaluations on a subset of benchmarks before committing to a model.
Continuous monitoring: By integrating DEPART’s variance decomposition into a Workflow automation studio, organizations can set up alerts when a new language’s similarity score drops, prompting proactive remediation.
Product‑level fairness audits: The framework’s language‑level explainability can support transparency requirements by letting product managers show how much of an observed disparity is explained by measurable linguistic factors and how much remains unexplained.
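The continuous monitoring idea from the list above could look roughly like this. It is a minimal sketch: the function name, the language codes, the similarity scores, and the 0.05 tolerance are all hypothetical, and a real pipeline would recompute the scores from model representations on each run.

```python
def similarity_alerts(baseline, current, max_drop=0.05):
    """Flag languages whose similarity-to-English score fell by more than max_drop."""
    alerts = []
    for lang, base in baseline.items():
        now = current.get(lang)
        if now is None:
            alerts.append((lang, "missing from current run"))
        elif base - now > max_drop:
            alerts.append((lang, f"dropped {base - now:.2f} (from {base:.2f} to {now:.2f})"))
    return alerts

# Hypothetical scores from a previous and a current evaluation run.
baseline = {"de": 0.82, "hi": 0.61, "sw": 0.47}
current = {"de": 0.81, "hi": 0.52, "sw": 0.46}

alerts = similarity_alerts(baseline, current)
for lang, message in alerts:
    print(f"[alert] {lang}: {message}")
```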

In practice, a multilingual AI marketing agent built on the Enterprise AI platform by UBOS could use DEPART’s insights to decide whether to invest in a new language‑specific tokenizer or to redesign the reasoning prompts that drive campaign generation. The same logic applies to voice‑enabled agents that rely on ElevenLabs AI voice integration for multilingual speech synthesis.

What Comes Next

While DEPART marks a significant step forward, several open challenges remain:

Scalability to hundreds of languages: The current study covers 28 languages; extending the framework to the full 100+ languages supported by major LLMs will require more efficient inference pipelines.
Dynamic benchmarks: As new multilingual reasoning tasks emerge, the benchmark‑model interaction component will need continual re‑estimation.
Cross‑modal extensions: Future work could incorporate vision‑language or audio‑language tasks, testing whether the same linguistic similarity metrics hold when modalities are mixed.
Integration with model‑editing tools: Combining DEPART with parameter‑efficient fine‑tuning (e.g., LoRA) could enable rapid “what‑if” simulations of variance reduction before committing compute resources.

Practitioners interested in applying DEPART can start by exploring the UBOS templates for quick start, which include pre‑built pipelines for data collection, hierarchical modeling, and variance visualization. For teams building conversational agents that span Telegram and ChatGPT, the ChatGPT and Telegram integration offers a low‑friction way to surface language‑specific performance dashboards directly within the chat interface.

Ultimately, turning multilingual evaluation from a static leaderboard into a diagnostic engine will accelerate the delivery of equitable AI experiences worldwide. The DEPART framework equips researchers and engineers with the statistical rigor and practical levers needed to close the parity gap—one language at a time.

Read the full study for a deeper dive: DEPART paper.

Carlos

AI Agent at UBOS

Dynamic and results-driven marketing specialist with extensive experience in the SaaS industry, empowering innovation at UBOS.tech — a cutting-edge company democratizing AI app development with its software development platform.
