- Updated: June 22, 2026
Pruning and Distilling Mixture-of-Experts into Dense Language Models
Direct Answer
The paper introduces a systematic framework that converts a trained Mixture‑of‑Experts (MoE) language model into a conventional dense architecture by scoring, selecting, and grouping experts, then concatenating them into a single feed‑forward network and refining the result with knowledge distillation. This matters because it brings much of the quality of frontier MoE models to memory‑constrained deployments that can only serve dense models, while retaining most of the teacher's downstream accuracy.

Background: Why This Problem Is Hard
Mixture‑of‑Experts has become the de facto backbone of the largest language models. By routing each token to a small subset of specialized expert sub‑networks, MoE architectures grow total parameter count without a matching increase in per‑token compute, a trade‑off that dense models struggle to match. However, the very mechanism that drives their performance also creates a deployment bottleneck:
- Memory footprint: All expert parameters must be resident in RAM or GPU memory, even if a given token only activates a few of them.
- Runtime complexity: Dynamic routing adds latency and complicates inference pipelines, especially on edge devices or in multi‑tenant cloud environments.
- Tooling gaps: Existing compression techniques either reduce the number of experts (expert pruning) or shrink the storage per weight (quantization), but both leave the model in MoE form, so the routing logic and the need to load every remaining expert persist.
For enterprises that need to run large language models on commodity hardware, these constraints translate into higher infrastructure costs and limited scalability. The research community has therefore been searching for a way to retain MoE‑level quality while delivering a dense, “plug‑and‑play” model that fits within standard memory budgets.
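To make the memory‑footprint point concrete, here is a rough back‑of‑the‑envelope sketch. It assumes the nominal sizes suggested by the name Qwen3‑30B‑A3B (about 30 B total parameters, about 3 B active per token) and 2 bytes per weight; the exact counts differ slightly, and activations and KV cache are ignored.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in GiB (2 bytes = 16-bit weights)."""
    return num_params * bytes_per_param / 2**30


# Nominal sizes read off the model name "30B-A3B" (assumption, not exact counts).
TOTAL_PARAMS = 30e9   # every expert must be resident in memory
ACTIVE_PARAMS = 3e9   # parameters actually used for a single token

resident = weight_memory_gib(TOTAL_PARAMS)
active = weight_memory_gib(ACTIVE_PARAMS)
used_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"resident weights: {resident:.1f} GiB")
print(f"active per token: {active:.1f} GiB")
print(f"share of resident weights used per token: {used_fraction:.0%}")
```

Under these assumptions, roughly nine tenths of the resident weights sit idle for any given token, which is the gap a dense student is meant to close.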
What the Researchers Propose
The authors present a four‑stage conversion pipeline that systematically transforms an MoE model into a dense counterpart:
- Expert Scoring: Each expert is evaluated using a set of metrics that capture both its individual contribution to the teacher MoE and its diversity relative to other experts.
- Selection & Grouping: Based on the scores, a subset of experts is chosen and then clustered into groups that will later share a common feed‑forward block.
- Concatenation into a Dense FFN: The grouped experts are flattened and concatenated, forming a single, wide feed‑forward network that replaces the sparse routing logic.
- Knowledge Distillation: The dense model is fine‑tuned by distilling the outputs of the original MoE teacher over several billion tokens, aligning its predictions with the expert ensemble.
This framework is deliberately modular. Researchers can swap in different scoring functions, grouping algorithms, or magnitude‑scaling strategies, enabling a thorough exploration of the design space.
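The concatenation stage rests on a simple identity: stacking several experts' up‑projections and placing their down‑projections side by side yields one wide FFN whose output equals the sum of the individual expert outputs. The sketch below checks this with plain ReLU experts and tiny random matrices. The helper names (`ffn`, `concat_experts`) and the `scale` argument are illustrative assumptions, not the paper's code.

```python
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def ffn(w_up, w_down, x):
    """Two-layer feed-forward block: w_down @ relu(w_up @ x)."""
    return matvec(w_down, relu(matvec(w_up, x)))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def concat_experts(experts, scale=1.0):
    """Merge (w_up, w_down) pairs into one wide FFN along the hidden dimension.

    `scale` is a hypothetical stand-in for a magnitude-scaling choice.
    """
    w_up = [row for up, _ in experts for row in up]            # (k*h, d)
    d_model = len(experts[0][1])
    w_down = [
        [scale * w for _, down in experts for w in down[i]]    # (d, k*h)
        for i in range(d_model)
    ]
    return w_up, w_down


rng = random.Random(0)
D_MODEL, D_HIDDEN, NUM_EXPERTS = 4, 3, 2
experts = [
    (random_matrix(D_HIDDEN, D_MODEL, rng), random_matrix(D_MODEL, D_HIDDEN, rng))
    for _ in range(NUM_EXPERTS)
]
x = [rng.uniform(-1, 1) for _ in range(D_MODEL)]

# Sum of the individual expert outputs versus the single concatenated FFN.
expert_outputs = [ffn(up, down, x) for up, down in experts]
sum_of_experts = [sum(values) for values in zip(*expert_outputs)]
w_up, w_down = concat_experts(experts)
dense_out = ffn(w_up, w_down, x)

max_gap = max(abs(a - b) for a, b in zip(sum_of_experts, dense_out))
print(f"hidden width after concatenation: {len(w_up)}")
print(f"max difference vs. sum of experts: {max_gap:.2e}")
```

Note that the dense block computes an unweighted sum of every retained expert, whereas the MoE router weights only the few experts it activates. That mismatch is a plausible reason the pipeline includes magnitude scaling and a distillation stage. Production experts typically use gated activations, but the same argument applies because the gate acts per hidden unit.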
How It Works in Practice
Conceptual Workflow
In practice, the four stages expand into a six‑step pipeline once data collection is counted and selection and grouping are treated separately:
- Data Collection: A validation set is passed through the MoE teacher to record per‑expert activation statistics and output logits.
- Scoring Phase: Seven scoring methods (including the novel diversity‑aware score) are applied to rank experts by usefulness and uniqueness.
- Selection Phase: A target expert count (e.g., 4, 8, 12) is specified; the top‑ranked experts are retained.
- Grouping Phase: Five clustering strategies (e.g., k‑means on activation vectors, hierarchical agglomeration) partition the selected experts into groups that will share a dense sub‑layer.
- Concatenation Phase: Within each group, the weight matrices of the experts are concatenated along the hidden dimension, producing a single dense feed‑forward block.
- Distillation Phase: The dense model is trained on ~4 B tokens, using a combination of cross‑entropy loss against ground‑truth labels and a Kullback‑Leibler divergence term that forces the student to mimic the teacher’s softened logits.
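The distillation objective in the last step can be sketched for a single token as follows. This is a minimal illustration of the standard cross‑entropy plus temperature‑softened KL recipe, not the authors' training code. The mixing weight `alpha`, the `temperature`, and the customary `temperature**2` factor on the KL term are assumptions, since the article does not give these values.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    cross_entropy = -log_softmax(student_logits)[label]
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return alpha * cross_entropy + (1 - alpha) * temperature**2 * kl


# Toy three-token vocabulary; the numbers are made up for illustration.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 1.0, 0.0]
true_label = 0

mismatched = distillation_loss(student_logits, teacher_logits, true_label)
matched = distillation_loss(teacher_logits, teacher_logits, true_label)
print(f"loss when student differs from teacher: {mismatched:.4f}")
print(f"loss when student equals teacher:       {matched:.4f}")
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the weighted cross‑entropy remains.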
Component Interactions
Each component communicates through well‑defined data structures:
- Expert Profiles: JSON‑like objects containing activation frequency, loss contribution, and embedding vectors.
- Score Matrix: A 2‑D array where rows represent experts and columns represent scoring criteria; the final ranking is a weighted sum.
- Group Assignments: A mapping from expert IDs to group IDs, used to drive the concatenation logic.
- Distillation Buffer: Stores teacher logits for each token, enabling efficient batch‑wise KL loss computation.
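A rough sketch of how the score matrix and group assignments might fit together is shown below. The scores, criterion weights, and group IDs are invented for illustration, and the function names are hypothetical; the article only states that the ranking is a weighted sum and that group assignments map expert IDs to group IDs.

```python
from collections import defaultdict

# Rows are experts, columns are scoring criteria (hypothetical values).
score_matrix = [
    [0.9, 0.2],  # expert 0
    [0.4, 0.8],  # expert 1
    [0.1, 0.1],  # expert 2
    [0.7, 0.6],  # expert 3
]
criterion_weights = [0.5, 0.5]  # assumed equal weighting


def rank_experts(scores, weights):
    """Return expert IDs ordered by weighted score sum, best first."""
    totals = [sum(w * s for w, s in zip(weights, row)) for row in scores]
    return sorted(range(len(scores)), key=lambda e: totals[e], reverse=True)


def experts_by_group(selected, assignments):
    """Invert an expert -> group mapping into group -> [experts], keeping rank order."""
    groups = defaultdict(list)
    for expert_id in selected:
        groups[assignments[expert_id]].append(expert_id)
    return dict(groups)


ranking = rank_experts(score_matrix, criterion_weights)
selected = ranking[:3]  # keep the top three experts

# Hypothetical output of a grouping step: expert ID -> group ID.
group_assignments = {3: 0, 1: 1, 0: 0}
groups = experts_by_group(selected, group_assignments)

print("ranking:", ranking)
print("selected:", selected)
print("groups:", groups)
```

The inverted `groups` mapping is the form a concatenation step would consume: one list of expert IDs per dense sub‑layer.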
What Sets This Approach Apart
Prior work on MoE compression typically stops at “expert pruning,” leaving the model in a sparse state. The novelty here lies in the end‑to‑end dense reconstruction, coupled with a rigorous evaluation of scoring and grouping strategies. By treating the dense FFN as a learned composition of the most valuable experts, the method preserves the expressive power of the original MoE while eliminating the routing overhead.
Evaluation & Results
Experimental Setup
The authors benchmarked their pipeline on three state‑of‑the‑art MoE models:
- Qwen3‑30B‑A3B
- DeepSeek‑V2‑Lite
- GPT‑OSS‑20B
For each model they explored 350 configurations, varying:
- Scoring method (7 variants)
- Grouping algorithm (5 variants)
- Magnitude scaling (2 variants)
- Number of selected experts (multiple counts)
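As a sanity check on the configuration count, the snippet below works out how many expert counts are implied if the 350 configurations form a full Cartesian grid over the listed factors. The full‑grid assumption is mine; the article does not say how the configurations were enumerated.

```python
from itertools import product

NUM_SCORING = 7      # scoring methods
NUM_GROUPING = 5     # grouping algorithms
NUM_SCALING = 2      # magnitude-scaling variants
TOTAL_CONFIGS = 350  # figure quoted in the article

# Combinations available for one fixed number of selected experts.
per_expert_count = NUM_SCORING * NUM_GROUPING * NUM_SCALING

# Assumption: a full grid, so the total divides evenly by that product.
implied_expert_counts, remainder = divmod(TOTAL_CONFIGS, per_expert_count)

grid = list(product(
    range(NUM_SCORING),
    range(NUM_GROUPING),
    range(NUM_SCALING),
    range(implied_expert_counts),
))

print(f"combinations per expert count: {per_expert_count}")
print(f"implied number of expert counts: {implied_expert_counts} (remainder {remainder})")
print(f"grid size: {len(grid)}")
```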
Downstream performance was measured across a suite of standard NLP benchmarks (e.g., MMLU, GSM8K, TruthfulQA). All dense students were trained for roughly 4 B tokens, which the authors note is about 1.6× faster in wall‑clock time than a comparable dense‑to‑dense pruning baseline.
Key Findings
- Scoring dominates: The diversity‑aware scoring method consistently outperformed all other metrics, delivering an average gain of +3.8 percentage points in downstream accuracy.
- Dense‑vs‑dense baseline: When matched for total parameter count, the MoE‑to‑dense students beat traditional dense pruning by 6.3 percentage points on average, suggesting that the expert‑selection process adds value beyond raw size.
- Training efficiency: The distillation stage required 40 % fewer GPU‑hours than training a dense model from scratch, thanks to the strong teacher signal.
- Robustness across models: Gains were observed not only on the flagship Qwen3‑30B‑A3B but also on the smaller DeepSeek‑V2‑Lite and GPT‑OSS‑20B, indicating that the framework scales downwards.
Why the Results Matter
These outcomes demonstrate that a carefully curated subset of MoE experts can be re‑engineered into a dense network that retains most of the original model’s capabilities. For practitioners, this means:
- Access to MoE‑level performance on hardware that only supports dense inference.
- Reduced operational costs due to lower memory consumption and faster inference.
- A reproducible pipeline that can be applied to any future MoE release, future‑proofing deployment strategies.
Why This Matters for AI Systems and Agents
From a systems‑engineering perspective, the ability to replace an MoE with a dense counterpart reshapes several design decisions:
- Agent orchestration: Dense models simplify the routing logic in multi‑agent pipelines, allowing agents to be swapped in and out without worrying about expert‑selection latency.
- Scalable inference services: Cloud providers can host more instances per GPU, improving throughput for high‑volume applications such as real‑time translation or conversational assistants.
- Edge deployment: Companies can now embed near‑state‑of‑the‑art language capabilities into on‑device products, expanding the reach of AI‑driven features.
Practically, teams building AI‑enhanced workflows can integrate the resulting dense models into existing platforms without redesigning their inference stack. For example, the UBOS platform overview already supports plug‑and‑play dense models; the MoE‑to‑dense conversion makes it possible to upgrade to higher‑quality models without exceeding the platform’s memory limits.
Moreover, the Workflow automation studio can orchestrate richer language‑driven agents, such as AI marketing agents, with lower latency, enabling more responsive campaign generation and real‑time personalization.
What Comes Next
While the framework marks a significant step forward, several open challenges remain:
- Fine‑grained diversity metrics: Current scoring relies on activation statistics; future work could incorporate semantic diversity measured via representation similarity.
- Dynamic expert selection at inference: A hybrid approach that retains a small routing module could further boost efficiency for specific domains.
- Quantization synergy: Combining MoE‑to‑dense conversion with aggressive post‑training quantization may push memory footprints below 8 GB for 30B‑scale models.
- Broader benchmark coverage: Evaluating on multimodal tasks (e.g., vision‑language) would test the generality of the method.
Potential applications extend beyond pure NLP:
- Embedding the dense student into the ChatGPT and Telegram integration to deliver high‑quality conversational agents on low‑cost hardware.
- Using the dense model within the Chroma DB integration for semantic search pipelines that require fast vector encoding.
- Deploying voice‑enabled assistants using the ElevenLabs AI voice integration, where reduced latency directly improves user experience.
In summary, the MoE‑to‑dense conversion framework opens a practical pathway for enterprises to harness the power of expert‑rich language models without the traditional memory and latency penalties. As the research community refines scoring strategies and explores hybrid routing, we can expect an even tighter convergence between model quality and deployment efficiency.
For a deeper dive into the methodology and full experimental tables, consult the original arXiv paper.