- Updated: March 11, 2026
Multimodal Mixture-of-Experts with Retrieval Augmentation for Protein Active Site Identification
Direct Answer
The paper introduces Multimodal Mixture‑of‑Experts with Retrieval Augmentation (MERA), a novel framework that combines hierarchical retrieval‑augmented experts with a reliability‑aware fusion mechanism to pinpoint protein active sites at the residue level. By dynamically pulling in contextual information from related protein chains, sequences, and known active‑site patterns, MERA delivers state‑of‑the‑art accuracy while guarding against noisy or misleading modalities.
Background: Why This Problem Is Hard
Identifying the exact residues that constitute a protein’s active site is a cornerstone of functional annotation, enzyme engineering, and rational drug design. Yet the task remains stubbornly difficult for several intertwined reasons:
- Sparse training data. High‑resolution experimental annotations (e.g., X‑ray crystallography, cryo‑EM) cover only a fraction of the known proteome, leaving machine‑learning models to extrapolate from limited examples.
- Multimodal complexity. Active‑site determination benefits from diverse signals—3‑D structural geometry, evolutionary conservation, physicochemical properties, and ligand‑binding patterns. Integrating these modalities without diluting useful information is non‑trivial.
- Instance‑level variability. A single protein can present multiple functional pockets, and the same residue may be active in one context but inert in another, demanding fine‑grained, residue‑level predictions.
- Modality reliability. Not all data sources are equally trustworthy for every protein. For example, predicted structures may be high‑quality for some families but unreliable for others, and naïve fusion can let a weak modality dominate the decision.
Current approaches typically fall into two camps: (1) end‑to‑end deep models that ingest all modalities simultaneously, and (2) retrieval‑based methods that copy annotations from similar proteins. The former often suffers from over‑fitting to scarce data, while the latter lacks a principled way to weigh conflicting evidence from multiple modalities. Consequently, performance plateaus, especially on challenging peptide‑binding site tasks.
What the Researchers Propose
MERA tackles the above challenges with a three‑pronged strategy:
- Hierarchical multi‑expert retrieval. Instead of a monolithic model, MERA maintains a pool of specialized experts—each trained on a distinct perspective (chain‑level, sequence‑level, active‑site‑level). When a query protein arrives, the system retrieves the most relevant experts from each hierarchy, effectively borrowing knowledge from proteins that share similar structural or evolutionary contexts.
- Residue‑level mixture‑of‑experts gating. For every residue, a lightweight gating network decides how much weight to assign to each retrieved expert, allowing the model to adapt its focus locally (e.g., emphasizing structural cues for a buried residue while leaning on conservation for a surface‑exposed one).
- Reliability‑aware fusion via Dempster‑Shafer theory. To prevent noisy modalities from corrupting the final prediction, MERA quantifies each modality’s trustworthiness using belief mass functions. Learnable discounting coefficients modulate these masses, and Dempster’s rule of combination yields a fused belief that reflects both evidence strength and uncertainty.
In essence, MERA behaves like a team of domain experts that first gather the most relevant case studies, then let each expert speak on a per‑residue basis, and finally reconcile their opinions through a mathematically grounded trust framework.
How It Works in Practice
The operational pipeline can be broken down into four stages:
1. Input Encoding
The query protein is represented through three parallel streams:
- Structural stream – 3‑D coordinates encoded with a graph neural network.
- Sequence stream – Amino‑acid embeddings enriched by a transformer trained on massive protein databases.
- Active‑site stream – Sparse binary masks from known ligand‑binding residues, when available, processed by a lightweight CNN.
2. Hierarchical Retrieval
Each stream queries a dedicated index:
- Chain index – Retrieves proteins with similar overall folds.
- Sequence index – Finds homologs based on evolutionary distance.
- Active‑site index – Pulls examples where the same functional motif appears.
The top‑K matches from each index are passed to their corresponding expert networks, which have been pre‑trained on the retrieved subsets.
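To make the retrieval step concrete, here is a minimal sketch of a top‑K lookup against three separate indices. Everything in it is an illustrative assumption rather than the paper's implementation: the three‑dimensional embeddings, the protein names, the use of cosine similarity, and the helper names `cosine` and `top_k` are all hypothetical stand‑ins for whatever encoders and index structures MERA actually uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query, index, k):
    """Return the k (name, score) pairs most similar to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy indices, one per hierarchy level (values are made up).
indices = {
    "chain": {"protA": [1.0, 0.0, 0.0], "protB": [0.8, 0.6, 0.0], "protC": [0.0, 0.0, 1.0]},
    "sequence": {"protD": [0.0, 1.0, 0.0], "protE": [1.0, 1.0, 0.0], "protA": [1.0, 0.0, 0.0]},
    "active_site": {"protF": [0.0, 0.0, 1.0], "protB": [0.8, 0.6, 0.0], "protG": [0.0, 1.0, 1.0]},
}

# Hypothetical query embeddings, one per stream of the query protein.
queries = {
    "chain": [1.0, 0.1, 0.0],
    "sequence": [0.0, 1.0, 0.2],
    "active_site": [0.0, 0.0, 1.0],
}

# Each stream queries only its own index, mirroring the three-level hierarchy.
retrieved = {
    level: [name for name, _ in top_k(queries[level], indices[level], k=2)]
    for level in indices
}
print(retrieved)
```

A production system would likely swap the brute‑force scan for an approximate nearest‑neighbor index, but the shape of the step stays the same: one query per stream, one ranked list per hierarchy level.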
3. Residue‑Level Gating
A gating module examines the local context of each residue (its neighboring atoms, conservation score, etc.) and produces a softmax distribution over the retrieved experts. This distribution determines how much each expert contributes to the residue’s prediction.
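The gating idea can be sketched in a few lines. This is a simplified illustration, not the paper's gating network: it assumes a single linear layer, a hypothetical two‑feature residue context (burial and conservation), and made‑up weights and expert outputs. The names `gate_residue` and `mix` are invented for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate_residue(context, gate_weights):
    """Score each expert with a dot product against the residue context, then normalize."""
    logits = [sum(w * x for w, x in zip(row, context)) for row in gate_weights]
    return softmax(logits)

def mix(expert_probs, gate):
    """Gate-weighted average of the experts' active-site probabilities."""
    return sum(g * p for g, p in zip(gate, expert_probs))

# Hypothetical gate weights; context features are [burial, conservation].
gate_weights = [
    [2.0, 0.0],  # chain-level expert: favored for buried residues
    [0.0, 2.0],  # sequence-level expert: favored for conserved residues
    [1.0, 1.0],  # active-site-level expert: balanced
]

buried_residue = [0.9, 0.2]
exposed_residue = [0.1, 0.8]

gate_buried = gate_residue(buried_residue, gate_weights)
gate_exposed = gate_residue(exposed_residue, gate_weights)

# Hypothetical per-expert probabilities that this residue is in an active site.
expert_probs = [0.7, 0.3, 0.5]
print(gate_buried, mix(expert_probs, gate_buried))
print(gate_exposed, mix(expert_probs, gate_exposed))
```

With these toy numbers the buried residue routes most of its weight to the chain‑level expert and the exposed residue to the sequence‑level expert, which is the local adaptation the gating stage is meant to provide.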
4. Reliability‑Aware Fusion
Each modality outputs a belief vector (active‑site vs. non‑active). Dempster‑Shafer theory converts these vectors into belief masses, applies learnable discount factors that down‑weight uncertain modalities, and finally combines them into a single posterior probability per residue. The result is a calibrated confidence score that reflects both evidence and its reliability.
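A small worked example helps show what discounting and Dempster's rule do to conflicting evidence. The sketch below uses the standard textbook forms of Shafer discounting and Dempster's rule of combination on a two‑hypothesis frame (active, inactive) plus an explicit uncertainty mass `theta`. The input masses and reliability coefficients are invented for illustration, and the final pignistic transform is one common way to turn masses into a probability; it is an assumption here, not necessarily the conversion MERA uses.

```python
def discount(mass, alpha):
    """Shafer discounting: keep a fraction alpha of committed mass, move the rest to 'theta'."""
    return {
        "active": alpha * mass["active"],
        "inactive": alpha * mass["inactive"],
        "theta": 1.0 - alpha + alpha * mass["theta"],
    }

def combine(m1, m2):
    """Dempster's rule of combination on the frame {active, inactive}."""
    conflict = m1["active"] * m2["inactive"] + m1["inactive"] * m2["active"]
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; combination is undefined")
    norm = 1.0 - conflict
    fused = {}
    for h in ("active", "inactive"):
        fused[h] = (m1[h] * m2[h] + m1[h] * m2["theta"] + m1["theta"] * m2[h]) / norm
    fused["theta"] = m1["theta"] * m2["theta"] / norm
    return fused

def pignistic_active(mass):
    """Assumed conversion to a probability: split the uncertainty mass evenly."""
    return mass["active"] + mass["theta"] / 2.0

# Hypothetical evidence for one residue: the streams disagree.
structure = {"active": 0.2, "inactive": 0.8, "theta": 0.0}
sequence = {"active": 0.9, "inactive": 0.1, "theta": 0.0}

# Hypothetical reliability coefficients (learned in MERA, fixed by hand here).
fused = combine(discount(structure, 0.3), discount(sequence, 0.9))
undiscounted = combine(structure, sequence)

print(fused, pignistic_active(fused))
print(undiscounted)
```

Because the structural stream is discounted heavily, most of its mass moves to `theta` and the fused belief follows the sequence stream more closely than it would without discounting. This matches the behavior described above for proteins with unreliable structural models.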
Evaluation & Results
MERA was benchmarked on two publicly available datasets:
- ProTAD‑Gen – A large collection of protein structures with experimentally validated active sites.
- TS125 – A curated set of peptide‑binding proteins, representing a particularly tough prediction scenario.
Key evaluation metrics included Area Under the Precision‑Recall Curve (AUPRC) and residue‑level F1 score. The experimental protocol followed standard cross‑validation splits, ensuring no overlap between training and test proteins.
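For readers less familiar with these metrics, the sketch below computes both on a toy set of six residues. The labels, scores, and the 0.5 decision threshold are made up. AUPRC is estimated here with average precision, which is one common estimator; the article does not say which estimator the paper uses, so treat that choice as an assumption.

```python
def f1_score(y_true, y_pred):
    """Residue-level F1 from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def average_precision(y_true, scores):
    """Average precision: mean of precision values at the rank of each true positive.

    Assumes there are no tied scores.
    """
    positives = sum(y_true)
    if positives == 0:
        return 0.0
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / positives

# Toy example: 1 marks an active-site residue.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

ap = average_precision(y_true, scores)
f1 = f1_score(y_true, y_pred)
print(round(ap, 4), round(f1, 4))
```

Both metrics focus on the positive class, which suits active‑site prediction because active residues are typically a small minority of each protein.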
Performance Highlights
- On ProTAD‑Gen, MERA achieved an AUPRC of **0.90**, surpassing the previous best (0.82) by a margin of 8 percentage points.
- For peptide‑binding site identification in TS125, MERA improved F1 from 0.68 to **0.77**, demonstrating robustness on a domain where modality noise is especially high.
- Ablation studies revealed that removing the retrieval component dropped AUPRC by 5 points, while disabling Dempster‑Shafer fusion caused a 3‑point decline, confirming that both innovations contribute materially.
Beyond raw scores, qualitative analysis showed that MERA could correctly flag active residues even when the structural model was low‑resolution, thanks to the reliability‑aware fusion that leaned on sequence conservation in those cases.
Why This Matters for AI Systems and Agents
From an AI engineering perspective, MERA exemplifies several design patterns that are increasingly valuable in complex, multimodal domains:
- Retrieval‑augmented inference. By pulling in external knowledge at inference time, much as retrieval‑augmented generation does for language models, MERA sidesteps the need for massive monolithic models, reducing compute costs and improving interpretability.
- Mixture‑of‑Experts gating at the granularity of individual tokens (residues). This fine‑grained routing enables agents to allocate resources dynamically, a principle that can be transplanted to language agents handling heterogeneous inputs.
- Evidence‑theoretic fusion. Dempster‑Shafer provides a principled way to handle uncertainty, which is directly applicable to autonomous agents that must fuse sensor data of varying reliability.
Practically, biotech firms can embed MERA into pipelines that screen millions of protein targets, confident that the model will self‑regulate when a modality (e.g., predicted structure) is dubious. This reduces false‑positive rates in downstream docking simulations, saving both time and experimental resources.
For developers building AI‑driven drug discovery platforms, MERA’s architecture aligns well with modular orchestration frameworks such as UBOS Orchestration, where each expert can be deployed as a micro‑service and the gating/fusion logic orchestrated at runtime.
What Comes Next
While MERA sets a new benchmark, several avenues remain open for exploration:
Limitations
- Dependency on retrieval indices. The quality of retrieved experts hinges on the comprehensiveness of the underlying databases. Rare or novel folds may still suffer from insufficient analogs.
- Scalability of gating. Per‑residue gating introduces overhead for very large proteins; future work could investigate hierarchical gating to amortize computation.
- Static discount learning. Current discount coefficients are learned during training and remain fixed at inference. Adaptive discounting based on real‑time confidence estimates could further improve robustness.
Future Research Directions
- Integrating large language model embeddings that capture functional annotations from scientific literature, expanding the modality space beyond structure and sequence.
- Extending the retrieval mechanism to cross‑species databases, enabling transfer learning from well‑studied model organisms to understudied pathogens.
- Applying reinforcement learning to let the gating policy evolve through interaction with downstream tasks such as virtual screening or enzyme design.
Potential Applications
- Real‑time active‑site annotation in cloud‑based protein design suites.
- Automated triage of candidate targets in early‑stage drug discovery pipelines.
- Enhanced interpretability tools for structural biologists, where the belief masses can be visualized alongside electron density maps.
Developers interested in prototyping MERA‑style pipelines can leverage UBOS Platform to spin up retrieval services and expert micro‑models, then connect them through a unified API. For teams focused on agent orchestration, the UBOS Agents framework offers ready‑made patterns for dynamic routing and evidence fusion.
References
For a complete technical description, see the original preprint: Multimodal Mixture-of-Experts with Retrieval Augmentation for Protein Active Site Identification.