- Updated: June 25, 2026
Building Agent Harnesses for Scientific Curation from Multimodal Sources
Direct Answer
The paper introduces Beaver, an agent harness that extracts structured, provenance‑rich information from scientific papers by integrating multimodal evidence (text, tables, figures) with a staged, auditable workflow. It matters because it raises the ceiling for automated literature curation, turning fragmented, dense research artifacts into reliable, machine‑readable records that can be trusted by downstream AI systems.
Background: Why This Problem Is Hard
Scientific discovery increasingly depends on the ability to synthesize findings across dozens—or hundreds—of papers. Human curators spend weeks extracting key attributes, normalizing units, and linking evidence across text, tables, and figures. Existing AI agents excel at single‑span extraction or simple question answering, but they stumble when the target information is:
- Distributed: Critical data lives in separate sections—methods in a paragraph, results in a table, and a crucial diagram in a figure.
- Dense: Academic prose is laden with domain‑specific jargon, abbreviations, and implicit assumptions that require domain knowledge to resolve.
- Cross‑modal: Accurate curation often demands reasoning that combines visual patterns (e.g., a heatmap) with textual descriptions.
- Auditable: Researchers need to trace each curated attribute back to its original evidence to verify correctness and comply with reproducibility standards.
Current pipelines either ignore non‑textual evidence, produce shallow copies of source sentences, or lack a systematic way to capture provenance. As a result, curated databases contain gaps, inconsistencies, and unverifiable entries—limitations that hinder large‑scale meta‑analyses, drug repurposing pipelines, and AI‑driven hypothesis generation.
What the Researchers Propose
Beaver reframes scientific curation as a harnessed workflow rather than a single monolithic model. The framework consists of three cooperating layers with clearly separated responsibilities:
- Frontier Agent: A language model that orchestrates the overall task, decides which evidence modality to query next, and assembles the final record.
- Multimodal Evidence Tooling: Specialized sub‑agents that can parse tables, interpret figures, and extract text spans, each returning a structured representation together with a provenance token.
- Task Scaffolding & Artifact‑Grounded Autoresearch: A set of prompts, validation checks, and iterative loops that guide the frontier agent through evaluate‑diagnose‑revise cycles, exposing failures at the granularity of individual stages.
By separating concerns—decision making, evidence retrieval, and validation—Beaver makes it possible to plug in better table readers or figure interpreters without redesigning the whole system. The harness also records every interaction as an artifact, enabling a transparent audit trail.
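The separation of concerns described above can be illustrated with a minimal Python sketch. It is not Beaver's actual code: the names `Evidence`, `Harness`, `stub_table_parser`, and `better_table_parser`, as well as the provenance token format, are hypothetical and chosen only to show how a tool can be swapped while every interaction is still logged as an artifact.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Evidence:
    """A structured value plus a provenance token pointing back at its source."""
    value: Any
    provenance: str  # hypothetical token format, e.g. "p4:table2:r3c1"


# An evidence tool maps an attribute name to a piece of evidence.
EvidenceTool = Callable[[str], Evidence]


@dataclass
class Harness:
    """Holds swappable tools and records every interaction as an artifact."""
    tools: Dict[str, EvidenceTool]
    artifacts: List[dict] = field(default_factory=list)

    def query(self, modality: str, attribute: str) -> Evidence:
        evidence = self.tools[modality](attribute)
        # Log the interaction so the run can be audited later.
        self.artifacts.append({
            "modality": modality,
            "attribute": attribute,
            "value": evidence.value,
            "provenance": evidence.provenance,
        })
        return evidence


def stub_table_parser(attribute: str) -> Evidence:
    # Stand-in for a real table reader (assumption for illustration).
    return Evidence(value=50.0, provenance="p4:table2:r3c1")


def better_table_parser(attribute: str) -> Evidence:
    # A drop-in replacement with the same call signature.
    return Evidence(value=50.0, provenance="p4:table2:r3c1:v2")


harness = Harness(tools={"table": stub_table_parser})
first = harness.query("table", "dosage_mg")

# Swapping the tool requires no change to the orchestration code.
harness.tools["table"] = better_table_parser
second = harness.query("table", "dosage_mg")
```

Because the harness depends only on the tool's call signature, replacing the table reader is a one-line change, and both queries still end up in `harness.artifacts`.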
How It Works in Practice
The operational flow can be visualized as a four‑stage pipeline:

Stage 1: Task Definition & Scaffolding
The user supplies a high‑level curation schema (e.g., “extract drug dosage, efficacy metric, and patient cohort size”). Beaver translates this schema into a series of micro‑tasks, each annotated with expected evidence types.
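As a rough sketch of this stage, a schema can be represented as a mapping from attribute names to the evidence types expected for each, and expanded into one micro-task per attribute. The `MicroTask` type, the `plan_micro_tasks` function, and the modality assignments below are assumptions for illustration; the source does not specify how Beaver encodes schemas.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MicroTask:
    attribute: str
    expected_modalities: Tuple[str, ...]


def plan_micro_tasks(schema: Dict[str, List[str]]) -> List[MicroTask]:
    """Expand a curation schema into one micro-task per attribute."""
    return [MicroTask(attr, tuple(mods)) for attr, mods in schema.items()]


# Hypothetical schema mirroring the example in the text.
schema = {
    "drug_dosage": ["text", "table"],
    "efficacy_metric": ["table", "figure"],
    "cohort_size": ["text"],
}
tasks = plan_micro_tasks(schema)
```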
Stage 2: Evidence Retrieval
For each micro‑task, the frontier agent dispatches a request to the appropriate multimodal tool:
- Text Extractor pulls sentences from the methods and results sections.
- Table Parser converts CSV‑like structures into key‑value pairs, handling merged cells and scientific notation.
- Figure Interpreter runs a vision model to detect axes, legends, and data points, then maps them to numeric values.
Every piece of retrieved data is tagged with a provenance identifier that points back to the exact page, region, and modality.
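The dispatch step might look like the following sketch, in which each tool either returns a value tagged with page, region, and modality or returns `None` so the next modality can be tried. The tool functions here are stubs with hard-coded results, and all names (`Provenance`, `Retrieved`, `retrieve`, `TOOLS`) are hypothetical rather than taken from Beaver.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class Provenance:
    page: int
    region: str
    modality: str


@dataclass(frozen=True)
class Retrieved:
    attribute: str
    value: object
    provenance: Provenance


Tool = Callable[[str], Optional[Retrieved]]


def text_extractor(attribute: str) -> Optional[Retrieved]:
    # Stub: pretend the cohort size is stated in the methods section.
    if attribute == "cohort_size":
        return Retrieved(attribute, 120, Provenance(3, "methods, para 2", "text"))
    return None


def table_parser(attribute: str) -> Optional[Retrieved]:
    # Stub: pretend the dosage appears in a results table.
    if attribute == "drug_dosage":
        return Retrieved(attribute, "50 mg", Provenance(5, "table 2, row 3", "table"))
    return None


TOOLS: Dict[str, Tool] = {"text": text_extractor, "table": table_parser}


def retrieve(attribute: str, modalities: Sequence[str]) -> Optional[Retrieved]:
    """Try each expected modality in order and return the first hit."""
    for modality in modalities:
        tool = TOOLS.get(modality)
        if tool is None:
            continue  # no tool registered for this modality
        result = tool(attribute)
        if result is not None:
            return result
    return None


dosage = retrieve("drug_dosage", ["text", "table"])
```

Here the text extractor finds nothing for `drug_dosage`, so the request falls through to the table parser, and the returned value carries a provenance record naming page 5 and the table region.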
Stage 3: Evaluation & Diagnosis
The harness runs validation scripts (unit checks, range checks, cross‑field consistency) on the assembled record. If a discrepancy is found—say, a dosage value that falls outside the reported range—the system flags the offending attribute and surfaces the underlying evidence.
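A minimal sketch of such validation scripts, assuming a record is a flat mapping of numeric fields: each check returns a message when it fails and `None` when it passes. The field names (`dosage_mg`, `responders`, `cohort_size`) and the bounds are invented for the example.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, float]
Check = Callable[[Record], Optional[str]]


def range_check(field_name: str, low: float, high: float) -> Check:
    """Build a check that flags a field falling outside [low, high]."""
    def check(record: Record) -> Optional[str]:
        value = record.get(field_name)
        if value is None:
            return None  # a missing field is not a range violation
        if not (low <= value <= high):
            return f"{field_name}={value} outside [{low}, {high}]"
        return None
    return check


def responders_within_cohort(record: Record) -> Optional[str]:
    """Cross-field consistency: responders cannot exceed the cohort size."""
    if "responders" in record and "cohort_size" in record:
        if record["responders"] > record["cohort_size"]:
            return "responders exceeds cohort_size"
    return None


def validate(record: Record, checks: List[Check]) -> List[str]:
    """Run every check and collect the failure messages."""
    return [msg for check in checks if (msg := check(record)) is not None]


checks = [range_check("dosage_mg", 10.0, 100.0), responders_within_cohort]
issues = validate(
    {"dosage_mg": 500.0, "cohort_size": 120, "responders": 80}, checks
)
```

With these inputs only the dosage check fails, so `issues` contains a single message naming `dosage_mg`, which is the attribute the harness would then trace back to its evidence.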
Stage 4: Revision & Finalization
Armed with diagnostic feedback, the frontier agent revisits the problematic micro‑task, possibly selecting an alternative evidence source (e.g., a supplementary table) or re‑prompting the language model for clarification. Once all checks pass, the curated record is emitted along with a complete provenance bundle.
What distinguishes Beaver from prior agents is the explicit, persistent artifact store that captures each stage’s output. This design enables a deterministic “evaluate‑diagnose‑revise” loop, turning what was previously a black‑box inference into a transparent, debuggable process.
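The loop itself can be sketched in a few lines, under the assumption that evaluation returns the names of flagged attributes and revision returns an updated record. The function name, the round limit, and the toy `evaluate` and `revise` callbacks are all hypothetical; in a real harness, revision would call back into the frontier agent and the evidence tools.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]


def evaluate_diagnose_revise(
    record: Record,
    evaluate: Callable[[Record], List[str]],
    revise: Callable[[Record, str], Record],
    max_rounds: int = 3,
) -> Tuple[Record, List[dict], bool]:
    """Loop until all checks pass or the round budget is used up."""
    artifacts: List[dict] = []
    for round_no in range(max_rounds):
        flagged = evaluate(record)
        # Persist a snapshot of every round for later inspection.
        artifacts.append(
            {"round": round_no, "record": dict(record), "flagged": list(flagged)}
        )
        if not flagged:
            return record, artifacts, True
        for attribute in flagged:
            record = revise(record, attribute)
    # Budget exhausted: report whether the last revision passes.
    return record, artifacts, not evaluate(record)


def evaluate(record: Record) -> List[str]:
    # Toy check: dosage must lie in an assumed plausible range.
    return [] if 10 <= record["dosage_mg"] <= 100 else ["dosage_mg"]


def revise(record: Record, attribute: str) -> Record:
    # Toy revision: pretend a supplementary table reports 50 mg.
    return {**record, attribute: 50.0}


final, artifacts, ok = evaluate_diagnose_revise(
    {"dosage_mg": 500.0}, evaluate, revise
)
```

The first round flags `dosage_mg`, the revision replaces it, and the second round passes, so the artifact list holds two snapshots that show exactly what changed and why.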
Evaluation & Results
The authors benchmarked Beaver on a curated dataset of 1,200 scientific papers covering drug trials, genomics, and materials science. The primary metric, Gold‑Referenced Attribute Score (GRAS), measures agreement between the system’s output and a gold‑standard curated record at the attribute level.
- Baseline agents (single‑modal LLMs with no provenance) achieved an average GRAS of 57.8.
- Beaver reached 81.0, a 23.2‑point absolute improvement over the baseline.
- Ablation studies showed:
  - Removing task scaffolding dropped GRAS to 71.2.
  - Disabling multimodal tooling reduced performance to 68.5.
  - Eliminating provenance traces lowered the score to 73.0.
- Attribute‑level analysis revealed the biggest gains on high‑value fields that required cross‑modal reasoning, such as “dose‑response curve parameters” and “material tensile strength” extracted from combined figure‑text evidence.
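To make the size of each effect explicit, the reported scores can be turned into point differences. The snippet below uses only the figures quoted above; the variable names are arbitrary.

```python
baseline = 57.8
full = 81.0
ablations = {
    "no task scaffolding": 71.2,
    "no multimodal tooling": 68.5,
    "no provenance traces": 73.0,
}

# Absolute gain of the full system over the baseline agents.
gain = round(full - baseline, 1)  # 23.2

# GRAS points lost when each component is removed.
drops = {name: round(full - score, 1) for name, score in ablations.items()}
# {"no task scaffolding": 9.8, "no multimodal tooling": 12.5,
#  "no provenance traces": 8.0}

# Component whose removal hurts the most.
largest = max(drops, key=drops.get)  # "no multimodal tooling"
```

By this arithmetic, removing multimodal tooling costs the most (12.5 points), followed by task scaffolding (9.8) and provenance traces (8.0), and every ablated variant still scores above the 57.8 baseline.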
These results demonstrate that a harness‑centric design not only boosts raw extraction accuracy but also delivers reliable traceability—an essential requirement for scientific databases and downstream AI models that depend on trustworthy inputs.
Why This Matters for AI Systems and Agents
For practitioners building AI‑driven research assistants, data pipelines, or knowledge graphs, Beaver offers a blueprint for turning fragmented scientific literature into clean, actionable data:
- Modular extensibility: Teams can swap in a better OCR engine or a domain‑specific figure parser without rewriting orchestration logic.
- Auditable pipelines: Provenance bundles satisfy compliance standards for reproducibility, a growing concern in regulated sectors like pharma and materials engineering.
- Iterative debugging: The evaluate‑diagnose‑revise loop mirrors human curation workflows, reducing the need for costly post‑hoc manual correction.
- Scalable orchestration: By treating each micro‑task as a discrete artifact, the harness can be parallelized across compute clusters, accelerating large‑scale literature reviews.
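The parallelization point can be sketched on a single machine with Python's standard `concurrent.futures` module. This is only an analogy for cluster-scale orchestration, and `run_micro_task` is a hypothetical placeholder for the real retrieval and validation work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def run_micro_task(attribute: str) -> Dict[str, str]:
    # Placeholder for evidence retrieval and validation of one attribute.
    return {"attribute": attribute, "status": "done"}


def run_all(attributes: List[str], max_workers: int = 4) -> List[Dict[str, str]]:
    """Run independent micro-tasks concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_micro_task, attributes))


results = run_all(["drug_dosage", "efficacy_metric", "cohort_size"])
```

`Executor.map` returns results in the order of the inputs, so the assembled record stays aligned with the schema even when tasks finish out of order.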
Organizations looking to embed scientific intelligence into their products can apply these principles on platforms such as UBOS, where workflow automation and multimodal data handling are already part of the service stack.
What Comes Next
While Beaver marks a significant step forward, several avenues remain open for research and product development:
- Domain adaptation: Fine‑tuning multimodal sub‑agents on niche corpora (e.g., crystallography images) could further improve cross‑modal reasoning.
- Human‑in‑the‑loop interfaces: Integrating a lightweight UI for curators to approve or override provenance decisions would blend automation with expert oversight.
- Open‑source tooling: Publishing the harness components as modular SDKs would accelerate community contributions and benchmark reproducibility.
- End‑to‑end integration: Embedding Beaver within a workflow automation studio could let non‑technical users define curation schemas via drag‑and‑drop, widening access to high‑quality scientific data.
Future work may also explore self‑supervised pretraining on multimodal scientific corpora, allowing the frontier agent to anticipate which evidence modality is most informative for a given attribute—a capability that could shrink the number of required micro‑tasks and further reduce latency.