- Updated: June 26, 2026
Counsel: A Meta-Evaluation Dataset for Agentic Tasks
Direct Answer
Counsel is the first publicly released meta‑evaluation dataset that records how open‑weight “LLM‑as‑judge” models critique the step‑by‑step reasoning of AI agents and how humans judge those critiques. By providing fine‑grained labels for error location and reasoning quality, Counsel makes it possible to calibrate, improve, and even train better LLM judges for complex, multi‑step agentic tasks.
Background: Why This Problem Is Hard
Modern AI agents—customer‑support bots, code‑generation assistants, autonomous planners—are no longer single‑turn predictors. They must execute long, branching sequences of actions, often interacting with external tools or APIs. Evaluating such trajectories is a two‑fold challenge:
- Human annotation bottleneck: Scoring a single agent run on benchmarks like tau‑bench or DA‑Code can require hours of careful reading, error spotting, and reasoning verification.
- Scalability of automated judges: Researchers increasingly rely on “LLM‑as‑judge” (LLMJ) systems to automatically critique agents, but the reliability of those critiques is rarely quantified. Without a ground truth, it is impossible to know whether a judge’s feedback is trustworthy or systematically biased.
Existing evaluation pipelines either accept noisy LLMJ scores as‑is or invest massive human labor to create gold‑standard annotations—both unsustainable as agentic workloads explode. A systematic, publicly available meta‑evaluation resource is therefore essential for the next generation of trustworthy AI agents.
What the Researchers Propose
The authors introduce Counsel, a meta‑evaluation dataset that captures three layers of information:
- Agent trajectories: Full process logs from two established benchmarks—tau‑bench (customer‑support dialogues) and DA‑Code (step‑wise programming tasks).
- LLMJ critiques: Process‑level feedback generated by several open‑weight LLM judges, each asked to flag errors, locate them within the trajectory, and provide reasoning.
- Human meta‑evaluations: Annotators assign each LLMJ critique one of three labels: “spot on,” “correct location but poor reasoning,” or “should not have flagged.”
This three‑tiered design lets researchers measure two distinct dimensions of judge quality: location accuracy (did the judge point to the right step?) and reasoning fidelity (is the justification sound?). By stratifying critiques along these axes, Counsel becomes a diagnostic tool for calibrating LLM judges.
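As a rough illustration of how the three labels map onto the two dimensions, the sketch below models one annotated critique in Python. The class and field names (`MetaLabel`, `Critique`, `trajectory_id`, and so on) are assumptions made for this example and are not taken from the released dataset's schema.

```python
from dataclasses import dataclass
from enum import Enum


class MetaLabel(Enum):
    """The three human meta-evaluation labels described in the article."""
    SPOT_ON = "spot on"
    POOR_REASONING = "correct location but poor reasoning"
    SHOULD_NOT_FLAG = "should not have flagged"


@dataclass(frozen=True)
class Critique:
    """One judge critique plus its human label (hypothetical field names)."""
    trajectory_id: str
    judge: str
    flagged_step: int
    rationale: str
    label: MetaLabel


def location_correct(label: MetaLabel) -> bool:
    # The judge pointed at the right step in both of these cases.
    return label in (MetaLabel.SPOT_ON, MetaLabel.POOR_REASONING)


def reasoning_sound(label: MetaLabel) -> bool:
    # Only "spot on" certifies the justification as well as the location.
    return label is MetaLabel.SPOT_ON
```

Under this reading, a critique labeled "correct location but poor reasoning" counts toward location accuracy but not toward reasoning fidelity, while "should not have flagged" counts toward neither.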
How It Works in Practice
The Counsel pipeline can be broken down into four conceptual stages:
1. Trajectory Collection
Researchers run agents on the two benchmarks, recording every intermediate observation, tool call, and final output. These logs constitute the raw material for evaluation.
2. Judge Generation
Open‑weight LLMs (e.g., Llama‑2‑70B, Mistral‑7B) are prompted to act as critics. For each trajectory, the judge produces a list of flagged errors, each accompanied by a short rationale. The authors vary the amount of “reasoning effort” (e.g., chain‑of‑thought prompting vs. single‑shot) to study its impact.
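A minimal sketch of this stage might look like the following. The prompt wording, the `reasoning_effort` values, and the JSON output format are all assumptions for illustration; the article does not reproduce the authors' actual prompts or output schema.

```python
import json


def build_judge_prompt(steps, reasoning_effort="cot"):
    """Assemble a critique prompt for one trajectory (hypothetical wording)."""
    numbered = "\n".join(f"[{i}] {step}" for i, step in enumerate(steps))
    if reasoning_effort == "cot":
        effort = "Think through the trajectory step by step before answering."
    else:
        effort = "Answer directly without intermediate reasoning."
    return (
        "You are reviewing an AI agent's trajectory.\n"
        f"{numbered}\n"
        f"{effort}\n"
        'Return a JSON list of objects with keys "step" and "rationale".'
    )


def parse_flags(raw, n_steps):
    """Keep only well-formed flags that point at an existing step."""
    try:
        flags = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(flags, list):
        return []
    return [
        f for f in flags
        if isinstance(f, dict)
        and isinstance(f.get("step"), int)
        and 0 <= f["step"] < n_steps
        and f.get("rationale")
    ]
```

Validating the flagged step index against the trajectory length matters here because the later human annotation judges location explicitly; a flag that points outside the trajectory cannot be assessed at all.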
3. Human Meta‑Annotation
Trained annotators review every judge‑generated critique. They assign one of three labels:
- Spot on: The error is correctly identified and the reasoning is sound.
- Correct location but poor reasoning: The judge points to the right step but offers a weak or partially incorrect justification.
- Should not have flagged: The judge either mis‑identifies a correct step as erroneous or flags something irrelevant.
Inter‑annotator agreement reaches a Krippendorff’s α of 0.78, indicating reliable human consensus.
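For readers unfamiliar with the statistic, here is a small sketch of Krippendorff's α for nominal labels, computed from a coincidence matrix. It assumes the three labels are treated as unordered categories; the article does not say which variant of α the authors used, and the toy ratings in the usage note are invented.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: one list of labels per critique, one label per annotator."""
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating carries no agreement information
        # Every ordered pair of ratings within a unit, weighted by 1/(m-1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    if n <= 1:
        return float("nan")  # no pairable ratings at all
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # undefined when only one category ever appears
    return 1 - observed / expected
```

With four critiques rated by two annotators each, `krippendorff_alpha_nominal([["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]])` works out to 16/30, or about 0.53. A value of 1 means perfect agreement and 0 means agreement no better than chance.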
4. Dataset Publication
The final artifact bundles raw trajectories, judge outputs, and human meta‑labels under a permissive license, enabling anyone to replay the evaluation, fine‑tune new judges, or benchmark alignment techniques.
Evaluation & Results
The authors conduct a series of controlled experiments to answer two core questions: (1) How does judge model capability affect alignment with human meta‑evaluations? and (2) Does prompting for more reasoning improve that alignment?
Key Findings
- Model capability matters: The most capable open‑weight judge (a 70B‑parameter model) achieved roughly 88% agreement on error location and 65% agreement on reasoning quality, outperforming smaller models by 12 to 20 percentage points.
- Reasoning effort pays off: Judges prompted with chain‑of‑thought style reasoning consistently outperformed single‑shot prompts, narrowing the gap to human judgments by up to 8% for reasoning quality.
- Error type distribution: Across both benchmarks, the majority of disagreements stemmed from “poor reasoning” rather than outright mis‑location, suggesting that future work should focus on improving the explanatory component of LLM judges.
- Cross‑benchmark consistency: Performance trends held for both customer‑support and coding tasks, indicating that the observed patterns are not domain‑specific.
These results demonstrate that Counsel can reliably differentiate between high‑ and low‑quality judges, providing a quantitative foundation for systematic improvement.
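One plausible way to derive per-judge numbers like these from the human labels is sketched below. It assumes that "location agreement" means the share of a judge's critiques whose location annotators accepted, and that "reasoning agreement" means the share labeled "spot on"; the paper's exact metric definitions may differ, and the rows in the usage note are invented toy data, not dataset values.

```python
from collections import defaultdict

# Labels under which annotators accepted the flagged location.
LOCATION_OK = {"spot on", "correct location but poor reasoning"}


def judge_agreement(rows):
    """rows: iterable of (judge_name, human_label) pairs."""
    counts = defaultdict(lambda: [0, 0, 0])  # total, location ok, spot on
    for judge, label in rows:
        c = counts[judge]
        c[0] += 1
        c[1] += label in LOCATION_OK
        c[2] += label == "spot on"
    return {
        judge: {"location": c[1] / c[0], "reasoning": c[2] / c[0]}
        for judge, c in counts.items()
    }


def gap_in_points(rate_a, rate_b):
    """Difference between two rates, expressed in percentage points."""
    return (rate_a - rate_b) * 100
```

For example, given four critiques from a hypothetical judge `big` (two "spot on", one "correct location but poor reasoning", one "should not have flagged") and two from `small` (one "correct location but poor reasoning", one "should not have flagged"), `judge_agreement` yields location rates of 0.75 and 0.5, a gap of 25 percentage points.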
Why This Matters for AI Systems and Agents
For practitioners building agentic products, reliable evaluation is as critical as model training. Counsel addresses three practical pain points:
- Rapid feedback loops: Instead of waiting hours for human review, developers can run an LLM judge, compare its output against Counsel’s human‑aligned baseline, and instantly gauge whether the judge is trustworthy enough for production use.
- Calibration of automated judges: By fine‑tuning a judge on the Counsel meta‑labels, teams can produce evaluators that are demonstrably aligned with human reasoning, reducing the risk of “over‑confident” critiques that mislead downstream pipelines.
- Alignment research acceleration: Counsel offers a standardized benchmark for the AI alignment community to test new prompting strategies, reward models, or reinforcement learning from human feedback (RLHF) pipelines.
These capabilities translate directly into more robust AI assistants, safer autonomous tools, and higher‑quality data for continuous learning. Companies looking to embed AI agents into customer‑facing workflows can leverage Counsel‑aligned judges to automatically flag problematic steps before they reach end users.
Explore how AI marketing agents can benefit from automated, human‑aligned evaluation, or see the impact on large‑scale deployments via the Enterprise AI platform by UBOS. For teams focused on orchestrating complex workflows, the Workflow automation studio provides built‑in hooks for integrating LLM judges trained on Counsel.
What Comes Next
While Counsel marks a significant step forward, several open challenges remain:
- Broader domain coverage: Current data spans customer support and code generation. Extending meta‑evaluation to domains such as scientific reasoning, robotics, or multimodal agents would test the generality of LLM judges.
- Dynamic evaluation: Agents often adapt in real time. Future datasets could capture live interaction loops where judges must evaluate not only static trajectories but also evolving policies.
- Fine‑grained reasoning metrics: The three‑point annotation scheme is a solid start, but richer taxonomies (e.g., logical fallacy types, causal chain errors) could enable more precise calibration.
- Integration with RL pipelines: Embedding Counsel‑aligned judges as reward models in reinforcement learning could close the loop between evaluation and policy improvement.
Developers interested in contributing new benchmarks or extending the dataset can start by reviewing the UBOS platform overview, which offers tools for data ingestion, annotation, and model serving. Start‑ups looking to prototype next‑generation agents can also leverage the UBOS for startups program to access compute resources and community support.
Conclusion
Counsel delivers the first large‑scale, publicly released meta‑evaluation resource for agentic tasks, pairing critiques from open‑weight LLM judges with reliable human judgments. By quantifying both error location and reasoning quality, it equips researchers and product teams with a concrete yardstick for calibrating LLM judges. The dataset’s permissive licensing and benchmark‑agnostic design invite community contributions, promising a virtuous cycle of better evaluators and more trustworthy AI agents.
For a deeper dive into the methodology and to download the full dataset, visit the UBOS homepage. The original research paper is available on arXiv.
