✨ From vibe coding to vibe deployment. UBOS MCP turns ideas into infra with one message.

Learn more
Andrii Bidochko
  • Updated: June 12, 2026
  • 19 min read

How to Evaluate AI Agent Quality: The Complete Guide for Hermes and UBOS Deployments (2026)

UBOS Partner Spotlight

This article was co-authored with our partner BenchGen — the AI agent evaluation and fine-tuning platform purpose-built for Hermes deployments.

What BenchGen is: BenchGen evaluates AI agents in interactive environments, captures full decision trajectories through the Atropos RL framework, and turns every benchmark run into training data. It is the only evaluation product that closes the loop from production trajectories → quality scoring → LoRA fine-tuning → improved agent — what the team calls the eval-to-train loop.

What the product does: Run pip install benchgen && benchgen scan against any Hermes deployment and receive a quality report in under five minutes — an overall 0–100 score, per-dimension breakdowns across tool-call accuracy, goal completion, error recovery, skill coverage, and memory utilisation, and a ranked list of your top failure patterns with specific trajectory examples. Free tier, no account required for the first scan.

Why UBOS partnered with BenchGen: UBOS builds the infrastructure for enterprise AI agent deployments. BenchGen measures whether those agents are working well enough to trust in production. Together, the workflow is complete: deploy on UBOS, evaluate and improve with BenchGen, fine-tune on UBOS’s BYOC GPU infrastructure. Teams that combine both tools consistently move agents from internal experiment to approved production system faster than teams relying on either alone. Read the Hermes-specific setup guide →

Quick answer: AI agent quality evaluation measures five dimensions — tool-call accuracy, goal completion rate, error recovery rate, skill coverage, and memory utilisation — across recorded decision trajectories. A new deployment typically scores 65–70 out of 100. After 30 days of skill accumulation it reaches ~80. After a targeted fine-tuning cycle it can reach 89–94. The gap between those numbers is the difference between a demo and a production system.

If you have deployed an AI agent on UBOS or are running Hermes Agent for enterprise automation, you have probably noticed the same thing every team notices around week three: the agent works, mostly, but inconsistently. It handles five tasks in a row, then fails on the sixth for reasons you cannot immediately explain. You tweak a prompt. It gets better. Then something else breaks.

This is not a bug. It is the absence of measurement. You cannot systematically improve something you are not systematically measuring. This guide explains exactly how to measure AI agent quality, what the numbers mean, and how to use them to move an agent from a convincing demo into a reliable production system.



Why 88% of AI agent pilots never reach production

The statistic is cited in every enterprise AI conversation in 2026: 88% of agent pilots never reach production. Forrester and Anaconda research first surfaced the number, and independent surveys by a16z and the MIT Sloan CIO panel have since replicated it. The single largest reason — named by 64% of enterprises as the leading blocker — is evaluation and observability. Teams cannot tell whether their agent is working well enough to trust.

There is also a 37% gap between lab benchmark scores and real-world deployment performance. An agent that looks excellent in a controlled demonstration degrades unpredictably on real traffic, because benchmarks test a narrow, curated distribution and production does not. The only way to close that gap is to evaluate the agent on actual production tasks, with actual production inputs, over enough time to see the distribution it will really face.

The teams whose agents do reach production share a consistent profile. They measure continuously, not just before launch. They distinguish between the types of failure and route fixes to the right layer. And they treat evaluation output as training input — not just a dashboard.

UBOS’s multi-agent platform makes it straightforward to connect agents to enterprise systems. Evaluation is what determines whether those connections are reliable enough to stake real workflows on.


The five dimensions of agent quality

A single quality score is useful for tracking trends, but it obscures diagnosis. Evaluation needs to decompose into dimensions that map to specific failure causes. The five dimensions used in production agent evaluation are:

1. Tool-call accuracy

Tool-call accuracy is the percentage of agent tool invocations that use the correct tool name, parameter format, and valid parameter values on the first attempt — without requiring a retry or correction. It is the primary cost driver: each failed call adds a retry turn to the trajectory, increasing inference spend. New Hermes deployments on domain-specific tasks typically start at 67% tool-call accuracy. After 30 days of skill accumulation this reaches approximately 80%. After a fine-tuning cycle on scored trajectories, well-maintained deployments reach 89–94%.

For ERP-connected agents — Odoo, SAP, or similar — malformed domain filter syntax accounts for over half of all tool-call failures in new deployments. The agent knows which tool to call; it passes the parameters incorrectly.

2. Goal completion rate

Goal completion rate measures the percentage of tasks where the agent accomplished the stated objective in a way a domain expert would approve. This is distinct from tool-call accuracy: an agent can call every tool correctly and still fail to complete the task — if it skipped a required step, generated output that missed the point, or completed a subtask but missed the actual goal.

Measuring goal completion requires more than trace replay. It requires either deterministic verifiers (the record was created with the required fields, the test passed) or LLM-as-judge scoring (a second model evaluates the final output against defined criteria). Most observability platforms stop at tracing and never answer this question.

3. Error recovery rate

Error recovery rate is the percentage of failed tool calls or task steps where the agent successfully diagnosed the error and completed the task via an alternative path, without human intervention. An agent that retries the same failed call with identical parameters is not recovering — it is looping. An agent that detects the failure, revises its approach, and completes the task has recovered.

This dimension is particularly sensitive to model quality. New deployments on general-purpose models typically show 45–55% error recovery. After targeted fine-tuning on high-quality recovery trajectories, this rises to 70–80%, because the model has internalised what a productive recovery looks like on the domain’s specific error types.

4. Skill coverage

Skill coverage measures the fraction of task types the agent encounters that have a relevant skill file loaded. An agent with low skill coverage is reasoning from scratch on every task — expensive, slow, and inconsistent. An agent with high coverage is loading a pre-written procedure for most tasks and executing reliably.

In Hermes deployments, skill files are stored as Markdown documents in ~/.hermes/skills/ and load automatically into the agent’s context window when it encounters a relevant task. After 30 days of active operation, a well-maintained deployment typically has skill coverage above 70% for recurring task types.

5. Memory utilisation

Memory utilisation measures whether the agent actually uses the persistent knowledge it has accumulated. An agent that re-asks the user for their name, re-discovers the database schema, or ignores stored user preferences on every run has poor memory utilisation — it is wasting turns and eroding trust. This dimension checks that MEMORY.md and USER.md content is being applied, not just stored.

Together, these five dimensions form the BenchGen Quality Score — a composite 0–100 metric that tracks agent health over time and flags regressions before users encounter them.


What agent trajectories are and why they are your raw material

Every quality metric described above is computed from trajectories. An Atropos trajectory is a JSONL record of every decision an agent made during one task run: each reasoning step, every tool call with its arguments, every tool result, and the final outcome. For Hermes deployments, the Atropos RL framework records these automatically to ~/.hermes/trajectories/.

Trajectories are the raw material of evaluation for three reasons. First, they are the only complete record of what the agent actually did — not what you expected it to do, not what a unit test covers, but the exact sequence of decisions the agent made on a real task with real inputs. Second, they are already being generated: every Hermes agent running in production is accumulating trajectory data. The data exists; measurement is a choice to use it. Third, trajectories are training data: the highest-quality trajectories in your corpus are also the best candidates for fine-tuning the model behind the agent.

The discipline that separates teams who progress from teams who stay stuck is treating trajectories as assets, not logs. Logs are for debugging incidents. Trajectories are for systematically improving an agent that is running every day.


The two-layer learning architecture: skills versus weights

When people say their Hermes agent “learns,” they are usually describing two completely different mechanisms that improve performance in different ways. Understanding the distinction is the prerequisite for routing fixes to the right layer.

The two-layer learning architecture separates agent improvement into:

Layer 1 — Skill and memory evolution

After completing a task, the agent writes a structured skill document describing the procedure it used, and updates its memory files with new facts and preferences. On the next similar task, these files load into the model’s context window, giving it explicit guidance. Layer 1 is instant, free, auditable (every skill is a readable Markdown file), and works with any model including closed-source APIs.

The analogy is a consultant’s notebook. The consultant gets better at your specific work without becoming a fundamentally smarter person. Layer 1 improves performance on known task types, for this specific deployment, with this specific user.

Layer 2 — LLM fine-tuning

Layer 2 updates the model’s weights using recorded Atropos trajectories. This changes what the model knows natively — correct tool-call formats, domain vocabulary, error recovery patterns — so it performs well even without skills loaded. It requires open-weight models (Llama 4, Qwen 3, Gemma 4, Hermes 3), GPU time, and runs offline in batches.

The analogy is the consultant doing an MBA. Slow, expensive, permanent, and transferable to every future engagement.

The critical practical question is which layer a failure belongs to. A failure that recurs after a correct skill exists is a weights problem — it lives in the model and only fine-tuning removes it. A failure on a task type with no skill is a coverage problem — creating the skill fixes it immediately. Evaluation must make this distinction, because routing fixes to the wrong layer wastes either GPU budget or weeks of skill iteration.

Layer 1 learning works with any model. Layer 2 requires open weights. UBOS’s platform supports both patterns: managed agent deployments benefit from skill accumulation immediately, and teams with BYOC infrastructure can apply full fine-tuning workflows on their own GPU capacity.


The eval-to-train loop: turning evaluation into improvement

The eval-to-train loop is a closed cycle where production agent trajectories are scored, filtered, and fed into fine-tuning — producing a better model that generates better trajectories. It is what turns evaluation from a measurement activity into an improvement mechanism.

The loop has four stages:

  1. Capture. The agent runs in production and Atropos records every task as a trajectory. A 30-day corpus for an active Hermes deployment typically accumulates 500–2,000 trajectories, depending on task volume.
  2. Score. An evaluation layer — running BenchGen‘s quality harness — scores each trajectory across the five quality dimensions. Each trajectory receives a composite score and per-dimension breakdowns.
  3. Filter. Trajectory filtering selects the top 60–70% by quality score. Failed runs, looping behaviour, and hallucinated tool calls are discarded. The filtered set is exported in ShareGPT format for compatibility with standard fine-tuning pipelines.
  4. Train. A LoRA fine-tuning run on the filtered trajectories produces a model adapter that improves tool-call accuracy, domain vocabulary, and error recovery. The adapter is deployed back to the Hermes endpoint. The cycle repeats.

Why is the filtering step critical? Because training on unfiltered production data — which includes every failure, every hallucinated tool name, every loop — degrades the model rather than improving it. The evaluation layer is not reporting overhead; it is the quality gate that makes production data safe to learn from.

In practice, a typical deployment shows the following progression:

  • Day 1 (base model, no skills): ~65% tool-call accuracy, ~60% goal completion
  • Day 30 (skills accumulated, Layer 1 active): ~80% tool-call accuracy, ~74% goal completion
  • After first eval-to-train cycle (Layer 2): ~91% tool-call accuracy, ~87% goal completion
  • Month 3+ (both layers compounding): 93–96% across primary task types

Each fine-tuning cycle compounds: a better model generates cleaner trajectories, which produce higher-quality training data, which produce a better next fine-tune. This is why evaluation and improvement are not sequential — they are the same loop.


The five most common failure patterns and how to fix them

Trajectory analysis across production Hermes deployments consistently surfaces the same five failure patterns. Each has a distinct cause and a distinct fix.

1. Argument format errors (Layer 2 problem)

The agent calls the correct tool but passes parameters in the wrong format. For ERP integrations, this is almost always domain filter syntax — passing a filter as a plain string instead of the expected data structure. A skill correction helps (it loads the right format into context), but if the model overrides the skill, the error lives in the weights and requires fine-tuning to eliminate.

Metric to watch: Argument correctness broken down by tool name. Concentrate fine-tuning data on the tools with the lowest per-tool argument correctness scores.

2. Loop-on-empty-result (skill problem)

The agent calls a search tool, receives zero results, and retries the identical call without modifying the query or parameters. This is a skill-level issue: the agent has no procedure for handling empty results. Create a skill that documents the correct recovery sequence — try alternative field names, broaden the filter, or report that no records exist.

Metric to watch: Error recovery rate. Loops appear as repeated identical calls in the trajectory, flagged automatically in evaluation.

3. Skill staleness after environment changes

Skills written when an API or schema was different now encode wrong field names, deprecated endpoints, or outdated step sequences. The agent follows the skill, the environment rejects the call, and the mismatch continues until someone manually reviews the skill library. The Hermes Skill Curator detects and flags stale skills based on failure signals from production trajectories.

Metric to watch: Skill coverage combined with tool-call failure rates on tasks that have associated skills. A skill exists but accuracy is low — likely stale.

4. Model-switch regression

Quality drops after switching to a smaller or cheaper model to reduce inference costs. Tool-call accuracy degrades first, usually 10–18 percentage points. The team discovers this from user complaints rather than from monitoring. The fix is establishing a regression baseline before the switch and running an automated comparison immediately after, using trajectory replay against the held-out test set.

Metric to watch: Delta on composite quality score within 24 hours of any model change. A drop of more than 5 points on the composite score, or more than 10 points on any single dimension, should trigger a rollback review.

5. Hallucinated tool names and field names

The agent invents tool names, API paths, or field names that do not exist in the connected system. This is distinct from passing wrong arguments to a real tool — hallucinated calls target endpoints that do not exist at all. Agent hallucination rate is detectable deterministically, since every real tool and schema is known. Fine-tuning on trajectories from the actual environment reliably reduces hallucination below 1% because real tool names enter the model weights.

Metric to watch: Hallucination rate per tool namespace (e.g. all calls to an Odoo MCP server versus all valid Odoo tools declared in the server’s schema). A high rate on a specific namespace often means that namespace’s tools are not represented in the model’s training data.


Evaluation for enterprise: certification and compliance

Individual quality metrics answer “is this agent working?” Enterprise deployment requires answering a harder question: “is this agent reliable enough for us to be accountable for it?”

That question has a specific artefact: agent certification. A certification is a formal, renewable attestation stating which evaluation suite was run, over what period, against what version of the agent and model, with what results, and when it must be re-validated.

Certification packages include:

  • The evaluation methodology and scenario set
  • The composite production readiness score with per-dimension breakdowns
  • Regression stability record across the evaluation window (how many model changes occurred, how the score moved)
  • Compliance mapping for regulated industries — data-handling checks, audit-trail completeness, KVKK or GDPR requirements for sovereign AI deployments
  • A stated validity window and the conditions that trigger re-certification (model switch, major skill library change, new MCP integration)

For industries like finance, healthcare, and government — where UBOS deployments frequently operate — certification is not bureaucracy. It is the artefact that moves a project from the innovation lab into production budgets. Without it, the agent exists in a permanent pilot state.

Vertical benchmark packs extend this further. A generic evaluation checks whether the agent calls tools correctly. A vertical pack checks whether it handles the domain’s specific requirements: closed accounting periods, tax field format validation, inventory unit conversions, or data-classification levels for government workflows. These encoded domain requirements are what make an evaluation report meaningful to a compliance officer rather than only to an engineer.


How to start evaluating your agent today

The practical path from “we have an agent running” to “we have a measured, improving agent” has three phases.

Phase 1 — Establish the baseline (Week 1)

Run your first evaluation scan on existing trajectory data. If you are running Hermes, Atropos has already been recording trajectories to ~/.hermes/trajectories/. Install BenchGen‘s evaluation CLI:

pip install benchgen
benchgen scan

BenchGen auto-detects the Hermes installation, reads the trajectory corpus, scores each record across the five quality dimensions, and produces a report identifying your top failure patterns with specific examples. First report takes under five minutes. No API keys, no account required for the initial scan.

The baseline number is not the point. The point is the failure pattern analysis: which tools are failing most, which skills are stale, and whether failures are concentrated in specific task types or distributed broadly. That analysis tells you exactly where to spend the next two weeks.

Phase 2 — Fix the highest-impact issues (Weeks 2–4)

Address failures in this order:

  1. Skill gaps and staleness — fastest to fix, immediate impact, no GPU required
  2. Argument format errors that recur after skill corrections — these are Layer 2 problems; flag the relevant trajectories for fine-tuning
  3. Recovery loops — create skills with explicit empty-result and error-handling procedures

Run a second scan after two weeks. The delta — quality score before versus after — gives you the first concrete measurement of improvement. Share that delta with whoever needs to see progress.

Phase 3 — Set up continuous monitoring (Month 2 onwards)

Move to automated weekly evaluation runs. Configure regression baselines so any model switch or major skill update triggers an automated comparison. When you have 500+ high-quality trajectories, run your first fine-tuning cycle. The before/after comparison — same held-out scenarios, old model versus fine-tuned adapter — is typically the artefact that moves a project from “engineering experiment” to “approved production system.”

For enterprise deployments through UBOS, this workflow integrates directly with UBOS’s agentic workflow tooling — evaluation runs can be triggered automatically on deployment events, and certification reports can be routed to the relevant stakeholders as part of the release pipeline.


Frequently asked questions

What is a good AI agent quality score?

A new agent running a general-purpose model on domain-specific tasks typically scores 65–72 out of 100 on the BenchGen Quality Score. A score above 80 after 30 days of operation indicates healthy Layer 1 learning (skills accumulating, memory being used). A score above 88 indicates the agent is a candidate for the first fine-tuning cycle. Scores above 92 on primary task types are achievable after one to two eval-to-train cycles and represent a strong production-ready baseline for enterprise use.

What is the difference between agent evaluation and agent observability?

Agent observability is recording what the agent did — every tool call, every reasoning step, every outcome. Evaluation is judging whether what it did was good. Most platforms do the first. The second requires defined success criteria, verifiers, and outcome scoring — and is where the 37% gap between benchmark scores and production performance hides.

How many trajectories do you need before fine-tuning?

A practical minimum is 300–500 filtered trajectories (after discarding low-quality runs). A 30-day corpus from an active Hermes deployment typically yields 500–2,000 raw trajectories, filtering down to 300–1,200 eligible training records. Smaller datasets can produce improvements, but the improvement is noisier and the risk of overfitting to a narrow task distribution is higher. Quality of training data matters more than volume: 400 clean, diverse, high-scoring trajectories outperform 2,000 unfiltered ones.

Does evaluation work with closed-source models like GPT-4 or Claude?

Yes — trajectory scoring and all five quality dimensions work regardless of which model powers the agent, including closed APIs. However, fine-tuning (Layer 2) requires open-weight models. A Hermes deployment on Claude can improve through skills and memory accumulation but cannot update the model’s weights. If fine-tuning is part of the roadmap, the model choice should be made accordingly: Llama 4, Qwen 3, Gemma 4, and Hermes 3 are the current primary targets.

What causes quality to drop after switching models?

Model switches are the most common trigger for agent regression. Tool-call accuracy degrades first, typically 10–18 points, because the new model’s native knowledge of the domain’s tool schemas differs from the previous model’s. Skills partially compensate by loading correct formats into context, but if the gap is large, fine-tuning the new model on existing high-quality trajectories is the reliable fix. Always establish a regression baseline before switching and run an automated replay comparison within 24 hours of the change.


Conclusion

AI agent evaluation is not a one-time gate before launch. It is the continuous practice that separates the 12% of agents that reach production from the 88% that stay in perpetual pilot. The raw material — trajectories — is already being generated by every Hermes agent in production. The choice is whether to use it.

The eval-to-train loop closes the most important gap in agent development: between knowing an agent failed and improving it systematically. Evaluation scores tell you what is wrong. Filtered trajectories tell you what good looks like. Fine-tuning makes good the default.

For teams building on UBOS’s enterprise agent platform, the workflow is concrete: deploy the agent, let Atropos accumulate trajectories, run BenchGen to score and filter, fine-tune on high-quality data, measure the improvement, and repeat. The agent that exists in month three is not the agent you deployed in month one. That progression is not accidental — it is the product of measurement.

Start with one scan. Visit benchgen.com/hermes for the Hermes-specific setup guide, or read the eval-to-train loop definition for the conceptual foundation. First quality report in under five minutes — no account required.


Andrii Bidochko

CEO/CTO at UBOS

Welcome! I'm the CEO/CTO of UBOS.tech, a low-code/no-code application development platform designed to simplify the process of creating custom Generative AI solutions. With an extensive technical background in AI and software development, I've steered our team towards a single goal - to empower businesses to become autonomous, AI-first organizations.

Sign up for our newsletter

Stay up to date with the roadmap progress, announcements and exclusive discounts feel free to sign up with your email.

Sign In

Register

Reset Password

Please enter your username or email address, you will receive a link to create a new password via email.