Carlos
  • Updated: March 21, 2026
  • 6 min read

Hands‑On Guide: Creating and Running OpenClaw Agent Evaluation Jobs

OpenClaw agent evaluation jobs can be created, configured, executed, and analyzed entirely from the UBOS platform, giving developers a seamless end‑to‑end workflow without leaving their browser.

Introduction

OpenClaw is a lightweight framework for benchmarking AI agents against custom datasets and performance metrics. Its evaluation agents let you run repeatable tests, compare model versions, and surface regressions automatically. This hands‑on guide is written for technical developers and data scientists who already have a basic understanding of OpenClaw and want to leverage the UBOS platform to orchestrate evaluation jobs at scale.

By the end of this article you will be able to:

  • Set up the required environment and UBOS account.
  • Create a new OpenClaw evaluation job through the UI and via JSON/YAML.
  • Customize parameters, datasets, and metrics with code snippets.
  • Submit, monitor, and troubleshoot jobs using the UBOS CLI/API.
  • Extract logs, visualize results, and iterate on your agent design.

Prerequisites

Required tools and environment

Before diving in, make sure you have the following installed on your workstation:

  • Python ≥ 3.9 (recommended 3.11)
  • pip and virtualenv
  • Git ≥ 2.30
  • Docker ≥ 20.10 (for containerized OpenClaw agents)
  • UBOS CLI – install with pip install ubos-cli
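Before creating jobs, it is worth confirming the toolchain is actually present. The following is a minimal sketch that checks the Python version and looks for the `git`, `docker`, and `ubos` executables on `PATH` (the `ubos` binary name is assumed from the CLI commands used later in this guide):

```python
import shutil
import sys

def check_environment(min_python=(3, 9)):
    """Return a list of missing prerequisites; an empty list means you're ready."""
    missing = []
    if sys.version_info < min_python:
        missing.append("Python >= " + ".".join(map(str, min_python)))
    # Executables the walkthrough below relies on.
    for tool in ("git", "docker", "ubos"):
        if shutil.which(tool) is None:
            missing.append(tool + " (not found on PATH)")
    return missing

if __name__ == "__main__":
    problems = check_environment()
    print("All prerequisites found." if not problems else "Missing: " + ", ".join(problems))
```

Run it once per workstation; anything it reports missing maps directly to an item in the list above.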

Access to the UBOS platform

You need an active UBOS account. If you haven’t signed up yet, visit the UBOS homepage and create a free developer tier. Once logged in, you’ll see the UBOS platform overview dashboard where you can manage projects, view pricing, and access the Workflow automation studio.

Creating an Evaluation Job

Step‑by‑step UI walkthrough

  1. Log in to UBOS and navigate to Projects → New Project. Name it OpenClaw‑Eval‑Demo.
  2. Inside the project, click Add Integration → OpenClaw. The UI will prompt you to select a pre‑built OpenClaw agent template or upload your own Docker image.
  3. Choose the AI Article Copywriter template as a quick start – it already contains a sample evaluation script.
  4. Click Create Evaluation Job. A modal appears where you can paste a JSON or YAML configuration (see next section).
  5. Save the job; it now appears in the Jobs list with a unique identifier (e.g., job-7f3b9c).

Sample JSON configuration

{
  "job_name": "OpenClaw-Agent-Benchmark-v1",
  "agent_image": "docker.io/ubos/openclaw-agent:latest",
  "datasets": [
    {
      "name": "qa-benchmark-v2",
      "uri": "s3://ubos-datasets/qa-benchmark-v2.json"
    }
  ],
  "metrics": ["accuracy", "latency", "token_usage"],
  "environment": {
    "PYTHONPATH": "/app/src",
    "MAX_CONCURRENCY": 4
  },
  "timeout_seconds": 7200
}

Equivalent YAML configuration

job_name: OpenClaw-Agent-Benchmark-v1
agent_image: docker.io/ubos/openclaw-agent:latest
datasets:
  - name: qa-benchmark-v2
    uri: s3://ubos-datasets/qa-benchmark-v2.json
metrics:
  - accuracy
  - latency
  - token_usage
environment:
  PYTHONPATH: /app/src
  MAX_CONCURRENCY: 4
timeout_seconds: 7200
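Because the configuration is plain JSON, you can also generate it programmatically, which is handy when a CI pipeline parameterizes the job name or dataset. A minimal sketch that emits the same configuration shown above as job-config.json:

```python
import json

# The evaluation job from the sample configuration, expressed as a dict.
job_config = {
    "job_name": "OpenClaw-Agent-Benchmark-v1",
    "agent_image": "docker.io/ubos/openclaw-agent:latest",
    "datasets": [
        {
            "name": "qa-benchmark-v2",
            "uri": "s3://ubos-datasets/qa-benchmark-v2.json",
        }
    ],
    "metrics": ["accuracy", "latency", "token_usage"],
    "environment": {"PYTHONPATH": "/app/src", "MAX_CONCURRENCY": 4},
    "timeout_seconds": 7200,
}

# Serialize for the CLI / REST submission step later in this guide.
with open("job-config.json", "w") as f:
    json.dump(job_config, f, indent=2)
```

The resulting file can be passed directly to `ubos job create --config` or posted to the REST API.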

Configuring the Job

Setting parameters, datasets, and metrics

OpenClaw’s configuration schema is deliberately flat to keep jobs reproducible. Below are the most common knobs you’ll adjust:

  • agent_image: Docker image containing your model and inference script.
  • datasets: One or more JSON/CSV files hosted on S3, GCS, or UBOS’s built‑in storage.
  • metrics: Built‑in metrics (accuracy, latency, token_usage) or custom Python callbacks.
  • environment: Runtime variables that your container reads at start‑up.
  • timeout_seconds: Hard limit for the entire job; useful for CI pipelines.
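A failed submission costs a queue round-trip, so a local sanity check before submitting pays off. The sketch below validates the knobs listed above; it is a client-side convenience, not the platform's own schema validation, and the `ubos://` scheme for built-in storage is an assumption:

```python
def validate_job_config(config):
    """Return a list of human-readable problems; empty means the config looks sane."""
    problems = []
    # Fields the walkthrough treats as required.
    for key in ("job_name", "agent_image", "datasets", "metrics"):
        if key not in config:
            problems.append("missing required field: " + key)
    # Datasets must live on a supported storage backend.
    for ds in config.get("datasets", []):
        uri = ds.get("uri", "")
        if not uri.startswith(("s3://", "gs://", "ubos://")):
            problems.append("unsupported dataset uri: " + repr(uri))
    # A non-positive timeout would fail immediately.
    timeout = config.get("timeout_seconds", 0)
    if not isinstance(timeout, int) or timeout <= 0:
        problems.append("timeout_seconds must be a positive integer")
    return problems
```

Call it on the parsed JSON/YAML before `ubos job create` and abort if the list is non-empty.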

Code snippet for custom configuration

If you need to compute a bespoke metric (e.g., BLEU‑score for translation agents), add a Python callback in your Docker image and reference it in the JSON:

{
  "custom_metrics": [
    {
      "name": "bleu_score",
      "module": "metrics.bleu",
      "function": "compute_bleu"
    }
  ]
}

Inside metrics/bleu.py you would implement compute_bleu(predictions, references). UBOS will automatically import the module and aggregate the results.
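As a sketch of what metrics/bleu.py might contain: the signature `compute_bleu(predictions, references)` is taken from the description above, but the scoring itself is a simplified corpus BLEU (modified n-gram precision up to 4-grams with a brevity penalty), not a drop-in replacement for a reference implementation such as sacreBLEU:

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def compute_bleu(predictions, references, max_n=4):
    """Simplified corpus-level BLEU over parallel lists of strings."""
    precisions = []
    for n in range(1, max_n + 1):
        matched, total = 0, 0
        for pred, ref in zip(predictions, references):
            p, r = _ngrams(pred.split(), n), _ngrams(ref.split(), n)
            matched += sum((p & r).values())  # clipped n-gram matches
            total += sum(p.values())
        precisions.append(matched / total if total else 0.0)
    if min(precisions) == 0:
        return 0.0  # geometric mean collapses to zero
    # Brevity penalty discourages overly short predictions.
    pred_len = sum(len(p.split()) for p in predictions)
    ref_len = sum(len(r.split()) for r in references)
    bp = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / max(pred_len, 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Package the file into your Docker image under /app/src/metrics/ so the `metrics.bleu` module path in the config resolves at import time.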

Running the Job

Submitting the job via UBOS CLI

Open a terminal, activate your virtual environment, and run:

ubos login   # authenticate
ubos job create --project OpenClaw-Eval-Demo --config ./job-config.yaml

The CLI returns a job ID. You can also submit via the REST API:

curl -X POST https://api.ubos.tech/v1/jobs \
  -H "Authorization: Bearer $UBOS_TOKEN" \
  -H "Content-Type: application/json" \
  -d @job-config.json
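The same REST call can be made from Python. The sketch below builds the request with only what the curl command above shows (endpoint, bearer token, JSON body); sending it requires a valid token, so the actual `urlopen` call is left commented:

```python
import json
import urllib.request

UBOS_JOBS_URL = "https://api.ubos.tech/v1/jobs"

def build_job_request(config, token):
    """Build the POST request that submits an evaluation job."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        UBOS_JOBS_URL,
        data=body,
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To submit for real (needs network access and a valid $UBOS_TOKEN):
# with urllib.request.urlopen(build_job_request(config, token)) as resp:
#     print(json.load(resp))
```

Using only the standard library keeps the snippet dependency-free; swap in `requests` if it is already part of your stack.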

Monitoring progress

UBOS provides three real‑time views:

  • Dashboard: Visual progress bar and status badge (Queued, Running, Completed, Failed).
  • Logs: Stream stdout/stderr from the container directly in the UI.
  • Metrics Explorer: Auto‑generated charts for each metric defined in the config.

For CLI users, the following command tails logs:

ubos job logs --id job-7f3b9c --follow
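In scripted pipelines you usually want to block until the job reaches a terminal state. A small polling helper, sketched below, takes an injectable `fetch_status` callable (for example, a wrapper around the CLI or REST API, whose exact status endpoint is not documented here); the status strings match the dashboard badges listed above:

```python
import time

def wait_for_job(fetch_status, job_id, poll_seconds=10, timeout=7200):
    """Poll fetch_status(job_id) until it returns a terminal state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(job_id)
        if status in ("Completed", "Failed"):
            return status
        # Still Queued or Running; wait before the next poll.
        time.sleep(poll_seconds)
    raise TimeoutError(job_id + " not terminal after " + str(timeout) + "s")
```

A CI step can then fail the build whenever `wait_for_job(...)` returns "Failed".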

Analyzing Results

Accessing logs and metrics

When the job finishes, UBOS stores a JSON artifact under gs://ubos-artifacts/<job-id>/results.json. Download it with:

ubos artifact download --job job-7f3b9c --path results.json

Visualizing output with example scripts

Below is a minimal Python script that loads the results and creates a Matplotlib bar chart for each metric in the results:

import json
import matplotlib.pyplot as plt

# Load the results artifact downloaded in the previous step.
with open('results.json') as f:
    data = json.load(f)

# Each metric name maps to an object with a 'value' field.
metrics = data['metrics']
labels = list(metrics.keys())
values = [metrics[l]['value'] for l in labels]

# One bar (and one color) per metric in the config.
plt.figure(figsize=(8, 4))
plt.bar(labels, values, color=['#4F46E5', '#10B981', '#F59E0B'])
plt.title('OpenClaw Evaluation Summary')
plt.ylabel('Score')
plt.show()

The chart can be embedded directly into the UBOS web app editor on the UBOS dashboard for stakeholder reporting.

Troubleshooting Tips

Common errors and how to resolve them

  • ImagePullBackOff
    Cause: Docker image not found or private registry credentials missing.
    Fix: Verify the agent_image tag and add a registry_secret in the job config.
  • TimeoutError
    Cause: Job exceeded timeout_seconds or dataset too large.
    Fix: Increase the timeout or split the dataset into smaller shards.
  • MetricNotFound
    Cause: Custom metric module path incorrect.
    Fix: Check the Python import path inside the Docker image; ensure the file is packaged.

Debugging strategies

  • Use ubos job logs --id <job-id> --tail 200 to fetch the last 200 lines.
  • Enable debug mode in your container by setting ENV DEBUG=1 in the Dockerfile.
  • Run the same configuration locally with docker run --rm -v $(pwd):/data <image> to isolate container‑level issues.

Conclusion

Running OpenClaw evaluation jobs on UBOS transforms a traditionally manual benchmarking process into a repeatable, observable, and collaborative workflow. You now have a complete pipeline—from environment setup to result visualization—ready to be integrated into CI/CD pipelines or product dashboards.

Next steps you might consider:

  • Wire job submission into your CI/CD pipeline via the UBOS CLI or REST API.
  • Add custom Python metrics for domain-specific benchmarks.
  • Explore the Workflow automation studio to chain evaluation jobs with other UBOS integrations.

Ready to accelerate your AI agent testing? Sign up for a free UBOS account today and start building evaluation jobs in minutes.


