- Updated: March 21, 2026
- 6 min read
Hands‑On Guide: Creating and Running OpenClaw Agent Evaluation Jobs
OpenClaw agent evaluation jobs can be created, configured, executed, and analyzed entirely from the UBOS platform, giving developers a seamless end‑to‑end workflow without leaving their browser.
Introduction
OpenClaw is a lightweight framework for benchmarking AI agents against custom datasets and performance metrics. Its evaluation agents let you run repeatable tests, compare model versions, and surface regressions automatically. This hands‑on guide is written for technical developers and data scientists who already have a basic understanding of OpenClaw and want to leverage the UBOS platform to orchestrate evaluation jobs at scale.
By the end of this article you will be able to:
- Set up the required environment and UBOS account.
- Create a new OpenClaw evaluation job through the UI and via JSON/YAML.
- Customize parameters, datasets, and metrics with code snippets.
- Submit, monitor, and troubleshoot jobs using the UBOS CLI/API.
- Extract logs, visualize results, and iterate on your agent design.
Prerequisites
Required tools and environment
Before diving in, make sure you have the following installed on your workstation:
- Python ≥ 3.9 (recommended 3.11)
- pip and virtualenv
- Git ≥ 2.30
- Docker ≥ 20.10 (for containerized OpenClaw agents)
- UBOS CLI – install with
pip install ubos-cli
Access to the UBOS platform
You need an active UBOS account. If you haven’t signed up yet, visit the UBOS homepage and sign up for the free developer tier. Once logged in, you’ll see the UBOS platform overview dashboard where you can manage projects, view pricing, and access the Workflow automation studio.
Creating an Evaluation Job
Step‑by‑step UI walkthrough
- Log in to UBOS and navigate to Projects → New Project. Name it OpenClaw-Eval-Demo.
- Inside the project, click Add Integration → OpenClaw. The UI will prompt you to select a pre‑built OpenClaw agent template or upload your own Docker image.
- Choose the AI Article Copywriter template as a quick start – it already contains a sample evaluation script.
- Click Create Evaluation Job. A modal appears where you can paste a JSON or YAML configuration (see next section).
- Save the job; it now appears in the Jobs list with a unique identifier (e.g., job-7f3b9c).
Sample JSON configuration
{
  "job_name": "OpenClaw-Agent-Benchmark-v1",
  "agent_image": "docker.io/ubos/openclaw-agent:latest",
  "datasets": [
    {
      "name": "qa-benchmark-v2",
      "uri": "s3://ubos-datasets/qa-benchmark-v2.json"
    }
  ],
  "metrics": ["accuracy", "latency", "token_usage"],
  "environment": {
    "PYTHONPATH": "/app/src",
    "MAX_CONCURRENCY": 4
  },
  "timeout_seconds": 7200
}
Equivalent YAML configuration
job_name: OpenClaw-Agent-Benchmark-v1
agent_image: docker.io/ubos/openclaw-agent:latest
datasets:
  - name: qa-benchmark-v2
    uri: s3://ubos-datasets/qa-benchmark-v2.json
metrics:
  - accuracy
  - latency
  - token_usage
environment:
  PYTHONPATH: /app/src
  MAX_CONCURRENCY: 4
timeout_seconds: 7200
Configuring the Job
Setting parameters, datasets, and metrics
OpenClaw’s configuration schema is deliberately flat to keep jobs reproducible. Below are the most common knobs you’ll adjust:
- agent_image: Docker image containing your model and inference script.
- datasets: One or more JSON/CSV files hosted on S3, GCS, or UBOS’s built‑in storage.
- metrics: Built‑in metrics (accuracy, latency, token_usage) or custom Python callbacks.
- environment: Runtime variables that your container reads at start‑up.
- timeout_seconds: Hard limit for the entire job; useful for CI pipelines.
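Because a typo in the config usually only surfaces after the job has been queued, it can be worth validating the file locally first. The sketch below is not part of the UBOS tooling; it simply checks a YAML config against the keys used in this guide (the script name check_config.py and the key list are assumptions, not an official schema) and requires PyYAML:
import sys
import yaml  # pip install pyyaml

# Keys used by the sample configurations in this guide -- not an official schema.
REQUIRED_KEYS = {"job_name", "agent_image", "datasets", "metrics"}

def validate(path):
    with open(path) as f:
        config = yaml.safe_load(f)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        sys.exit(f"Missing keys: {', '.join(sorted(missing))}")
    for dataset in config["datasets"]:
        if "uri" not in dataset:
            sys.exit(f"Dataset {dataset.get('name', '<unnamed>')} has no uri")
    print(f"{path} looks valid: {len(config['datasets'])} dataset(s), "
          f"{len(config['metrics'])} metric(s)")

if __name__ == "__main__":
    validate(sys.argv[1] if len(sys.argv) > 1 else "job-config.yaml")
Running python check_config.py job-config.yaml before submitting catches missing keys without burning queue time.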
Code snippet for custom configuration
If you need to compute a bespoke metric (e.g., BLEU‑score for translation agents), add a Python callback in your Docker image and reference it in the JSON:
{
  "custom_metrics": [
    {
      "name": "bleu_score",
      "module": "metrics.bleu",
      "function": "compute_bleu"
    }
  ]
}
Inside metrics/bleu.py you would implement compute_bleu(predictions, references). UBOS will automatically import the module and aggregate the results.
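As a concrete illustration, here is one way metrics/bleu.py could look. It assumes predictions and references arrive as parallel lists of strings (confirm the exact contract in the OpenClaw metric documentation) and relies on NLTK, which would need to be installed in your agent image:
# metrics/bleu.py -- sketch of a custom metric callback (assumed signature).
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

def compute_bleu(predictions, references):
    """Return a corpus-level BLEU score between 0 and 1."""
    hypotheses = [prediction.split() for prediction in predictions]
    # corpus_bleu expects a list of reference token lists per hypothesis.
    reference_lists = [[reference.split()] for reference in references]
    smoothing = SmoothingFunction().method1  # avoids zero scores on short outputs
    return corpus_bleu(reference_lists, hypotheses, smoothing_function=smoothing)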
Running the Job
Submitting the job via UBOS CLI
Open a terminal, activate your virtual environment, and run:
ubos login # authenticate
ubos job create --project OpenClaw-Eval-Demo --config ./job-config.yaml
The CLI returns a job ID. You can also submit via the REST API:
curl -X POST https://api.ubos.tech/v1/jobs \
-H "Authorization: Bearer $UBOS_TOKEN" \
-H "Content-Type: application/json" \
-d @job-config.json
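If you prefer to stay in Python, the same request can be sent with the requests library. The endpoint and headers mirror the curl call above; the shape of the response body (including the job ID field) should be confirmed against the UBOS API reference:
import json
import os
import requests

with open("job-config.json") as f:
    payload = json.load(f)

response = requests.post(
    "https://api.ubos.tech/v1/jobs",
    headers={"Authorization": f"Bearer {os.environ['UBOS_TOKEN']}"},
    json=payload,  # requests sets the Content-Type: application/json header for us
    timeout=30,
)
response.raise_for_status()
print("Job created:", response.json())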
Monitoring progress
UBOS provides three real‑time views:
- Dashboard: Visual progress bar and status badge (Queued, Running, Completed, Failed).
- Logs: Stream stdout/stderr from the container directly in the UI.
- Metrics Explorer: Auto‑generated charts for each metric defined in the config.
For CLI users, the following command tails logs:
ubos job logs --id job-7f3b9c --follow
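In CI it is often useful to keep a copy of the streamed logs as a build artifact. A small wrapper around the documented ubos job logs command can print the stream and persist it at the same time (the script name and log-file layout below are just one possible convention):
import subprocess
import sys

job_id = sys.argv[1] if len(sys.argv) > 1 else "job-7f3b9c"

with open(f"{job_id}.log", "w") as log_file:
    process = subprocess.Popen(
        ["ubos", "job", "logs", "--id", job_id, "--follow"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    for line in process.stdout:
        print(line, end="")   # live output for the terminal or CI console
        log_file.write(line)  # persisted copy for later inspection
    process.wait()

sys.exit(process.returncode)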
Analyzing Results
Accessing logs and metrics
When the job finishes, UBOS stores a JSON artifact under gs://ubos-artifacts/<job-id>/results.json. Download it with:
ubos artifact download --job job-7f3b9c --path results.json
Visualizing output with example scripts
Below is a minimal Python script that loads the results and creates a Matplotlib bar chart of the reported metric scores:
import json
import matplotlib.pyplot as plt

# Load the results artifact downloaded in the previous step.
with open('results.json') as f:
    data = json.load(f)

# Each metric entry is expected to expose a scalar 'value' field.
metrics = data['metrics']
labels = list(metrics.keys())
values = [metrics[label]['value'] for label in labels]

plt.figure(figsize=(8, 4))
plt.bar(labels, values, color=['#4F46E5', '#10B981'])
plt.title('OpenClaw Evaluation Summary')
plt.ylabel('Score')  # note: metrics such as latency use a different scale than accuracy
plt.show()
The chart can be embedded directly into the UBOS Web app editor on the UBOS dashboard for stakeholder reporting.
Troubleshooting Tips
Common errors and how to resolve them
| Error | Cause | Fix |
|---|---|---|
| ImagePullBackOff | Docker image not found or private registry credentials missing. | Verify the agent_image tag and add a registry_secret in the job config. |
| TimeoutError | Job exceeded timeout_seconds or dataset too large. | Increase the timeout or split the dataset into smaller shards. |
| MetricNotFound | Custom metric module path incorrect. | Check the Python import path inside the Docker image; ensure the file is packaged. |
Debugging strategies
- Use ubos job logs --id <job-id> --tail 200 to fetch the last 200 lines.
- Enable debug mode in your container by setting ENV DEBUG=1 in the Dockerfile.
- Run the same configuration locally with docker run --rm -v $(pwd):/data <image> to isolate container‑level issues; a metric‑level debugging sketch follows below.
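For MetricNotFound errors specifically, it helps to exercise the metric callback on its own before rebuilding the image. The toy script below assumes the hypothetical metrics/bleu.py module sketched earlier and simply confirms that the import path and function signature line up:
# Run from the directory that contains the metrics/ package.
from metrics.bleu import compute_bleu

predictions = ["the cat sat on the mat"]
references = ["the cat is sitting on the mat"]

print("BLEU on toy data:", compute_bleu(predictions, references))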
Conclusion
Running OpenClaw evaluation jobs on UBOS transforms a traditionally manual benchmarking process into a repeatable, observable, and collaborative workflow. You now have a complete pipeline—from environment setup to result visualization—ready to be integrated into CI/CD pipelines or product dashboards.
Next steps you might consider:
- Automate nightly regression runs with the UBOS partner program for dedicated support.
- Explore the Enterprise AI platform by UBOS for multi‑tenant evaluation across dozens of agents.
- Leverage AI marketing agents to automatically generate performance reports for stakeholders.
References
- OpenClaw official documentation – openclaw.ai/docs
- UBOS CLI reference – UBOS CLI Docs
- Docker best practices – Docker Developer Guide
Ready to accelerate your AI agent testing? Sign up for a free UBOS account today and start building evaluation jobs in minutes.