- Updated: April 6, 2026
- 6 min read
# Claude AI Benchmarking with mdarena: a New Open-Source Developer Tool on GitHub
## Overview
mdarena (HudsonGri/mdarena, MIT licensed) benchmarks your CLAUDE.md against your own PRs. Most CLAUDE.md files are written blindly: research shows they often reduce agent success rates and cost 20%+ more tokens. mdarena lets you measure whether yours helps or hurts, on tasks from your actual codebase.
## Quick Start

```shell
pip install mdarena

# Mine 50 merged PRs into a test set
mdarena mine owner/repo --limit 50 --detect-tests

# Benchmark multiple CLAUDE.md files + baseline (no context)
mdarena run -c claude_v1.md -c claude_v2.md -c agents.md

# See who wins
mdarena report
```

## How It Works

- `mdarena mine` -> Fetch merged PRs, filter, and build the task set; auto-detect test commands from CI/package files
- `mdarena run` -> For each task x condition:
  - Checkout the repo at the pre-PR commit
  - Baseline: all CLAUDE.md files stripped
  - Context: inject CLAUDE.md, let Claude discover it
  - Run tests if available, capture the git diff
- `mdarena report` -> Compare patches against gold (the actual PR diff):
  - Test pass/fail (same as SWE-bench)
  - File/hunk overlap, cost, tokens
  - Statistical significance (paired t-test)

## Test Execution

mdarena can run your repo's actual tests to grade agent patches, the same way SWE-bench does it.

```shell
# Auto-detect from CI/CD
mdarena mine owner/repo --detect-tests

# Or specify manually
mdarena mine owner/repo --test-cmd "make test" --setup-cmd "npm install"
```

It parses `.github/workflows/*.yml`, `package.json`, `pyproject.toml`, `Cargo.toml`, and `go.mod`. When tests aren't available, it falls back to diff overlap scoring.

## Monorepo Support

Pass a directory to benchmark a full CLAUDE.md tree:

```shell
mdarena run -c ./configs-v1/ -c ./configs-v2/
```

Each directory mirrors your repo structure. The baseline strips ALL CLAUDE.md and AGENTS.md files from the entire tree.

## Real-world Results

The author ran mdarena against a large production monorepo: 20 merged PRs, Claude Opus 4.6, three conditions (bare baseline, existing CLAUDE.md, hand-written alternative). Patches were graded against real test suites, not string matching, not LLM-as-judge.
Key findings:

- The existing CLAUDE.md improved test resolution by ~27% over the bare baseline.
- A consolidated alternative that merged all per-directory guidance into one file performed no better than no CLAUDE.md at all.
- On hard tasks, per-directory instruction files gave the agent targeted context, while the consolidated version introduced noise that caused regressions.

The winning CLAUDE.md wasn't the longest or most detailed. It was the one that put the right context in front of the agent at the right time.

## SWE-bench Compatible

```shell
# Import SWE-bench tasks
pip install datasets
mdarena load-swebench lite --limit 50
mdarena run -c my_claude.md

# Or export your tasks as SWE-bench JSONL
mdarena export-swebench
```

## Security

Only benchmark repositories you trust. mdarena executes code from the repos it benchmarks (test commands run via `shell=True`, and Claude Code runs with `--dangerously-skip-permissions`). Sandboxes are isolated temp directories under `/tmp`, but processes run as your user.

Benchmark integrity: because tasks come from historical PRs, the gold patch is in the repo's git history. Claude 4 Sonnet exploited this against SWE-bench by walking future commits via tags. mdarena prevents this with history-free checkouts: `git archive` exports a snapshot at `base_commit` into a fresh single-commit repo, so future commits don't exist in the object database at all. See `tests/test_isolated_checkout.py` for the integrity assertions.
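The history-free checkout described above can be reproduced with plain git. Below is a self-contained sketch of the idea, not mdarena's actual code: it builds a throwaway two-commit repo so the effect is visible, then exports only the tree at the base commit and re-initializes it as a single-commit repository.

```shell
set -e
# Build a throwaway source repo with a base commit and a later "gold patch" commit
src=$(mktemp -d); sandbox=$(mktemp -d)
git -C "$src" init -q
echo "v1" > "$src/app.txt"
git -C "$src" add app.txt
git -C "$src" -c user.email=a@b.c -c user.name=m commit -q -m base
BASE_COMMIT=$(git -C "$src" rev-parse HEAD)
echo "v2" > "$src/app.txt"              # the "gold patch" lands after base
git -C "$src" add app.txt
git -C "$src" -c user.email=a@b.c -c user.name=m commit -q -m gold

# History-free checkout: export only the tree at BASE_COMMIT, then
# re-init as a fresh single-commit repo. The gold commit does not
# exist in the new object database, so an agent cannot walk to it.
git -C "$src" archive --format=tar "$BASE_COMMIT" | tar -x -C "$sandbox"
cd "$sandbox"
git init -q
git add -A
git -c user.email=a@b.c -c user.name=m commit -q -m "snapshot at base commit"
git log --oneline | wc -l               # prints 1: a single commit
cat app.txt                             # prints v1: the pre-PR state
```

Unlike `git clone --depth 1`, which still fetches refs, `git archive` emits only the working tree, so the sandbox's history genuinely starts at the snapshot.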
## Prerequisites

- Python 3.11+
- `gh` CLI (authenticated)
- `claude` CLI (Claude Code)
- git

## Commands

| Command | Description |
| --- | --- |
| `mdarena mine` | Mine merged PRs into a task set |
| `mdarena mine --detect-tests` | Mine with auto-detected test extraction |
| `mdarena run -c file.md` | Benchmark a single CLAUDE.md |
| `mdarena run -c a.md -c b.md` | Compare multiple files head-to-head |
| `mdarena run --no-run-tests` | Skip test execution, diff overlap only |
| `mdarena report` | Analyze results, show comparison |
| `mdarena load-swebench [dataset]` | Import SWE-bench tasks |
| `mdarena export-swebench` | Export tasks as SWE-bench JSONL |

## Development

```shell
git clone https://github.com/HudsonGri/mdarena.git
cd mdarena
uv sync
uv run pytest
uv run ruff check src/
```

## Roadmap and License

See ROADMAP.md for planned work. MIT licensed; see LICENSE. The project is written entirely in Python by a single contributor, Hudson (HudsonGri).
## What is mdarena?

mdarena is an open-source framework that benchmarks CLAUDE.md context files rather than models: it mines tasks from your repository's merged pull requests, runs Claude Code against each task with and without your CLAUDE.md, and reports which configuration actually helps. Where possible it grades agent patches with your repo's real test suite and reports statistical comparisons across configurations.
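When a repo's tests can't be run, mdarena falls back to diff overlap scoring against the gold PR diff. The exact metric isn't documented here, but a minimal file-level sketch of the idea could look like this (function names and the Jaccard formulation are illustrative, not mdarena's API):

```python
def changed_files(diff_text: str) -> set[str]:
    """Extract the set of file paths touched by a unified diff."""
    files = set()
    for line in diff_text.splitlines():
        if line.startswith("+++ b/"):
            files.add(line[len("+++ b/"):])
    return files


def file_overlap_score(agent_diff: str, gold_diff: str) -> float:
    """Jaccard similarity between the file sets of two diffs."""
    agent_files = changed_files(agent_diff)
    gold_files = changed_files(gold_diff)
    if not agent_files and not gold_files:
        return 1.0  # both empty: trivially identical
    union = agent_files | gold_files
    return len(agent_files & gold_files) / len(union)


gold = "--- a/src/app.py\n+++ b/src/app.py\n@@ -1 +1 @@\n-x\n+y\n"
agent = (
    "--- a/src/app.py\n+++ b/src/app.py\n@@ -1 +1 @@\n-x\n+z\n"
    "--- a/README.md\n+++ b/README.md\n@@ -1 +1 @@\n-a\n+b\n"
)
print(file_overlap_score(agent, gold))  # 0.5: one shared file out of two total
```

mdarena's report also mentions hunk-level overlap, which would compare the changed line ranges within each shared file rather than just file paths.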
## Key Features
- Task mining from your own merged PRs, with auto-detected test commands
- Comprehensive metrics: test pass/fail, file/hunk overlap, cost, tokens, and paired t-test significance
- Head-to-head comparison of multiple CLAUDE.md files against a no-context baseline
- Public roadmap (ROADMAP.md) and community contributions
## Why It Matters for Developers
Measuring a CLAUDE.md instead of writing it blindly lets teams catch context files that hurt agent performance, avoid paying 20%+ more tokens for guidance that adds noise, and iterate on their instructions with evidence from their own codebase.
[Read the original repository on GitHub](https://github.com/HudsonGri/mdarena)
*For more AI‑tool insights, explore our related articles on ubos.tech.*