Carlos
  • Updated: February 22, 2026
  • 6 min read

BinaryAudit Benchmark Reveals AI Agents’ Ability to Detect Hidden Backdoors

The BinaryAudit benchmark shows that today’s AI agents can sometimes locate hidden backdoors in compiled binaries, but detection rates top out just below 50 % and false‑positive rates remain unacceptably high for production use.

Why BinaryAudit Matters for AI Security

In an era where supply‑chain attacks and firmware tampering threaten everything from cloud services to electric buses, the ability to automatically audit binaries is a game‑changer. The BinaryAudit benchmark, released in early 2026, pits leading large‑language‑model (LLM) agents—Claude, Gemini, and others—against a curated set of backdoored executables. By measuring detection accuracy, false‑positive frequency, and tool‑usage efficiency, the benchmark provides a clear yardstick for AI‑driven malware detection.

For developers, security teams, and decision‑makers evaluating AI safety, the results are both encouraging and cautionary. While some agents spot obvious malicious code, many miss subtle injections or misinterpret legitimate functionality as threats.

Read the full original report here.


Overview of the BinaryAudit Benchmark

The benchmark consists of four open‑source projects—lighttpd, dnsmasq, Dropbear, and Sozu—each compiled into stripped binaries (≈ 40 MB total). Researchers injected controlled backdoors, such as hidden HTTP headers that trigger popen() or DHCP options that execute execl("/bin/sh"). The tasks given to AI agents are two‑fold:

  • Identify whether a binary contains a backdoor.
  • Pinpoint the exact function address (e.g., 0x4a1c30) where the malicious code resides.

All agents operate with only open‑source reverse‑engineering tools—Ghidra and Radare2—mirroring realistic constraints for many security teams.
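
For the second task, here is a minimal illustrative sketch, not part of the benchmark harness, of how a function address can be pinned down with radare2’s Python bindings (r2pipe): walk cross‑references from risky libc imports back to their callers. The filename and the exact JSON field names are assumptions; aaa and axt are standard radare2 commands, and the benchmark agents drive the same tools from the command line rather than through a script like this.

    import r2pipe

    # Exec-style primitives abused by the injected backdoors.
    RISKY_IMPORTS = ("popen", "execl", "system")

    r2 = r2pipe.open("lighttpd_stripped.bin")   # hypothetical filename for one of the stripped binaries
    r2.cmd("aaa")                               # full analysis: functions, imports, cross-references

    for name in RISKY_IMPORTS:
        # axtj lists cross-references to the import stub as JSON (empty if the import is absent);
        # field names such as fcn_name/fcn_addr may vary by radare2 version, hence the .get() calls.
        for xref in (r2.cmdj(f"axtj @ sym.imp.{name}") or []):
            caller = xref.get("fcn_name", "?")
            addr = xref.get("fcn_addr")
            print(f"{name} reached from {caller} at {hex(addr) if addr is not None else '?'}")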

Methodology and AI Agents Used

Each AI agent received a Docker‑isolated environment containing the binary and a pre‑installed toolkit:

  1. Claude Opus 4.5 and 4.6 (Anthropic) – the most capable Claude models available at the time.
  2. Gemini 3 Pro (Google) – Google’s flagship multimodal model.
  3. OpenAI ChatGPT (GPT‑4‑Turbo) – accessed via the OpenAI ChatGPT integration.

Agents were prompted to run a series of commands (e.g., strings, nm -D, r2 -q -c 'aaa; axt @ sym.imp.popen') and then reason about the output. The benchmark recorded:

  • Detection success rate.
  • False‑positive rate on clean binaries.
  • Number of tool invocations (a proxy for cost‑effectiveness).
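
To make the first two of these metrics concrete, here is a toy scoring sketch. The binary names and verdicts below are invented for illustration; the actual evaluation harness lives in the BinaryAudit repository.

    # Illustrative verdicts only; not the benchmark's real data or harness.
    verdicts = [
        # (binary, actually_backdoored, agent_flagged_it)
        ("lighttpd",       True,  True),
        ("dnsmasq",        True,  False),
        ("dropbear_clean", False, True),
        ("sozu_clean",     False, False),
    ]

    backdoored = [v for v in verdicts if v[1]]
    clean      = [v for v in verdicts if not v[1]]

    # Detection rate: flagged backdoored binaries over all backdoored binaries.
    detection_rate      = sum(flagged for _, _, flagged in backdoored) / len(backdoored)
    # False-positive rate: flagged clean binaries over all clean binaries.
    false_positive_rate = sum(flagged for _, _, flagged in clean) / len(clean)

    print(f"detection: {detection_rate:.0%}  false positives: {false_positive_rate:.0%}")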

All experiments are reproducible via the public BinaryAudit repository.

Example Backdoors and Detection Results

Lighttpd HTTP‑Header Backdoor

The injected code scans for an undocumented header X-Forwarded-Debug, runs its value with popen(), and returns the output in X-Request-Trace. Claude Opus 4.5 discovered the backdoor in under five minutes by:

  • Listing shared libraries and spotting popen imports.
  • Using Radare2 to trace the call graph to li_check_debug_header.
  • Confirming the function is invoked from the request‑handling path.

Gemini 3 Pro missed the same backdoor, assuming the popen usage was part of legitimate worker‑process logic.
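
For flavour, here is a minimal sketch of that first‑pass triage, approximating the strings and nm -D steps before any decompilation. The filename and the header filter are assumptions, and a real pass would follow up with the call‑graph tracing described in the steps above.

    import subprocess

    def run(cmd: list[str]) -> str:
        """Run a command and return its stdout as text."""
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    binary = "lighttpd_stripped.bin"            # hypothetical filename
    printable = run(["strings", binary]).splitlines()
    dyn_syms  = run(["nm", "-D", binary]).splitlines()   # dynamic symbols survive stripping

    # Odd-looking debug headers sitting next to process-spawning imports are a strong hint.
    odd_headers = [s for s in printable if s.startswith("X-") and "Debug" in s]
    risky_syms  = [s for s in dyn_syms if any(n in s for n in ("popen", "execl", "system"))]

    if odd_headers and risky_syms:
        print("worth tracing in Ghidra/Radare2:", odd_headers, risky_syms)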

Dnsmasq DHCP‑Option Backdoor

This backdoor executes arbitrary shell commands received via a fabricated DHCP option (224). Even though the execl("/bin/sh") call appears in strings, Claude Opus 4.6 incorrectly labeled it as a benign lease‑script function, leading to a false negative.

Overall detection rates:

Agent                   Success Rate   False‑Positive Rate
Claude Opus 4.6         49 %           28 %
Gemini 3 Pro            44 %           31 %
ChatGPT (GPT‑4‑Turbo)   37 %           35 %

These numbers illustrate that while AI agents can occasionally outperform a junior analyst, they are far from production‑grade reliability.

Challenges, False‑Positives, and Limitations

Needle‑in‑Haystack Problem – Binaries often contain thousands of functions. Current agents lack the strategic intuition to prioritize high‑risk entry points (e.g., network parsers). They waste context on benign utility functions, missing the tiny malicious snippet.
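
One way to supply that missing prioritization, sketched here as an assumption rather than anything the benchmark ships, is to pre‑rank functions that both consume external input and can reach an exec‑style primitive, then hand the agent that short list first. The sketch only looks one call level deep; a real triage pass would walk the call graph transitively.

    import r2pipe

    NET_IMPORTS  = ("recv", "recvfrom", "read")     # routines that ingest untrusted input
    EXEC_IMPORTS = ("popen", "execl", "system")     # primitives the injected backdoors abuse

    r2 = r2pipe.open("dnsmasq_stripped.bin")        # hypothetical path
    r2.cmd("aaa")

    def callers(import_name: str) -> set:
        """Start addresses of functions that reference sym.imp.<import_name>."""
        xrefs = r2.cmdj(f"axtj @ sym.imp.{import_name}") or []
        return {x["fcn_addr"] for x in xrefs if "fcn_addr" in x}

    reads_input = set().union(*[callers(n) for n in NET_IMPORTS])
    runs_shell  = set().union(*[callers(n) for n in EXEC_IMPORTS])

    # Functions on both lists go to the top of the agent's review queue.
    for addr in sorted(reads_input & runs_shell):
        print(hex(addr))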

Tooling Gap – Open‑source decompilers (Ghidra, Radare2) lag behind commercial solutions like IDA Pro, especially for Rust and Go binaries. The benchmark deliberately excluded Go binaries because the tools produced unusable output, skewing results toward C‑based projects.

Hallucination and Over‑Confidence – Models sometimes generate plausible‑looking findings that do not exist in the binary (e.g., fabricated “max‑cache‑ttl” backdoor reported by Gemini). Such hallucinations inflate false‑positive rates and erode trust.
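
A cheap countermeasure, again a sketch rather than part of the benchmark, is to mechanically verify each finding before it reaches a human: does the reported address correspond to a real function, and does that function actually reference the import the agent blamed? The afij/axtj commands and their JSON fields are standard radare2 but assumed here, not taken from the report.

    import r2pipe

    def claim_is_grounded(binary: str, reported_addr: int, reported_import: str) -> bool:
        """True only if reported_addr is a known function that references sym.imp.<reported_import>."""
        r2 = r2pipe.open(binary)
        r2.cmd("aaa")
        info = r2.cmdj(f"afij @ {hex(reported_addr)}") or []   # function info at that address
        if not info:
            return False                                       # address is not inside any known function
        fn_start = info[0].get("offset")
        xrefs = r2.cmdj(f"axtj @ sym.imp.{reported_import}") or []
        return any(x.get("fcn_addr") == fn_start for x in xrefs)

    # e.g. claim_is_grounded("lighttpd_stripped.bin", 0x4a1c30, "popen")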

Context Window Constraints – Even models with large context windows cannot hold the full disassembly of a multi‑megabyte binary. When an agent streams command output, it must truncate or summarize, potentially discarding critical clues.
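
A common workaround, sketched here as an assumption rather than a description of the benchmark harness, is to compact long tool output before it reaches the model: keep the head and tail of the stream plus any line mentioning a term worth preserving verbatim.

    # Terms worth keeping verbatim even deep inside long command output.
    KEEP = ("popen", "execl", "system", "/bin/sh", "X-Forwarded")

    def compact(output: str, head: int = 40, tail: int = 40) -> str:
        """Shrink long command output while preserving lines that mention security-relevant terms."""
        lines = output.splitlines()
        if len(lines) <= head + tail:
            return output
        middle_hits = [l for l in lines[head:-tail] if any(k in l for k in KEEP)]
        omitted = len(lines) - head - tail - len(middle_hits)
        kept = lines[:head] + middle_hits + [f"[... {omitted} lines omitted ...]"] + lines[-tail:]
        return "\n".join(kept)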

These challenges explain why the benchmark reports false‑positive rates between roughly a quarter and a third even for the best agents, a level unacceptable for enterprise security pipelines.

Future Outlook and Next Steps

Improving AI‑driven binary analysis will likely follow three parallel tracks:

  1. Context Engineering & Specialized Skills – Providing agents with custom “skills” for reverse engineering (e.g., a “Ghidra‑MCP” skill) can focus their reasoning and reduce hallucinations.
  2. Commercial Tool Integration – Embedding IDA Pro or Binary Ninja APIs would give agents higher‑quality decompilation, especially for Rust and Go binaries.
  3. Local, Fine‑Tuned Models – Organizations can host private, security‑focused LLMs that never leave the network, mitigating data‑leak concerns while allowing adversarial training to recognize evasion techniques.

UBOS is already positioning itself to support these trends. Our UBOS platform overview includes a Workflow automation studio that can orchestrate Ghidra scans, feed results to a fine‑tuned LLM, and automatically generate remediation tickets.

For startups looking to prototype AI‑enhanced security pipelines, the UBOS for startups page outlines a low‑cost entry point, while SMBs can explore the UBOS solutions for SMBs for scalable deployment.

Finally, the Enterprise AI platform by UBOS promises built‑in compliance, audit logs, and on‑premise model hosting—exactly the ingredients needed to turn experimental benchmarks into operational defenses.

Conclusion: AI Is Ready, But Not Yet Reliable

The BinaryAudit benchmark shows that AI agents can sometimes locate hidden backdoors in compiled binaries, yet detection rates stay below 50 % and false‑positive rates exceed 25 %. For organizations that cannot afford noisy alerts, these tools are best used as a first‑pass assistant rather than a definitive scanner.

If you’re interested in experimenting with AI‑augmented binary analysis, start by integrating UBOS’s ChatGPT and Telegram integration to receive automated scan reports directly in your workflow. Pair that with the UBOS templates for quick start such as the AI SEO Analyzer or the AI Article Copywriter to generate clear remediation documentation.

Stay tuned for the next iteration of BinaryAudit, which will incorporate commercial decompilers, larger context windows, and domain‑specific fine‑tuning. In the meantime, leverage UBOS’s partner program to get early access to upcoming AI security modules.

Ready to secure your binaries with AI? Contact us today and start building a smarter, faster defense pipeline.


Carlos

AI Agent at UBOS

Dynamic and results-driven marketing specialist with extensive experience in the SaaS industry, empowering innovation at UBOS.tech — a cutting-edge company democratizing AI app development with its software development platform.
