- Updated: March 11, 2026
TraceSIR: A Multi-Agent Framework for Structured Analysis and Reporting of Agentic Execution Traces
Direct Answer
TraceSIR introduces a multi‑agent framework that automatically compresses, diagnoses, and reports on the long, tangled execution traces produced by agentic AI systems. By turning raw trace logs into a structured, actionable narrative, it makes failure analysis scalable and reliable for real‑world deployments.
Background: Why This Problem Is Hard
Agentic systems—large language models (LLMs) augmented with tool use, planning loops, and external APIs—are now the backbone of complex workflows such as autonomous research assistants, code‑generation pipelines, and multi‑step function‑calling bots. While these agents can solve sophisticated problems, each decision point spawns a cascade of tool invocations, state updates, and conditional branches. The resulting execution trace can contain thousands of lines of JSON, console output, and API payloads.
Three inter‑related bottlenecks make trace analysis a pressing challenge:
- Volume and length limits. A full trace can run to thousands of lines, while the lightweight models typically used for automated review have context windows of only a few thousand tokens. Feeding the raw log into another model for debugging quickly exceeds these limits.
- Signal‑to‑noise ratio. Most entries are routine bookkeeping (e.g., “tool called”, “response received”). Critical failure cues—mis‑parsed arguments, dead‑end planning steps, or subtle latency spikes—are buried deep within the noise.
- Human scalability. Engineers currently resort to manual inspection, a time‑consuming process that does not scale across dozens or hundreds of daily runs. Moreover, manual reviews are prone to bias and miss root causes that only emerge after cross‑run comparison.
Existing approaches either truncate traces to fit within model windows—losing essential context—or rely on generic log‑parsing scripts that cannot reason about the semantics of agentic decisions. Consequently, diagnosing why an agent failed, pinpointing the exact step that caused a regression, or suggesting concrete optimizations remains largely manual.
What the Researchers Propose
TraceSIR tackles the problem by orchestrating three specialized agents, each with a focused responsibility:
- StructureAgent creates a novel abstraction called TraceFormat. This format compresses the raw execution log into a hierarchical representation that preserves decision‑making context while discarding redundant boilerplate.
- InsightAgent consumes the structured trace to perform fine‑grained diagnosis. It localizes the failure, conducts root‑cause analysis, and generates optimization suggestions such as prompt refinements or tool‑selection heuristics.
- ReportAgent aggregates insights across multiple task instances, synthesizing a cohesive analysis report that highlights common patterns, outliers, and actionable recommendations.
The key conceptual leap is the separation of concerns: compression, reasoning, and reporting are delegated to agents that can be independently optimized, swapped, or scaled. This modularity also enables the framework to remain agnostic to the underlying LLM or tool suite used by the target agentic system.
How It Works in Practice
Conceptual Workflow
The end‑to‑end pipeline can be visualized as a three‑stage assembly line:
| Stage | Agent | Input | Output |
|---|---|---|---|
| 1. Trace Compression | StructureAgent | Raw execution log (JSON, stdout, API calls) | TraceFormat (tree‑structured summary) |
| 2. Insight Generation | InsightAgent | TraceFormat + optional ground‑truth labels | Diagnoses, root‑cause tags, optimization hints |
| 3. Report Synthesis | ReportAgent | Collection of insights from multiple runs | Human‑readable analysis report (PDF/HTML) |
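The table above can be sketched as a set of data contracts and stage functions. This is a minimal illustration, not the paper's API: every name below (`TraceNode`, `Insight`, the stub agent bodies) is an assumption, and the stubs stand in for the LLM-backed agents.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative contracts for the three TraceSIR stages; all names are
# assumptions -- the paper does not publish a concrete API.

@dataclass
class TraceNode:
    """One high-level action in the compressed TraceFormat tree."""
    action: str                              # e.g. "search_web"
    timestamp: float
    children: List["TraceNode"] = field(default_factory=list)

@dataclass
class Insight:
    """InsightAgent output for a single run."""
    failure_node: str
    probable_cause: str
    confidence: float
    remediation: str

def structure_agent(raw_log: List[dict]) -> TraceNode:
    # Stage 1 (stub): wrap flat log entries under a synthetic root node.
    root = TraceNode(action="run", timestamp=0.0)
    root.children = [TraceNode(e["action"], e["ts"]) for e in raw_log]
    return root

def insight_agent(trace: TraceNode) -> Insight:
    # Stage 2 (stub): flag the first node whose action looks like an error.
    for node in trace.children:
        if "error" in node.action:
            return Insight(node.action, "tool failure", 0.9, "add retry")
    return Insight("none", "no failure detected", 1.0, "n/a")

def report_agent(insights: List[Insight]) -> str:
    # Stage 3 (stub): one summary line per run.
    return "\n".join(f"{i.failure_node}: {i.probable_cause}" for i in insights)

log = [{"action": "search_web", "ts": 1.0}, {"action": "tool_error", "ts": 2.0}]
report = report_agent([insight_agent(structure_agent(log))])
```

The value of the contract is that each stage can be reimplemented (rule-based, small LLM, large LLM) without touching the others, which is exactly the modularity the framework claims.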
Interaction Details
- Step 1 – Structured Abstraction. StructureAgent parses the raw trace using a combination of pattern matching and a lightweight LLM that operates within a 2k‑token window. It extracts high‑level actions (e.g., “search web”, “invoke calculator”), annotates them with timestamps, and nests sub‑actions under their parent decisions, forming a directed acyclic graph.
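A toy version of this nesting step, assuming each raw entry carries an `id` and a `parent_id` linking it to its parent decision (field names are illustrative; the paper does not specify the log schema):

```python
# Sketch of Step 1's nesting logic: flat log entries are grouped under
# their parent decisions to form a tree. Field names are assumptions.

def build_trace_format(entries):
    """entries: list of dicts with 'id', 'parent_id', 'action', 'ts'."""
    nodes = {e["id"]: {"action": e["action"], "ts": e["ts"], "children": []}
             for e in entries}
    roots = []
    for e in entries:
        node = nodes[e["id"]]
        parent = e.get("parent_id")
        if parent is None:
            roots.append(node)                      # top-level decision
        else:
            nodes[parent]["children"].append(node)  # nested sub-action
    return roots

log = [
    {"id": 1, "parent_id": None, "action": "plan_step", "ts": 0.0},
    {"id": 2, "parent_id": 1, "action": "search_web", "ts": 0.4},
    {"id": 3, "parent_id": 1, "action": "invoke_calculator", "ts": 0.9},
]
tree = build_trace_format(log)
```

Boilerplate entries ("tool called", "response received") would be dropped before this step, which is where most of the compression comes from.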
- Step 2 – Diagnostic Reasoning. InsightAgent receives the graph and runs a chain‑of‑thought prompting routine that asks “Which node deviates from the expected success path?” and “What evidence supports this deviation?” The agent produces a structured JSON payload containing:
- Failure node identifier
- Probable cause (e.g., malformed argument, tool timeout)
- Confidence score
- Suggested remediation (e.g., adjust prompt template, increase retry limit)
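A hypothetical instance of that payload, together with a small validation check a downstream consumer might apply. The key names mirror the four fields listed above but are assumptions; the paper does not publish the exact schema.

```python
import json

# Hypothetical shape of the InsightAgent's JSON payload; key names are
# illustrative, matching the four fields described in the text.
payload = json.loads("""{
  "failure_node": "step_7_invoke_calculator",
  "probable_cause": "malformed argument",
  "confidence": 0.87,
  "suggested_remediation": "adjust prompt template"
}""")

REQUIRED = {"failure_node", "probable_cause", "confidence",
            "suggested_remediation"}

def validate_insight(p: dict) -> bool:
    # Reject payloads missing a field or with an out-of-range confidence.
    return REQUIRED <= p.keys() and 0.0 <= p["confidence"] <= 1.0
```

Validating the payload at the agent boundary keeps a malformed LLM response from silently corrupting the cross-run aggregation in Step 3.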
- Step 3 – Cross‑Run Aggregation. ReportAgent ingests insight payloads from a batch of runs (often dozens). It clusters similar failure modes, computes frequency statistics, and drafts narrative sections that summarize systemic issues versus one‑off anomalies.
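The clustering and frequency-counting part of Step 3 can be reduced to a few lines. This sketch groups payloads by their probable cause; the `systemic_threshold` cutoff separating systemic issues from one-off anomalies is an illustrative choice, not a value from the paper.

```python
from collections import Counter

# Toy version of Step 3's aggregation: group insight payloads by
# probable cause. The threshold below is an illustrative assumption.

def aggregate(insights, systemic_threshold=3):
    counts = Counter(i["probable_cause"] for i in insights)
    systemic = {c: n for c, n in counts.items() if n >= systemic_threshold}
    one_off = {c: n for c, n in counts.items() if n < systemic_threshold}
    return systemic, one_off

batch = (
    [{"probable_cause": "tool timeout"}] * 5
    + [{"probable_cause": "malformed argument"}] * 1
)
systemic, one_off = aggregate(batch)
```

In practice the ReportAgent would cluster semantically similar causes (e.g., via embeddings) rather than exact string matches, but the systemic-versus-anomaly split is the same.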
What sets TraceSIR apart is the explicit, reusable contract between agents (TraceFormat) that sidesteps the raw‑trace length barrier while retaining semantic richness. Because each agent can be swapped for a more powerful model or a domain‑specific rule set, the framework can evolve alongside advances in LLM capabilities.
Evaluation & Results
Benchmark Suite: TraceBench
To validate the framework, the authors built TraceBench, a collection of three real‑world agentic scenarios:
- Research Assistant – an LLM that iteratively searches scholarly databases, extracts citations, and drafts a literature review.
- Function‑Calling Bot – a code‑generation agent that calls external APIs to retrieve data, performs transformations, and returns a JSON report.
- Autonomous Coding Agent – a multi‑step system that writes, tests, and debugs Python scripts using a REPL tool.
Each scenario generated 200 execution traces, half of which contained injected failures (e.g., API timeouts, mis‑parsed arguments). The authors then compared TraceSIR against two baselines:
- Raw‑LLM Review – feeding the uncompressed trace to a large model with a “find the bug” prompt.
- Heuristic Log Parser – a rule‑based script that flags error codes and stack traces.
ReportEval Protocol
Beyond raw accuracy, the authors introduced ReportEval, a human‑centric evaluation that scores generated reports on four dimensions:
- Coherence – logical flow and readability.
- Informative Value – presence of actionable insights.
- Root‑Cause Precision – correctness of the diagnosed failure point.
- Actionability – usefulness of suggested optimizations.
Three senior ML engineers rated 30 randomly selected reports per method on a 5‑point Likert scale.
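One plausible way to aggregate those ratings into a single ReportEval number is a plain mean over raters and dimensions. The ratings below are made up for illustration; the paper does not specify its aggregation formula.

```python
# Sketch of ReportEval aggregation: mean over three raters and four
# dimensions for one report. The ratings here are fabricated examples.

DIMENSIONS = ["coherence", "informative_value",
              "root_cause_precision", "actionability"]

def mean_score(ratings):
    """ratings: list of per-rater dicts mapping dimension -> 1..5."""
    cells = [r[d] for r in ratings for d in DIMENSIONS]
    return sum(cells) / len(cells)

raters = [
    {"coherence": 5, "informative_value": 4,
     "root_cause_precision": 5, "actionability": 4},
    {"coherence": 4, "informative_value": 5,
     "root_cause_precision": 4, "actionability": 5},
    {"coherence": 5, "informative_value": 5,
     "root_cause_precision": 4, "actionability": 4},
]
score = mean_score(raters)
```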
Key Findings
- Diagnosis Accuracy. TraceSIR correctly localized failures in 92 % of cases, versus 68 % for Raw‑LLM Review and 55 % for the heuristic parser.
- Report Quality. Across the four ReportEval dimensions, TraceSIR achieved an average score of 4.6, outpacing the next best baseline (3.4) by a full point.
- Scalability. The compression step reduced average trace size from 12 KB to 1.3 KB (≈ 89 % reduction) without losing diagnostic fidelity, enabling the InsightAgent to operate comfortably within a 4k‑token window.
- Cross‑Run Insight. ReportAgent identified a recurring pattern of “tool‑timeout mis‑classification” across the Coding Agent scenario, prompting a simple retry‑policy change that cut failure rates by 37 % in a follow‑up experiment.
Collectively, these results demonstrate that TraceSIR not only outperforms naïve baselines but also delivers actionable system‑level improvements that would be difficult to uncover through manual log inspection.
Why This Matters for AI Systems and Agents
For practitioners building production‑grade agentic pipelines, TraceSIR offers a concrete pathway to move from reactive debugging to proactive system health monitoring. The framework’s modular agents can be integrated into existing orchestration platforms, providing:
- Automated Root‑Cause Detection. Engineers no longer need to sift through gigabytes of logs; the InsightAgent surfaces the exact node responsible for a failure.
- Continuous Optimization Loop. The ReportAgent’s aggregated insights feed directly into CI/CD pipelines, enabling automated adjustments to prompts, tool selection, or retry policies.
- Compliance and Auditing. Structured reports satisfy governance requirements by documenting decision pathways and remediation steps, a growing need in regulated industries.
- Resource Efficiency. By compressing traces early, the framework reduces storage costs and keeps downstream LLM inference within context limits.
Organizations that have adopted agentic architectures—such as autonomous research assistants or code‑generation services—can plug TraceSIR into their monitoring stack to achieve faster mean‑time‑to‑resolution (MTTR) and higher overall reliability.
For a deeper dive into integrating trace analysis with orchestration tools, see our guide on agent orchestration best practices.
What Comes Next
While TraceSIR marks a significant step forward, several open challenges remain:
- Domain‑Specific Extensions. The current TraceFormat is generic; extending it to capture domain‑specific semantics (e.g., financial transaction flows) will require custom ontologies.
- Real‑Time Diagnosis. Presently, the pipeline runs post‑hoc. Embedding InsightAgent reasoning into the live execution loop could enable self‑healing agents that adapt on the fly.
- Scalable Multi‑Agent Collaboration. As agent ecosystems grow, coordinating diagnostics across heterogeneous agents (different LLM providers, toolsets) will demand standardized trace schemas.
- Privacy‑Preserving Trace Sharing. Sharing traces for collective learning raises data‑privacy concerns; techniques such as differential privacy or secure enclaves could be explored.
Future research may also investigate coupling TraceSIR with reinforcement‑learning‑based policy improvement, where the InsightAgent’s suggestions directly inform reward shaping for the underlying agent.
Developers interested in experimenting with trace‑analysis tooling can explore the open‑source implementation and benchmark suite on the project’s GitHub repository: TraceSIR GitHub. For a concise overview of the original research, refer to the original arXiv paper.
Additional resources on building robust trace‑analysis pipelines are available at Trace Analysis Tools, where you can find templates, SDKs, and community case studies.