- Updated: March 11, 2026
- 3 min read
ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context

Authors: Zidi Xiu, David Q. Sun, Kevin Cheng, Maitrik Patel, Josh Date, Yizhe Zhang, Jiarui Lu, Omar Attia, Raviteja Vemulapalli, Oncel Tuzel, Meng Cao, Samy Bengio
Published: March 3, 2026
Abstract: Next‑generation AI must manage vast personal data, diverse tools, and multi‑step reasoning, yet most benchmarks remain context‑free and single‑turn. We present ASTRA‑bench (Assistant Skills in Tool‑use, Reasoning & Action‑planning), a benchmark that uniquely unifies time‑evolving personal context with an interactive toolbox and complex user intents. Our event‑driven pipeline generates 2,413 scenarios across four protagonists, grounded in longitudinal life events and annotated by referential, functional, and informational complexity. Evaluation of state‑of‑the‑art models (e.g., Claude‑4.5‑Opus, DeepSeek‑V3.2) reveals significant performance degradation under high‑complexity conditions, with argument generation emerging as the primary bottleneck. These findings expose critical limitations in current agents’ ability to ground reasoning within messy personal context and orchestrate reliable multi‑step plans. We release ASTRA‑bench with a full execution environment and evaluation scripts to provide a diagnostic testbed for developing truly context‑aware AI assistants.
Why ASTRA‑bench Matters
- Bridges the gap between isolated tool‑use benchmarks and real‑world personal‑assistant scenarios.
- Introduces longitudinal user context, enabling evaluation of memory, continuity, and privacy‑aware reasoning.
- Annotates every scenario along three complexity dimensions (referential, functional, informational) for fine‑grained model analysis.
Benchmark Design
The benchmark is built on an event‑driven pipeline that simulates realistic life events for four distinct personas. Each scenario combines:
- Personal Context: Time‑stamped user data such as calendar entries, emails, and device logs.
- Toolbox: A curated set of APIs (e.g., email, calendar, file system, web search) that the agent can invoke.
- User Intent: Multi‑step, ambiguous requests that require planning, tool selection, and context grounding.
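A single scenario can be pictured as a record combining these three parts. The sketch below is purely illustrative: the field names (`persona`, `context`, `tools`, `intent`, `complexity`) are assumptions for exposition, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """Hypothetical shape of one ASTRA-bench-style scenario (field names assumed)."""
    persona: str            # one of the four protagonists
    context: list[dict]     # time-stamped personal events (calendar, email, logs)
    tools: list[str]        # APIs the agent may invoke
    intent: str             # the multi-step, possibly ambiguous user request
    complexity: dict = field(default_factory=dict)  # referential/functional/informational ratings

scenario = Scenario(
    persona="persona_a",
    context=[{"ts": "2026-02-14T09:00", "type": "calendar",
              "text": "Dentist appointment"}],
    tools=["calendar", "email", "web_search"],
    intent="Reschedule my dentist appointment to next week and email the office.",
    complexity={"referential": 2, "functional": 1, "informational": 1},
)
```

Framing scenarios this way makes the evaluation axes explicit: the agent must ground the intent in the time-stamped context, pick the right tools, and plan multiple steps.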
Key Findings
When evaluated on leading large‑language models, ASTRA‑bench highlights three critical pain points:
- Context Overload: Performance drops sharply as the amount of personal history increases.
- Tool‑Selection Errors: Models often choose sub‑optimal tools or misuse them.
- Argument‑Generation Bottleneck: Constructing correct arguments for tool calls is the primary point of failure, and the strongest predictor of overall success.
Getting Started
The full dataset, execution environment, and evaluation scripts are publicly available on arXiv and our UBOS repository. Researchers can quickly spin up the benchmark using our Docker images and start testing their agents.
Conclusion
ASTRA‑bench sets a new standard for assessing AI assistants in realistic, context‑rich environments. By exposing current limitations, it paves the way for next‑generation agents that can truly understand and act upon personal user data.
Read the full paper, explore the code, and join the community at ubos.tech.