- Updated: March 11, 2026
- 3 min read
ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context

Authors: Zidi Xiu, David Q. Sun, Kevin Cheng, Maitrik Patel, Josh Date, Yizhe Zhang, Jiarui Lu, Omar Attia, Raviteja Vemulapalli, Oncel Tuzel, Meng Cao, Samy Bengio
Published: March 3, 2026
Abstract: Next‑generation AI must manage vast personal data, diverse tools, and multi‑step reasoning, yet most benchmarks remain context‑free and single‑turn. We present ASTRA‑bench (Assistant Skills in Tool‑use, Reasoning & Action‑planning), a benchmark that uniquely unifies time‑evolving personal context with an interactive toolbox and complex user intents. Our event‑driven pipeline generates 2,413 scenarios across four protagonists, grounded in longitudinal life events and annotated by referential, functional, and informational complexity. Evaluation of state‑of‑the‑art models (e.g., Claude‑4.5‑Opus, DeepSeek‑V3.2) reveals significant performance degradation under high‑complexity conditions, with argument generation emerging as the primary bottleneck. These findings expose critical limitations in current agents’ ability to ground reasoning within messy personal context and orchestrate reliable multi‑step plans. We release ASTRA‑bench with a full execution environment and evaluation scripts to provide a diagnostic testbed for developing truly context‑aware AI assistants.
Why ASTRA‑bench Matters
- Bridges the gap between isolated tool‑use benchmarks and real‑world personal‑assistant scenarios.
- Introduces longitudinal user context, enabling evaluation of memory, continuity, and privacy‑aware reasoning.
- Annotates every scenario along three complexity dimensions (referential, functional, informational) for fine‑grained model analysis.
Benchmark Design
The benchmark is built on an event‑driven pipeline that simulates realistic life events for four distinct personas. Each scenario combines:
- Personal Context: Time‑stamped user data such as calendar entries, emails, and device logs.
- Toolbox: A curated set of APIs (e.g., email, calendar, file system, web search) that the agent can invoke.
- User Intent: Multi‑step, ambiguous requests that require planning, tool selection, and context grounding.
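A single scenario can be pictured as a record combining these three parts. The sketch below is purely illustrative: the field names (`persona`, `context`, `tools`, `intent`, `complexity`) are assumptions for exposition, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """Hypothetical shape of one ASTRA-bench-style scenario (field names assumed)."""
    persona: str            # one of the four protagonists
    context: list[dict]     # time-stamped personal events (calendar, email, logs)
    tools: list[str]        # APIs the agent may invoke
    intent: str             # the multi-step, possibly ambiguous user request
    complexity: dict = field(default_factory=dict)  # referential/functional/informational ratings

scenario = Scenario(
    persona="persona_a",
    context=[{"ts": "2026-02-14T09:00", "type": "calendar",
              "text": "Dentist appointment"}],
    tools=["calendar", "email", "web_search"],
    intent="Reschedule my dentist appointment to next week and email the office.",
    complexity={"referential": 2, "functional": 1, "informational": 1},
)
```

Framing scenarios this way makes the evaluation axes explicit: the agent must ground the intent in the time-stamped context, pick the right tools, and plan multiple steps.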
Key Findings
When evaluated on leading large‑language models, ASTRA‑bench highlights three critical pain points:
- Context Overload: Performance drops sharply as the amount of personal history increases.
- Tool‑Selection Errors: Models often choose sub‑optimal tools or misuse them.
- Argument‑Generation Bottleneck: Constructing correct arguments for tool calls is the primary point of failure, and the strongest predictor of overall success.
Getting Started
The full dataset, execution environment, and evaluation scripts are publicly available on arXiv and our UBOS repository. Researchers can quickly spin up the benchmark using our Docker images and start testing their agents.
Conclusion
ASTRA‑bench sets a new standard for assessing AI assistants in realistic, context‑rich environments. By exposing current limitations, it paves the way for next‑generation agents that can truly understand and act upon personal user data.
Read the full paper, explore the code, and join the community at ubos.tech.