Updated: June 29, 2026

Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?

Direct Answer

The paper introduces AgentCIBench, a systematic evaluation harness that measures whether computer‑use agents (CUAs) respect contextual integrity, the privacy principle that information should flow only in ways appropriate to the context it came from. By turning privacy‑risk scenarios into deterministic, scored tests, the authors show that most of the frontier agents they evaluated leak sensitive data: eleven of fifteen leaked in more than half of the tested scenarios, and the mean leakage rate was 67.9 %. The results point to an urgent need for pre‑deployment safety checks.

Background: Why This Problem Is Hard

Computer‑use agents have moved from experimental prototypes to everyday assistants that draft emails, schedule meetings, and manage to‑do lists. Their utility stems from the ability to cross application boundaries: a single prompt can trigger actions in a calendar, an email client, and a note‑taking app at once. However, this very cross‑application access creates a privacy blind spot that traditional security models do not capture.

Existing safeguards—such as permission scopes, sandboxing, or static data‑flow analysis—assume that an agent’s actions are confined to a single, well‑defined task. In practice, agents often operate under ambiguous user instructions, visual UI cues, or implicit expectations about who will receive the output. These ambiguities give rise to three concrete failure modes:

Visual co‑location: The agent copies information that appears adjacent to the target UI element, even if that information belongs to a different privacy context.
Task‑ambiguity overshare: When a prompt is underspecified, the agent compensates by dumping a large chunk of the user’s personal state, exposing data that was never requested.
Recipient misalignment: The agent forwards content to a recipient for whom the information is inappropriate, violating the intended audience’s privacy expectations.
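To make the first failure mode concrete, here is a minimal Python sketch of how visual co‑location can arise. It is not taken from the paper: the `UIElement` type, the pixel coordinates, the 50‑pixel radius, and both helper functions are hypothetical, chosen only to contrast an agent that gathers text by proximity with one that also checks the privacy context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    """Hypothetical on-screen element with an attached privacy context."""
    label: str
    text: str
    context: str  # e.g. "work", "family", "medical"
    x: int
    y: int


def _distance(a: UIElement, b: UIElement) -> int:
    # Manhattan distance in pixels; a stand-in for "visually adjacent".
    return abs(a.x - b.x) + abs(a.y - b.y)


def naive_nearby_text(target, elements, radius=50):
    """Collect text from everything near the target, ignoring context."""
    return [e.text for e in elements if _distance(e, target) <= radius]


def context_aware_text(target, elements, radius=50):
    """Same proximity rule, but keep only elements in the target's context."""
    return [
        e.text
        for e in elements
        if _distance(e, target) <= radius and e.context == target.context
    ]


target = UIElement("event_title", "Sync with Alice", "work", 100, 100)
note = UIElement("sticky_note", "Cardiology appointment Friday", "medical", 120, 110)
far = UIElement("inbox", "Quarterly report", "work", 600, 400)
elements = [target, note, far]

naive = naive_nearby_text(target, elements)
safe = context_aware_text(target, elements)
```

In this toy layout the proximity‑only helper picks up the medical note because it sits next to the calendar field, while the context‑aware variant does not.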

Current research on AI safety largely focuses on hallucinations, alignment, or adversarial robustness. Few works address the nuanced, context‑dependent leakage that emerges when agents interact with real‑world productivity tools. Moreover, there is no standardized benchmark that can reliably surface these leaks before an agent reaches production.

What the Researchers Propose

To fill this gap, Goel and Gurevych propose AgentCIBench, an evaluation harness that operationalizes contextual integrity as a set of executable test scenarios. The framework consists of three core components:

Scenario Generator: Encodes privacy‑risk situations into deterministic UI states and natural‑language prompts. Each scenario targets one of the three failure modes.
Agent Interface Layer: Wraps any CUA—whether a large language model with a UI‑automation plugin or a rule‑based macro—so that it can receive the prompt, interact with the simulated desktop environment, and produce an output.
Scoring Engine: Compares the agent’s output against a ground‑truth privacy policy, flagging any disclosed items that violate contextual integrity. Scores are binary (pass/fail) and aggregated into a leakage percentage.

The design is deliberately model‑agnostic: researchers can plug in a new agent without modifying the benchmark, enabling apples‑to‑apples comparisons across the rapidly evolving CUA landscape.
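One way to picture the three components and the model‑agnostic design is the Python sketch below. It is an illustration under assumptions, not the benchmark's actual API: `Scenario`, `Agent`, `score`, `evaluate`, and the two toy agents are hypothetical names, and substring matching stands in for whatever comparison the real Scoring Engine performs.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record: a prompt plus a deterministic UI state."""
    failure_mode: str
    prompt: str
    ui_state: dict[str, str]   # element name -> visible text
    forbidden: frozenset[str]  # items that must not appear in the output


class Agent(Protocol):
    """Anything with a matching run() method can be plugged in."""
    def run(self, prompt: str, ui_state: dict[str, str]) -> str: ...


def score(scenario: Scenario, output: str) -> bool:
    """Binary score: True means pass (no forbidden item was disclosed)."""
    return not any(item in output for item in scenario.forbidden)


def evaluate(agent: Agent, scenario: Scenario) -> bool:
    return score(scenario, agent.run(scenario.prompt, scenario.ui_state))


class EchoEverythingAgent:
    """Toy agent that copies every visible element into its output."""
    def run(self, prompt, ui_state):
        return " ".join(ui_state.values())


class TargetOnlyAgent:
    """Toy agent that uses only the element named 'target'."""
    def run(self, prompt, ui_state):
        return ui_state["target"]


scenario = Scenario(
    failure_mode="visual_co_location",
    prompt="Create a meeting with Alice tomorrow.",
    ui_state={"target": "Meeting with Alice", "adjacent_note": "Cardiology follow-up"},
    forbidden=frozenset({"Cardiology follow-up"}),
)

echo_passed = evaluate(EchoEverythingAgent(), scenario)
target_passed = evaluate(TargetOnlyAgent(), scenario)
```

Because `evaluate` depends only on the `run` signature, swapping in a different agent requires no change to the scenario or the scoring code, which is the property the paragraph above describes.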

How It Works in Practice

AgentCIBench follows a straightforward workflow that mirrors a real user’s interaction with a computer:

Environment Setup: A lightweight virtual desktop is instantiated with a set of pre‑populated applications (email client, calendar, notes, etc.). Each app contains synthetic but realistic personal data, deliberately segmented into distinct privacy contexts (e.g., “work”, “family”, “medical”).
Scenario Injection: The Scenario Generator places a target UI element (e.g., a new calendar event field) alongside other UI elements that hold prohibited data (e.g., a medical note). A natural‑language instruction is then issued to the agent, such as “Create a meeting with Alice tomorrow.”
Agent Execution: Through the Agent Interface Layer, the CUA receives the instruction, navigates the UI, and produces an output—typically a composed email, a calendar entry, or a file export.
Leak Detection: The Scoring Engine parses the output, extracts any referenced data, and checks whether each datum respects the contextual integrity rule defined for that scenario. For example, if the email includes a snippet from the medical note, the engine records a violation.
Result Aggregation: After running a batch of scenarios for each failure mode (the evaluation below uses 300+ scenarios across the three modes), the system reports the proportion of leaks, average severity, and per‑scenario breakdowns.
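The last two steps, leak detection and result aggregation, can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' implementation: `detect_leaks` and `aggregate` are hypothetical helpers, the four runs are invented example data, and case‑insensitive substring matching is an assumed stand‑in for the real parsing logic.

```python
from collections import defaultdict


def detect_leaks(output: str, forbidden: set[str]) -> set[str]:
    """Return the forbidden items that appear in the agent's output."""
    lowered = output.lower()
    return {item for item in forbidden if item.lower() in lowered}


def aggregate(results):
    """results: iterable of (failure_mode, leaked_items).

    Each run counts as one binary outcome: leaked or not leaked.
    Returns (leakage % per failure mode, overall leakage %).
    """
    totals = defaultdict(int)
    leaks = defaultdict(int)
    for mode, leaked in results:
        totals[mode] += 1
        leaks[mode] += bool(leaked)
    per_mode = {m: 100.0 * leaks[m] / totals[m] for m in totals}
    overall = 100.0 * sum(leaks.values()) / sum(totals.values())
    return per_mode, overall


# Invented example runs: (failure mode, agent output, forbidden items).
runs = [
    ("visual_co_location", "Meeting with Alice. Note: cardiology appointment Friday",
     {"cardiology appointment"}),
    ("visual_co_location", "Meeting with Alice tomorrow at 10", {"cardiology appointment"}),
    ("task_ambiguity", "Here is everything: salary 90k, address 1 Main St", {"salary 90k"}),
    ("recipient_misalignment", "Hi Bob, agenda attached", {"performance review"}),
]

results = [(mode, detect_leaks(out, forbidden)) for mode, out, forbidden in runs]
per_mode, overall = aggregate(results)
```

A run is counted as a leak if at least one forbidden item shows up, which mirrors the binary pass/fail scoring described earlier.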

What sets AgentCIBench apart from prior safety tests is its focus on contextual rather than purely semantic correctness. The benchmark does not penalize an agent for providing the right answer; it penalizes the agent for pulling in data from the wrong context, even if that data is factually accurate.
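The distinction between contextual and semantic correctness can be illustrated with a small Python sketch. The flow table, the role names, and the `violations` helper are assumptions made for this example and are not the paper's policy format. The point is only that a datum can be true and still violate the norm for a given recipient.

```python
# Hypothetical norms: which (data context, recipient role) flows are appropriate.
ALLOWED_FLOWS = {
    ("work", "colleague"),
    ("family", "relative"),
    ("medical", "doctor"),
}


def flow_is_appropriate(data_context: str, recipient_role: str) -> bool:
    return (data_context, recipient_role) in ALLOWED_FLOWS


def violations(disclosed, recipient_role):
    """disclosed: list of (datum, data_context) pairs found in the output."""
    return [
        datum
        for datum, context in disclosed
        if not flow_is_appropriate(context, recipient_role)
    ]


# Both statements are factually accurate, but only one belongs in a work email.
disclosed = [
    ("Meeting at 10am", "work"),
    ("Blood test results due Friday", "medical"),
]

to_colleague = violations(disclosed, "colleague")
to_doctor = violations(disclosed, "doctor")
```

The same two facts produce different violations depending on who receives them, so accuracy alone says nothing about whether a disclosure was appropriate.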

Evaluation & Results

The authors evaluated fifteen frontier CUAs, ranging from open‑source agents built on Ollama to commercial offerings that integrate with proprietary productivity suites. Each agent was subjected to 300+ scenarios covering the three failure modes.

Key Findings

High Failure Rate: Eleven out of fifteen agents leaked information in more than half of the tested scenarios.
Average Leakage: Across all agents, the mean leakage rate was 67.9 %, meaning that two‑thirds of the time the agents disclosed at least one piece of inappropriate data.
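Consistent Weaknesses: Visual co‑location was the most leak‑prone failure mode (≈78 %), followed by task‑ambiguity overshare (≈65 %) and recipient misalignment (≈52 %). These figures sum to more than 100 %, so they read most naturally as leakage rates within each failure mode rather than as shares of all leaks.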
End‑to‑End Persistence: When agents were allowed to complete the full task (e.g., actually sending the email), the same leakage patterns persisted, confirming that the issue is not limited to sandboxed output generation.

These results demonstrate that even agents touted as “privacy‑aware” can inadvertently become data‑leaking conduits when operating in realistic, multi‑application environments. The benchmark’s deterministic scoring also revealed that some agents performed well on isolated tasks but faltered when the UI presented visually adjacent, sensitive items—a nuance that traditional unit tests would miss.

Why This Matters for AI Systems and Agents

For practitioners building AI‑driven assistants, the findings carry three immediate takeaways:

Pre‑deployment Testing Must Include Contextual Checks: Traditional functional tests verify that an agent can schedule a meeting or draft an email. AgentCIBench adds a privacy layer, ensuring that the same actions do not cross contextual boundaries.
Designing Prompt‑Robust Interfaces: Ambiguous prompts are a primary driver of oversharing. Embedding clarification loops—where the agent asks follow‑up questions before accessing broader user state—can dramatically reduce leakage.
UI‑Aware Guardrails: Visual co‑location failures suggest that agents need to be aware of UI layout, not just textual content. Integrating visual attention models or UI‑semantic parsers can help the agent distinguish “target” from “adjacent” elements.
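The clarification‑loop idea from the second takeaway can be sketched as follows. This is a hedged illustration, not a technique specified in the paper: the `REQUIRED_SLOTS` table, the slot names, and `next_action` are all hypothetical. The agent asks a follow‑up question when a required detail is missing, and it acts only on the slots the task needs.

```python
# Hypothetical table of the details each task needs before the agent may act.
REQUIRED_SLOTS = {
    "schedule_meeting": ("attendee", "date", "time"),
}


def next_action(task: str, slots: dict[str, str]):
    """Ask for the first missing detail, or act using only the required slots."""
    missing = [name for name in REQUIRED_SLOTS[task] if not slots.get(name)]
    if missing:
        return ("ask", f"Could you specify the {missing[0]}?")
    # Pass along only what the task needs; any extra state is dropped.
    return ("act", {name: slots[name] for name in REQUIRED_SLOTS[task]})


underspecified = next_action(
    "schedule_meeting", {"attendee": "Alice", "date": "tomorrow"}
)
complete = next_action(
    "schedule_meeting",
    {
        "attendee": "Alice",
        "date": "tomorrow",
        "time": "10:00",
        "medical_note": "Cardiology follow-up",  # available, but not required
    },
)
```

Filtering to the required slots means that extra state the agent happens to hold, such as the medical note here, never reaches the action.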

Organizations that already rely on AI assistants for customer outreach, internal coordination, or knowledge management can leverage AgentCIBench as a safety gate before rolling out new agents. By embedding the benchmark into CI/CD pipelines, teams can catch privacy regressions early, much like they catch performance regressions with unit tests.
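A CI/CD gate of this kind could look roughly like the Python sketch below. The `gate` function and the 5 % threshold are assumptions for illustration; the source does not prescribe a threshold or a pipeline integration. The function takes the binary pass/fail results of a benchmark run and returns a process‑style exit code.

```python
def gate(pass_fail: list[bool], max_leakage_pct: float = 5.0) -> int:
    """Return 0 if the leakage rate is within the limit, 1 otherwise.

    pass_fail holds one entry per scenario: True = pass, False = leak.
    """
    if not pass_fail:
        # No results at all is treated as a failure (fail closed).
        return 1
    leakage = 100.0 * pass_fail.count(False) / len(pass_fail)
    print(f"leakage: {leakage:.1f}% (limit {max_leakage_pct:.1f}%)")
    return 0 if leakage <= max_leakage_pct else 1


# Invented results: 19 passing scenarios and 1 leak.
exit_code = gate([True] * 19 + [False])
```

In a pipeline, the returned code would typically be passed to `sys.exit` so that a regression above the limit fails the build.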

For example, a company using the UBOS platform to orchestrate AI workflows can integrate AgentCIBench into its Workflow automation studio, automatically flagging any new agent that violates contextual integrity before it reaches end users.

What Comes Next

While AgentCIBench marks a significant step forward, the authors acknowledge several limitations that open fertile ground for future research:

Scalability of Scenarios: The current benchmark relies on handcrafted UI states. Automating scenario generation with generative UI synthesis could broaden coverage to more applications and edge cases.
Dynamic Context Modeling: Real‑world privacy contexts evolve (e.g., a project moves from “confidential” to “public”). Future versions could incorporate a context‑evolution engine that tests agents under shifting policies.
User‑Feedback Loops: Integrating real user feedback on perceived privacy violations would enable agents to learn corrective behavior, turning the benchmark from a static test into an adaptive training signal.
Cross‑Domain Extensions: Extending the framework beyond desktop productivity—into mobile, voice assistants, or IoT—would test whether the same failure modes manifest in other interaction modalities.
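As a rough picture of what testing under shifting policies might involve, the Python sketch below evaluates the same disclosure before and after an item is reclassified. `ContextPolicy`, its labels, and the fail‑closed default are hypothetical; the authors describe this direction only as future work.

```python
from dataclasses import dataclass, field


@dataclass
class ContextPolicy:
    """Hypothetical mutable policy: item -> context label."""
    labels: dict[str, str] = field(default_factory=dict)
    shareable: set[str] = field(default_factory=lambda: {"public"})

    def reclassify(self, item: str, new_label: str) -> None:
        self.labels[item] = new_label

    def may_disclose(self, item: str) -> bool:
        # Unknown items default to "confidential" (fail closed).
        return self.labels.get(item, "confidential") in self.shareable


policy = ContextPolicy({"project_roadmap": "confidential"})

before = policy.may_disclose("project_roadmap")
policy.reclassify("project_roadmap", "public")
after = policy.may_disclose("project_roadmap")
unknown = policy.may_disclose("salary_sheet")
```

A test built on this idea would replay the same scenario at each policy state and check that the agent's behavior tracks the current labels rather than the ones it saw first.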

Practitioners interested in building safer agents can start by adopting the benchmark today and contributing new scenarios back to the community. The open‑source nature of AgentCIBench encourages collaborative improvement, much like the UBOS initiative promotes shared standards for AI safety.

In the longer term, we anticipate a shift where contextual integrity testing becomes a regulatory requirement for AI assistants handling personal data. Early adopters who embed these checks will not only mitigate risk but also gain a competitive advantage by demonstrating a commitment to privacy‑by‑design.

Conclusion

AgentCIBench shines a light on a hidden privacy hazard: computer‑use agents that are technically capable can still be careless about the context in which they operate. By providing a deterministic, model‑agnostic benchmark, the authors give the AI community a concrete tool to measure and improve contextual integrity. The high leakage rates observed across leading agents underscore the urgency of integrating privacy‑aware testing into every stage of agent development. As AI assistants become ubiquitous in enterprises and consumer products, frameworks like AgentCIBench will be essential for ensuring that capability does not come at the expense of user trust.

Read the full research paper for a deeper dive: Capable but Careless: Do Computer‑Use Agents Follow Contextual Integrity?


Andrii Bidochko

CTO UBOS

Andrii Bidochko is an AI entrepreneur and researcher focused on AI agents, reinforcement learning, and autonomous systems. He writes about the technologies shaping the future of machine intelligence, from frontier models and agent architectures to real-world AI applications.
