- Updated: June 10, 2026
Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles
Direct Answer
The paper introduces WIRE (Witnessed Intra‑policy Rule Evaluation), a systematic pipeline that discovers and diagnoses conflicting rule pairs hidden inside a single natural‑language prompt policy governing LLM agents, and then measures how those agents resolve the tension in real‑time responses or tool actions. This matters because undetected policy clashes can cause unpredictable behavior, compliance breaches, or safety failures in production AI agents.
Background: Why This Problem Is Hard
Large language model (LLM) agents are increasingly deployed with prompt policies: long‑form natural‑language instructions that encode standing rules, ethical guardrails, and operational constraints. Unlike hard‑coded logic, these policies evolve organically, are authored by multiple stakeholders, and are expressed in prose. The flexibility that makes them attractive also creates a blind spot: two individually sensible rules can intersect in a way that forces the agent to choose between them, yet the conflict remains invisible until the agent is exercised in a specific state.
Current safety and evaluation frameworks focus on inter‑policy checks (e.g., comparing a user request against a single rule) or on synthetic adversarial prompts that stress a model. They lack a mechanism to:
- Systematically enumerate rule pairs that could simultaneously apply.
- Validate that both rules are truly “governing” the same concrete situation.
- Observe how a model’s generation or tool usage reflects a resolution strategy (e.g., prioritizing one rule, compromising, or violating).
Because LLM agents often act in open‑ended environments—searching the web, invoking APIs, or manipulating files—undetected intra‑policy conflicts can manifest as subtle compliance violations, data leakage, or unintended tool usage. Detecting these conflicts before deployment is therefore a critical safety bottleneck.
What the Researchers Propose
The authors present the WIRE pipeline, a three‑stage framework that turns natural‑language policies into a machine‑readable rule representation, filters for genuine “hard collisions,” and then materializes concrete scenarios—called witnesses—where both rules are simultaneously active. WIRE consists of:
- Rule Extraction & PyRule Encoding: Parses the policy text, isolates atomic directives, and translates each into a logical clause expressed in a lightweight DSL named PyRule.
- Satisfiability Checking & Collision Detection: Uses a SAT solver to test whether two clauses can be true together under any state. Only pairs that survive this check are considered “hard‑collision candidates.”
- Witness Realization: Generates concrete state descriptions (e.g., a user query, a system context) that satisfy both clauses, then feeds these witnesses to the target LLM agent to observe its compliance decisions.
By grounding each rule in its original textual source and then evaluating model outputs against that source, WIRE preserves interpretability—researchers can trace a violation back to the exact sentence that was breached.
How It Works in Practice
The three stages expand into five steps that run as a linear pipeline, with each step consuming the output of the one before it:
1. Source‑Grounded Rule Extraction
Given a prompt policy (often several thousand words), a rule‑extraction module scans for imperative sentences, conditional clauses, and enumerated bullet points. Each extracted rule is stored with its original line number and surrounding context to maintain provenance.
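The provenance bookkeeping described in this step can be sketched as follows. This is a minimal illustration that assumes a naive keyword heuristic in place of the paper's extraction module, which the article does not specify; `SourceRule`, `extract_rules`, and the cue list are hypothetical names.

```python
import re
from dataclasses import dataclass

# Hypothetical cue words that mark a directive (an assumption, not the paper's method).
DIRECTIVE_CUES = re.compile(r"\b(never|always|must|do not|should|only)\b", re.IGNORECASE)


@dataclass(frozen=True)
class SourceRule:
    line_no: int   # 1-based line number in the policy text
    text: str      # the rule sentence as written
    context: str   # the line plus its neighbours, kept for provenance


def extract_rules(policy: str) -> list[SourceRule]:
    lines = policy.splitlines()
    rules = []
    for i, line in enumerate(lines):
        # Drop bullet markers and surrounding whitespace before matching.
        stripped = line.strip().lstrip("-* ").strip()
        if stripped and DIRECTIVE_CUES.search(stripped):
            context = "\n".join(lines[max(0, i - 1): i + 2])
            rules.append(SourceRule(i + 1, stripped, context))
    return rules


policy = """You are a support assistant.
- Never share personal identifiers.
- Always answer the user's question completely.
Be friendly."""
rules = extract_rules(policy)
for rule in rules:
    print(rule.line_no, rule.text)
```

Storing the line number and context on every rule is what later lets a verdict be traced back to the sentence it came from.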
2. PyRule Clause Generation
Each natural‑language rule is mapped to a logical predicate. For example, “Never share personal identifiers” becomes ¬share(id). The mapping is performed by a fine‑tuned LLM that has been trained on a curated dataset of policy‑to‑logic pairs, ensuring high fidelity while preserving the rule’s intent.
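As a rough illustration of the ¬share(id) example, a clause can be treated as a named predicate over a state. The article does not show PyRule's actual syntax, so the `Clause` class, the `share_id` variable, and the state dictionaries below are stand-ins rather than the DSL itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, bool]  # hypothetical world state: variable name -> truth value


@dataclass(frozen=True)
class Clause:
    name: str
    source_line: int                 # provenance back to the policy text
    holds: Callable[[State], bool]   # True when the state satisfies the clause


# "Never share personal identifiers" encoded as NOT share(id).
no_share_id = Clause("no_share_id", 2, lambda s: not s["share_id"])

compliant_state = {"share_id": False}
violating_state = {"share_id": True}
print(no_share_id.holds(compliant_state), no_share_id.holds(violating_state))
```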
3. SAT‑Based Collision Screening
Each pair of atomic clauses is passed to a SAT solver, which searches for a model (i.e., a hypothetical world state) in which both clauses evaluate to true. If no such model exists, the two rules cannot co‑govern any realistic situation, and the pair is discarded as a “soft” conflict. Surviving pairs are flagged as “hard‑collision candidates.”
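The screening logic can be sketched without a solver by enumerating every truth assignment, which is only workable for a handful of variables. The example below substitutes brute-force enumeration for the SAT solver and assumes each clause is modeled as the condition under which its rule governs a state; the variable and clause names are invented for illustration.

```python
from itertools import combinations, product
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, bool]
Clause = Callable[[State], bool]


def find_model(a: Clause, b: Clause, variables: List[str]) -> Optional[State]:
    """Return a state in which both clauses are true, or None if none exists."""
    for values in product([False, True], repeat=len(variables)):
        state = dict(zip(variables, values))
        if a(state) and b(state):
            return state
    return None


# Hypothetical state variables and clauses.
variables = ["user_asks_for_record", "record_has_pii"]
clauses: Dict[str, Clause] = {
    "answer_fully": lambda s: s["user_asks_for_record"],
    "protect_pii": lambda s: s["record_has_pii"],
    "no_records_requested": lambda s: not s["user_asks_for_record"],
}

# Keep only the pairs that can be true together in at least one state.
candidates: List[Tuple[str, str]] = [
    (p, q)
    for p, q in combinations(clauses, 2)
    if find_model(clauses[p], clauses[q], variables) is not None
]
print(candidates)
```

The state returned by `find_model` is an abstract assignment of truth values. Turning it into a concrete user request or tool argument is the job of the witness-construction step.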
4. Witness Construction
For each candidate pair, a separate generation module crafts a concrete scenario that satisfies both logical predicates. This often involves prompting a secondary LLM to produce a user request, system variables, or tool arguments that fulfill the constraints. The result is a witness—a fully specified input that should trigger both rules.
5. Agent Evaluation & Compliance Judgement
The primary LLM agent is invoked with the witness as context. Its response (textual output, tool call, or action) is then compared against the original source rules using a rule‑judgement module. The module labels each rule as “compliant,” “violated,” or “inconclusive,” producing a resolution profile for that witness.
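One way to picture a resolution profile is as the share of trials falling into each outcome for a rule pair. The sketch below assumes each trial yields one label per rule from the set named above; the outcome names (`joint_compliance`, `rule_a_only`, and so on) are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

Verdict = Tuple[str, str]  # (label for rule A, label for rule B)


def classify(verdict: Verdict) -> str:
    a, b = verdict
    if "inconclusive" in (a, b):
        return "inconclusive"
    if a == "compliant" and b == "compliant":
        return "joint_compliance"
    if a == "compliant":
        return "rule_a_only"   # rule B was violated
    if b == "compliant":
        return "rule_b_only"   # rule A was violated
    return "both_violated"


def resolution_profile(trials: List[Verdict]) -> Dict[str, float]:
    """Fraction of trials per outcome; an empty trial list gives an empty profile."""
    if not trials:
        return {}
    counts = Counter(classify(v) for v in trials)
    return {outcome: n / len(trials) for outcome, n in counts.items()}


trials = [
    ("compliant", "compliant"),
    ("compliant", "violated"),
    ("compliant", "violated"),
    ("violated", "compliant"),
]
profile = resolution_profile(trials)
print(profile)
```

A profile skewed toward `rule_a_only` would suggest that the agent tends to let rule A win when the two collide.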
What sets WIRE apart is the end‑to‑end traceability: from raw policy text, through logical encoding, to a concrete test case, and finally to a human‑readable compliance verdict. Because the pipeline can be automated, it can be rerun across many policies, which suits teams that manage large agent fleets.
Evaluation & Results
The authors applied WIRE to six publicly available prompt policies, ranging from open‑source AI assistants to commercial chatbot guardrails. Across these policies, WIRE extracted a total of 276 source rules, which were broken down into 560 atomic PyRule clauses. The pairwise comparison of clauses yielded 30,944 potential interactions; after SAT filtering, 170 hard‑collision candidate pairs remained. These candidates were instantiated into 1,402 concrete witnesses.
In a “policy‑only” evaluation—where the agent’s output was judged solely against the source rules—these witnesses generated 13,335 post‑generation trials. The compliance analysis revealed:
- Only 35.4% of trials resulted in joint compliance (both rules satisfied).
- The remaining 64.6% violated at least one of the governing rules, indicating a prevalent tendency for agents to prioritize one rule over another or to ignore a rule altogether.
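The reported counts imply a few ratios that help put the funnel in perspective. The short calculation below uses only the figures quoted above; the joint-compliance trial count is approximate because it is back-computed from a rounded percentage.

```python
# Figures as reported in the article.
pairs_screened = 30_944
hard_candidates = 170
witnesses = 1_402
trials = 13_335
joint_rate = 0.354  # rounded percentage, so derived counts are approximate

survival_pct = round(100 * hard_candidates / pairs_screened, 2)   # share of pairs surviving SAT filtering
witnesses_per_pair = round(witnesses / hard_candidates, 2)        # average witnesses per candidate pair
trials_per_witness = round(trials / witnesses, 2)                 # average trials per witness
approx_joint_trials = round(trials * joint_rate)                  # rough count of jointly compliant trials

print(survival_pct, witnesses_per_pair, trials_per_witness, approx_joint_trials)
```

Roughly half a percent of screened pairs survive filtering, and each surviving pair is exercised by about eight witnesses on average.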
Beyond raw percentages, the study uncovered distinct resolution patterns:
- Rule Dominance: Certain rules (e.g., safety‑first constraints) consistently overrode others (e.g., user‑preference directives).
- Tool‑Action Divergence: When agents were allowed to invoke external tools, they sometimes resolved conflicts by delegating to a tool that implicitly satisfied one rule while sidestepping the other.
- Contextual Sensitivity: Minor changes in the witness (e.g., phrasing of a user request) could flip the compliance outcome, highlighting the fragility of intra‑policy interactions.
These findings demonstrate that intra‑policy conflicts are not rare edge cases; they are a systematic source of non‑compliance that can be quantified and categorized using WIRE.
Why This Matters for AI Systems and Agents
For practitioners building LLM‑driven agents, WIRE offers a concrete safety net:
- Proactive Conflict Detection: Teams can run WIRE on their prompt policies before release, exposing hidden rule clashes that would otherwise appear only in production incidents.
- Compliance Auditing: The source‑grounded judgments enable auditors to trace violations back to specific policy sentences, simplifying regulatory reporting.
- Iterative Prompt Engineering: By exposing which rules dominate or cause failures, engineers can rewrite or reorder policy clauses to achieve the desired hierarchy.
- Tool‑Orchestration Design: Understanding how agents leverage tools to resolve conflicts informs the design of safer tool‑calling frameworks.
These capabilities align directly with the needs of enterprises seeking trustworthy AI. For example, the UBOS platform overview provides a unified environment where WIRE‑style diagnostics could be integrated into continuous deployment pipelines, ensuring that every new prompt iteration passes a conflict‑resolution sanity check.
Moreover, marketing automation agents, such as those described on the AI marketing agents page, must balance brand guidelines, legal compliance, and personalization rules. WIRE can surface hidden tensions between these objectives before a campaign goes live, reducing the risk of brand damage or regulatory breaches.
What Comes Next
While WIRE marks a significant step forward, several limitations and open research avenues remain:
- Scalability to Massive Policies: The pairwise SAT check grows quadratically with the number of clauses. Future work could explore hierarchical clustering or approximate collision detection to handle policies with thousands of rules.
- Dynamic Policy Evolution: Policies are often updated incrementally. Incremental WIRE pipelines that only re‑evaluate changed sections would reduce computational overhead.
- Multi‑Agent Interactions: Current experiments focus on a single agent. Extending WIRE to evaluate conflicts that emerge across cooperating agents (e.g., a planner and an executor) is an exciting direction.
- Human‑in‑the‑Loop Feedback: Integrating expert judgments into the witness‑generation step could improve the realism of test scenarios, especially for domain‑specific policies.
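The scalability and incremental-update points above can be made concrete with a small calculation. The sketch below shows how the number of pairwise checks grows with clause count and one possible reading of incremental re-evaluation: re-checking only the pairs that involve a changed clause. The function `pairs_to_recheck` and the clause ids are hypothetical, and this is not a mechanism described in the paper.

```python
from itertools import combinations
from math import comb
from typing import List, Sequence, Tuple


def pairs_to_recheck(all_ids: Sequence[str], changed_ids: Sequence[str]) -> List[Tuple[str, str]]:
    """Pairs that involve at least one changed clause; all other pairs keep their old result."""
    changed = set(changed_ids)
    return [(a, b) for a, b in combinations(all_ids, 2) if a in changed or b in changed]


# Pairwise checks for n clauses number n * (n - 1) / 2.
pairs_1000 = comb(1000, 2)
pairs_2000 = comb(2000, 2)

clause_ids = ["c1", "c2", "c3", "c4", "c5"]
recheck = pairs_to_recheck(clause_ids, ["c2"])
print(pairs_1000, pairs_2000, len(recheck), comb(len(clause_ids), 2))
```

Doubling the clause count roughly quadruples the number of checks, while editing a single clause out of n requires only n − 1 re-checks under this scheme.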
From an application standpoint, integrating WIRE into existing AI orchestration tools could unlock new safety guarantees. The Workflow automation studio already supports custom validation steps; adding a WIRE module would let teams automatically reject deployments that exhibit a high conflict‑violation rate.
For developers who rely on third‑party LLM APIs, the OpenAI ChatGPT integration could be paired with WIRE‑generated test suites to monitor compliance drift as model updates roll out.
Finally, community‑driven repositories of benchmark policies and witness sets would accelerate research on intra‑policy conflict resolution, fostering a shared safety infrastructure across the AI ecosystem.
References
Yan, L., Chen, X., & Zhang, X. (2026). Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles. arXiv preprint arXiv:2605.27784.
