Carlos
  • Updated: January 26, 2026
  • 6 min read

ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation


Direct Answer

The paper introduces Illocution‑Calibrated Policy Optimization (ICPO), a reinforcement‑learning framework that teaches large language models (LLMs) to recognize and respond to the illocutionary force of user utterances in multi‑turn dialogues. By aligning policy updates with the speaker’s intended act—whether asking, asserting, or seeking clarification—ICPO dramatically reduces “lost‑in‑conversation” failures and improves the model’s humility and clarification behavior.

Background: Why This Problem Is Hard

Multi‑turn conversational agents must continuously infer the user’s intent, track context, and decide when to answer, ask for clarification, or admit uncertainty. In practice, LLMs trained with standard supervised or reinforcement learning from human feedback (RLHF) excel at single‑turn tasks but stumble when the dialogue stretches over several exchanges. Two intertwined challenges explain this gap:

  • Illocutionary ambiguity: Users often embed multiple speech acts in a single prompt (e.g., “Can you tell me how to fix my laptop? Also, what’s the warranty?”). Traditional models treat the prompt as a flat text string, missing the nuanced intent behind each clause.
  • Contextual drift: As the conversation progresses, earlier clarifications or corrections can be forgotten, leading the model to repeat mistakes or provide over‑confident answers to questions it does not truly understand.

Existing RLHF pipelines address safety and factuality but rarely incorporate a formal representation of illocutionary force. Consequently, agents either over‑answer (risking hallucination) or under‑answer (failing to provide useful information), especially when faced with ambiguous or compound queries.

What the Researchers Propose

ICPO reframes policy optimization as a two‑step process:

  1. Illocution detection: A lightweight classifier parses each user turn to infer the dominant speech act(s): question, request, confirmation, statement, or clarification request.
  2. Calibrated policy update: The LLM’s response policy is conditioned not only on the textual context but also on the detected illocutionary label. The reinforcement signal rewards responses that correctly match the intended act (e.g., asking a follow‑up question when clarification is needed) and penalizes mismatches.
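
The two-step decision above can be sketched as a toy pipeline. This is purely illustrative: the keyword rules stand in for the paper's learned Illocution Encoder, and the act taxonomy and placeholder responses are assumptions, not the authors' implementation.

```python
# Toy sketch of ICPO's two-step decision: detect the speech act, then
# condition the response policy on it. Keyword heuristics here stand in
# for the learned Illocution Encoder described in the paper.

ACTS = ("question", "request", "confirmation", "clarification", "statement")

def detect_act(turn: str) -> str:
    """Crude stand-in for the illocution classifier."""
    text = turn.lower()
    if text.endswith("?"):
        return "question"
    if any(w in text for w in ("please", "can you", "could you")):
        return "request"
    if any(w in text for w in ("so you mean", "to confirm")):
        return "confirmation"
    if any(w in text for w in ("what do you mean", "unclear")):
        return "clarification"
    return "statement"

def respond(turn: str) -> str:
    """Policy conditioned on the detected act (placeholder outputs)."""
    act = detect_act(turn)
    if act == "question":
        return "ANSWER: ..."
    if act == "clarification":
        return "CLARIFY: Could you say more about what you meant?"
    return f"ACK ({act})"
```

In the real system both steps are learned end to end; the point of the sketch is the control flow, not the classifier.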

Key components include:

  • Illocution Encoder: a fine‑tuned transformer that outputs a probability distribution over a predefined set of illocutionary categories.
  • Policy Network: the base LLM whose logits are modulated by the encoder’s output via a gating mechanism, effectively biasing generation toward the appropriate act.
  • Reward Model: an extension of standard RLHF reward functions that incorporates an “act‑alignment” term, measuring how well the generated response fulfills the detected intent.
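
One way to realize the gating mechanism described above is to project the encoder's act distribution into vocabulary space and add it to the policy's logits. The additive form and the random projection below are illustrative assumptions; the paper's learned gating may differ.

```python
import numpy as np

# Sketch of act-conditioned gating: the encoder's act distribution is
# projected into vocabulary space and added to the LLM's logits, biasing
# generation toward tokens consistent with the detected speech act.
# Sizes are tiny and the projection is random purely for illustration.

rng = np.random.default_rng(0)
vocab_size, n_acts = 16, 5

base_logits = rng.normal(size=vocab_size)               # from the policy network
act_probs = np.array([0.68, 0.22, 0.10, 0.0, 0.0])      # from the encoder
act_projection = rng.normal(size=(n_acts, vocab_size))  # learned in practice

gated_logits = base_logits + act_probs @ act_projection

# Softmax over the gated logits gives the next-token distribution.
probs = np.exp(gated_logits - gated_logits.max())
probs /= probs.sum()
```

Because the bias enters before the softmax, a confident act prediction shifts probability mass without ever forcing a particular token.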

How It Works in Practice

The ICPO workflow can be visualized as a loop that repeats for every turn in a conversation:

[Figure: ICPO workflow diagram]

  1. User input arrives. The raw text is fed simultaneously to the Illocution Encoder and the LLM’s context buffer.
  2. Illocution inference. The encoder predicts a distribution such as {question: 0.68, clarification: 0.22, statement: 0.10}. The highest‑probability label is selected as the target act.
  3. Policy conditioning. The LLM receives a concatenated embedding that combines the original token embeddings with a learned “act embedding” derived from the encoder’s output.
  4. Response generation. The model produces a token sequence. If the act is “clarification,” the response will typically be a probing question; if “statement,” it will provide a concise answer.
  5. Reward evaluation. The reward model scores the response on factuality, safety, and act‑alignment. The combined reward drives a proximal policy optimization (PPO) update.
  6. Loop continuation. The new system state (including any clarification asked) is stored, and the next user turn repeats the cycle.
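
The six steps above can be condensed into a minimal sketch of one turn. Every component (encoder, policy, reward model, optimizer) is a placeholder passed in as a function, and the equal reward weights are an assumption, not the paper's configuration.

```python
# Minimal sketch of one ICPO turn: detect the act, generate a conditioned
# response, score it, and feed the combined reward to the optimizer.
# All components are injected callables; the PPO update is abstracted away.

def icpo_turn(user_text, encoder, policy, reward_model, optimizer,
              w_fact=1.0, w_safe=1.0, w_act=1.0):
    act_probs = encoder(user_text)                   # step 2: illocution inference
    act = max(act_probs, key=act_probs.get)          # target act = argmax label
    response = policy(user_text, act)                # steps 3-4: conditioned generation
    scores = reward_model(user_text, response, act)  # step 5: per-aspect scores
    reward = (w_fact * scores["factuality"]
              + w_safe * scores["safety"]
              + w_act * scores["act_alignment"])
    optimizer(reward)                                # PPO update (abstracted)
    return response, reward
```

Plugging in trivial stubs for the four components is enough to trace the loop end to end before wiring in real models.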

What sets ICPO apart is the explicit separation of “what to say” (the content) from “how to say it” (the speech act). This decoupling lets the same underlying LLM be reused across domains while the illocutionary layer adapts to task‑specific conversational norms.

Evaluation & Results

Researchers benchmarked ICPO on two multi‑turn dialogue suites:

  • Multi‑Turn QA (MT‑QA): A dataset of 10,000 conversations where each question may contain multiple sub‑questions or implicit requests for clarification.
  • Customer‑Support Sim (CSS): Simulated support chats that require agents to ask follow‑up questions, admit uncertainty, or defer to human operators.

Key findings include:

  • ICPO reduced the “lost‑in‑conversation” error rate by 38% compared to a strong RLHF baseline, measured as the proportion of turns where the model’s response failed to address the user’s primary intent.
  • Human evaluators rated ICPO‑augmented agents 1.7 points higher on a 5‑point humility scale, indicating more frequent and appropriate clarification requests.
  • Overall task success (correct answer + appropriate act) improved from 62% to 78% on MT‑QA, demonstrating that act‑aware policies boost both accuracy and conversational flow.

Crucially, these gains came without sacrificing the model’s fluency: the illocutionary encoder added at most 12 ms of inference latency per turn, confirming that it introduces minimal overhead.

Why This Matters for AI Systems and Agents

From a systems‑building perspective, ICPO offers a pragmatic path to more reliable conversational agents:

  • Improved user experience: By asking clarifying questions when needed, agents avoid the frustration of receiving irrelevant or incorrect answers.
  • Safer deployments: Act‑aligned rewards discourage over‑confident hallucinations, a common failure mode in production LLM services.
  • Modular integration: The illocution encoder can be swapped or fine‑tuned independently, enabling rapid adaptation to new domains (e.g., finance, healthcare) without retraining the entire LLM.
  • Orchestration benefits: In multi‑agent pipelines, the act label can serve as a routing signal, directing user turns to specialized sub‑agents (e.g., a fact‑retrieval module for questions, a policy‑clarification module for ambiguous requests). See our guide on agent orchestration strategies for practical patterns.
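
The routing idea in the last bullet is simple to sketch: treat the act label as a dispatch key. The sub-agent names below are hypothetical, chosen only to mirror the examples above.

```python
# Sketch of act-label routing in a multi-agent pipeline. The agent names
# are hypothetical placeholders, not components from the paper.

ROUTES = {
    "question": "fact_retrieval_agent",
    "request": "task_execution_agent",
    "clarification": "policy_clarification_agent",
}

def route(act: str) -> str:
    # Acts without a dedicated handler fall back to a general agent.
    return ROUTES.get(act, "general_agent")
```

Because the label is produced upstream by the illocution encoder, the router itself stays trivially simple and easy to extend per domain.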

For developers building AI assistants, ICPO reduces the engineering burden of hand‑crafting fallback heuristics. Instead, the model learns when to be humble and when to proceed confidently, aligning more closely with human conversational norms.

What Comes Next

While ICPO marks a significant step forward, several open challenges remain:

  • Richer illocutionary taxonomies: The current set of five acts may be insufficient for nuanced domains such as legal advice or therapeutic chat, where speech acts like “express empathy” or “provide reassurance” are critical.
  • Cross‑lingual generalization: Extending the encoder to multilingual settings without sacrificing act detection accuracy is an active research frontier.
  • Long‑term consistency: Maintaining act alignment over dozens of turns still poses difficulties; future work could integrate memory‑augmented architectures to preserve intent signals.
  • Human‑in‑the‑loop refinement: Incorporating real‑time user feedback on act appropriateness could further tighten the reward loop.

Potential applications span from next‑generation virtual assistants to autonomous negotiation bots. Researchers interested in building on this work can explore the open‑source ICPO repository (hypothetical link) and experiment with custom act sets.

We encourage the community to evaluate ICPO on their own dialogue corpora, share findings, and contribute to a shared benchmark for act‑aware conversational performance. Together, we can move toward agents that not only know what to say but also understand *why* they are saying it.

For a deeper dive into the methodology and full experimental details, refer to the original paper on arXiv.

