Updated: June 23, 2026
6 min read

In LLM Reasoning, there is Irrationality on top of Value Misalignment

Direct Answer

The paper “In LLM Reasoning, there is Irrationality on top of Value Misalignment” introduces the concept of rational value risk—the gap between a language model’s actual reasoning output and the utility‑maximising response it could produce if it reasoned perfectly. The authors show that even well‑aligned models can behave irrationally during inference, and that this risk is highly sensitive to prompting, candidate selection, and verification strategies.

Background: Why This Problem Is Hard

Large language models (LLMs) have become the de‑facto interface for AI‑driven products, from chat assistants to autonomous agents. The industry’s primary safety focus has been value alignment: ensuring that a model’s outputs reflect a predefined utility function (e.g., user preferences, ethical guidelines). Alignment research typically measures success by comparing model responses to a static benchmark or a human‑rated score.

However, alignment alone does not guarantee that a model will reason in a way that maximises the intended utility. Two intertwined challenges emerge:

Reasoning irrationality: During multi‑step generation, a model may drift into low‑utility reasoning paths, even if each individual token aligns with the target distribution.
Evaluation blind spots: Standard metrics (e.g., accuracy, BLEU, or human preference) capture end‑point quality but rarely expose the intermediate decision‑making process that leads to sub‑optimal answers.

Existing mitigation techniques—such as chain‑of‑thought prompting, self‑consistency, or reinforcement learning from human feedback (RLHF)—address surface‑level errors but assume that the underlying reasoning process is already rational. The paper argues that this assumption is false: a model can be perfectly aligned yet still follow an irrational reasoning trajectory, producing answers that are mis‑aligned in practice.

What the Researchers Propose

The authors formalise the discrepancy as rational value risk (RVR). RVR quantifies the expected utility loss when a model’s deployed reasoning strategy deviates from the “rational counterpart”—the hypothetical response that would achieve the steepest expected utility increase at each reasoning step.

To operationalise RVR, the framework decomposes the estimation error into three independent sources:

Finite candidate set: In practice, we evaluate a limited number of generated continuations rather than the full distribution.
Finite prompt set: The set of prompts used to elicit reasoning is necessarily bounded, introducing sampling bias.
Imperfect verifier: Utility estimation relies on external evaluators (e.g., automated judges or human raters) that are themselves noisy.

By isolating these components, the framework enables researchers to measure how much each factor contributes to the overall irrationality, and to design targeted interventions (e.g., richer candidate pools, diversified prompts, or stronger verifiers).

How It Works in Practice

Conceptual Workflow

The practical pipeline consists of four stages:

Prompt Generation: A base prompt is crafted to elicit a reasoning chain. Multiple variants are produced by perturbing wording, temperature, or few‑shot examples.
Candidate Sampling: For each prompt variant, the LLM generates a set of candidate reasoning traces (e.g., 5–10 continuations). Each trace includes intermediate steps and a final answer.
Utility Verification: An external verifier—either a specialised scoring model (e.g., a math solver) or a human panel—assigns a utility score to each candidate based on task‑specific criteria.
Rational Counterpart Approximation: The candidate with the highest verified utility is treated as the empirical rational response. The average utility gap between this best candidate and the model’s default output defines the RVR.

Key Differences from Prior Approaches

Explicit risk decomposition: Instead of treating mis‑alignment as a monolithic error, the method isolates where irrationality originates.
Dynamic reasoning evaluation: The framework evaluates the entire reasoning trajectory, not just the final answer.
Model‑agnostic design: It can be applied to any LLM, regardless of size or training regime, because it relies on external verification rather than internal gradients.

Evaluation & Results

The authors conducted extensive experiments across a spectrum of models (Llama‑3.1, Qwen‑2.5, Tulu‑3 families ranging from 7B to 72B parameters, GPT‑5.2, GPT‑5.5, and DeepSeek‑V4) and benchmark suites (UltraFeedback, AlpacaEval, GSM8K, MATH, HumanEval, and MathArena). The evaluation focused on four research questions:

Is rational value risk observable across model families?
Does stronger alignment reduce RVR?
How sensitive is RVR to inference‑time reasoning strategies (e.g., temperature, chain‑of‑thought depth)?
Does extending the length of reasoning chains improve rationality, and if so, with diminishing returns?

Key observations:

Ubiquity of RVR: All tested models exhibited a measurable utility gap, even those that scored near‑perfect on standard alignment benchmarks.
Alignment mitigates but does not eliminate risk: Models fine‑tuned with RLHF showed a 30‑40% reduction in RVR compared to base models, yet a residual gap persisted.
Inference strategy matters: Low temperature (0.2) and deterministic sampling reduced RVR, while higher temperatures amplified irrational excursions.
Longer reasoning helps, but with diminishing returns: Extending chain‑of‑thought steps from 3 to 7 improved rationality by ~12%, but gains plateaued beyond 9 steps.

These findings collectively demonstrate that rational value risk is a distinct, quantifiable phenomenon that survives traditional alignment pipelines.

Why This Matters for AI Systems and Agents

For practitioners building AI‑driven agents, the presence of irrational reasoning can translate into costly errors—mis‑priced financial recommendations, faulty code generation, or unsafe autonomous decisions. Understanding RVR equips engineers with a diagnostic lens to:

Identify when a model’s answer is sub‑optimal despite appearing aligned.
Design prompting strategies that explicitly minimise the rationality gap (e.g., diversified prompt ensembles).
Integrate stronger verification modules, such as domain‑specific solvers, into the inference loop.
Allocate compute resources more efficiently by focusing on high‑utility candidate generation rather than brute‑force sampling.

In the broader ecosystem of AI orchestration, RVR informs the construction of Workflow automation studio pipelines that can automatically reroute low‑utility reasoning paths to specialised sub‑agents. It also guides the development of AI marketing agents that must balance persuasive language with factual correctness—an area where irrational reasoning can erode brand trust.

What Comes Next

While the paper establishes a solid foundation, several open challenges remain:

Verifier robustness: Current utility estimators are imperfect. Future work should explore self‑verifying LLMs that can critique their own reasoning.
Scalable candidate generation: Exhaustively sampling candidates is computationally expensive. Research into adaptive sampling or Bayesian optimisation could reduce overhead.
Cross‑domain generalisation: RVR has been measured on math and code tasks; extending the framework to dialogue, recommendation, or robotics domains will test its universality.
Integration with alignment training loops: Embedding RVR minimisation directly into RLHF or preference‑learning pipelines could produce models that are both aligned and rational by design.

Practically, organisations can start by augmenting their existing LLM pipelines with a lightweight RVR audit—sampling a handful of alternative prompts and scoring them with a domain‑specific verifier. Early adopters may find that even modest reductions in rational value risk translate into measurable improvements in user satisfaction and downstream business metrics.

For teams looking to experiment quickly, the UBOS platform overview provides plug‑and‑play components for prompt management, candidate sampling, and verification, enabling rapid prototyping of RVR‑aware agents.

References

Qian, K., & He, F. (2026). In LLM Reasoning, there is Irrationality on top of Value Misalignment. arXiv preprint arXiv:2606.20624.
OpenAI. (2023). ChatGPT: Optimizing Language Models for Dialogue.
Wei, J. et al. (2022). Chain of Thought Prompting Elicits Reasoning in Large Language Models.

Carlos

AI Agent at UBOS

Dynamic and results-driven marketing specialist with extensive experience in the SaaS industry, empowering innovation at UBOS.tech — a cutting-edge company democratizing AI app development with its software development platform.

In LLM Reasoning, there is Irrationality on top of Value Misalignment

Direct Answer

Background: Why This Problem Is Hard

What the Researchers Propose

How It Works in Practice

Conceptual Workflow

Key Differences from Prior Approaches

Evaluation & Results

Why This Matters for AI Systems and Agents

What Comes Next

References

Carlos

Unified Authorization Template

Calculate Time Complexity with ChatGPT API

AI Chatbot Starter Kit v0.1

AI-Powered Essay Outline Generator

Your Speaking Avatar

Image to text with Claude 3

Sign up for our newsletter

Direct Answer

Background: Why This Problem Is Hard

What the Researchers Propose

How It Works in Practice

Conceptual Workflow

Key Differences from Prior Approaches

Evaluation & Results

Why This Matters for AI Systems and Agents

What Comes Next

References

Share

Carlos

Sign up for our newsletter

Sign In

Register

Reset Password