Carlos
  • Updated: June 25, 2026
  • 6 min read

Coherence Under Commitment: Probing Generalization and Vacuous Memorization in LLM Logical Reasoning

Direct Answer

The paper introduces Coherence Under Commitment (CUC), a dual‑query evaluation framework that simultaneously measures logical consistency and decisive commitment in large language models (LLMs). By quantifying how often a model abstains versus makes a firm entailment or refutation, CUC reveals when apparent coherence is merely vacuous, a pitfall that traditional metrics miss.

Background: Why This Problem Is Hard

LLMs are increasingly deployed as reasoning engines in knowledge‑intensive domains such as legal analysis, scientific discovery, and autonomous agents. In these settings, a model must not only avoid contradictions (e.g., saying both “A” and “¬A”) but also provide actionable answers. Existing evaluation pipelines focus on negation consistency—checking that a model’s “YES” and “NO” responses are logically aligned—but they ignore the model’s willingness to commit. A model can achieve perfect consistency simply by answering “I don’t know” to every query, delivering zero utility while still passing coherence checks.

Current benchmarks (e.g., LogiQA, FOLIO) report high consistency scores for many open‑weight LLMs, yet they provide no visibility into coverage or the trade‑off between certainty and correctness. This blind spot hampers developers who need to know whether a model will actually make decisions in production or merely hide behind uncertainty.

What the Researchers Propose

The authors present a three‑pronged framework—CUC—that reframes logical‑reasoning evaluation around both coherence and commitment:

  • Commitment Score: A scalar c(φ) = p(φ) + p(¬φ) that captures the total probability mass a model assigns to a decisive outcome (either entailment or refutation). Higher scores indicate that the model is willing to take a stance.
  • Deterministic Elicitation Protocol: Instead of sampling multiple generations, the protocol extracts normalized log‑probabilities for the tokens “YES” and “NO”. This eliminates stochastic variance and yields reproducible scores.
  • 3‑Way Decision Framework: The model’s output is mapped to one of three buckets—True, False, or Uncertain—based on a configurable threshold on the commitment score. This operationalizes the coherence‑commitment trade‑off into concrete metrics: coverage (percentage of queries with a firm decision) and negation violation rate (how often True/False pairs contradict each other).

Collectively, these components let researchers plot a “frontier” that shows how much coverage can be achieved before contradictions rise, exposing vacuous memorization that would otherwise be hidden.
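Written out, the definitions above take the following form. This is a sketch reconstructed from the description in this summary, not the paper's own notation: the tie-breaking rule (ties go to True), the strict inequality at τ, and the symbols d_τ, Cov, and N (the number of evaluated items) are assumptions made for illustration.

```latex
% Commitment score (as defined above)
c(\varphi) = p(\varphi) + p(\neg\varphi)

% Three-way decision at threshold \tau (tie-breaking is an assumption)
d_\tau(\varphi) =
\begin{cases}
  \text{Uncertain} & \text{if } c(\varphi) \le \tau \\
  \text{True}      & \text{if } c(\varphi) > \tau \text{ and } p(\varphi) \ge p(\neg\varphi) \\
  \text{False}     & \text{if } c(\varphi) > \tau \text{ and } p(\varphi) < p(\neg\varphi)
\end{cases}

% Coverage: share of the N items that receive a firm decision
\mathrm{Cov}(\tau) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[ c(\varphi_i) > \tau \right]
```

For example, with p(φ) = 0.6, p(¬φ) = 0.1, and τ = 0.5, the commitment score is 0.7, so the item counts as committed and is labeled True.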

How It Works in Practice

The CUC workflow consists of four sequential stages:

  1. Prompt Construction: For each logical statement φ, two prompts are generated—one asking “Is φ true?” and another asking “Is ¬φ true?” Both prompts are fed to the same LLM.
  2. Log‑Probability Extraction: The model returns log‑probabilities for the tokens “YES” and “NO”. These are normalized to produce probabilities p(φ) and p(¬φ).
  3. Commitment Scoring: The commitment score c(φ) is computed as the sum of the two probabilities. If c(φ) exceeds a pre‑defined threshold τ, the model is considered “committed”. Otherwise, it is labeled “Uncertain”.
  4. Decision Mapping: A committed model’s higher‑probability side determines the final label (True or False). The three‑way decision (True/False/Uncertain) is recorded for downstream analysis.

What sets CUC apart is its deterministic nature—no random sampling, no temperature tuning—so results are reproducible across runs and hardware. Moreover, by treating the two complementary queries as a single joint evaluation, CUC directly captures the logical relationship between a statement and its negation.
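The four stages can be sketched in a few lines of Python. This is a minimal illustration, not the authors' toolkit: the prompt template, the function names, and the example log-probabilities are hypothetical, and it assumes that p(φ) is the probability of “YES” on the φ query after renormalizing over the two tokens “YES” and “NO” (and likewise for ¬φ). No model is called; the log-probabilities are passed in directly.

```python
import math
from typing import Tuple


def build_prompts(statement: str, negation: str) -> Tuple[str, str]:
    """Stage 1: one prompt for the statement, one for its negation.

    The template wording is a placeholder, not the paper's prompt.
    """
    template = "Is the following statement true? Answer YES or NO.\nStatement: {s}\nAnswer:"
    return template.format(s=statement), template.format(s=negation)


def yes_probability(logp_yes: float, logp_no: float) -> float:
    """Stage 2: renormalize the YES/NO log-probabilities so they sum to 1."""
    m = max(logp_yes, logp_no)  # subtract the max for numerical stability
    e_yes = math.exp(logp_yes - m)
    e_no = math.exp(logp_no - m)
    return e_yes / (e_yes + e_no)


def decide(p_phi: float, p_not_phi: float, tau: float) -> Tuple[float, str]:
    """Stages 3 and 4: commitment score, then the three-way label."""
    c = p_phi + p_not_phi
    if c <= tau:
        return c, "Uncertain"
    # Committed: the higher-probability side wins (ties go to True by assumption).
    return c, ("True" if p_phi >= p_not_phi else "False")


if __name__ == "__main__":
    # Hypothetical log-probabilities, as if read from a model's output.
    p_phi = yes_probability(logp_yes=-0.2, logp_no=-1.8)      # query about phi
    p_not_phi = yes_probability(logp_yes=-2.5, logp_no=-0.1)  # query about not-phi
    score, label = decide(p_phi, p_not_phi, tau=0.5)
    print(f"c = {score:.3f}, label = {label}")
```

Because every step is a pure function of the extracted log-probabilities, running it twice on the same inputs gives the same label, which is the reproducibility property described above.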

Evaluation & Results

The authors applied CUC to four open‑weight LLMs ranging from 1B to 3B parameters, testing on 204 examples from the FOLIO logical‑reasoning suite. Key observations include:

  • Coverage vs. Consistency Trade‑off: Qwen2.5‑3B achieved an almost negligible negation violation rate (≈0.025) but only answered 7.4% of queries decisively. In contrast, TinyLlama‑1.1B covered 79.4% of items but incurred a violation on every example.
  • Frontier Generalization: When the same CUC analysis was run on the LogiQA v2 benchmark, the coverage‑consistency frontier persisted with a correlation coefficient of 0.97, indicating that the phenomenon is not dataset‑specific.
  • Vacuous Memorization Detection: Models that appeared “perfectly coherent” under traditional metrics were re‑ranked when CUC’s coverage dimension was considered, exposing a class of models that memorize training data without genuine reasoning ability.

These findings demonstrate that CUC can differentiate between true logical competence and superficial consistency, providing a more nuanced picture of LLM capabilities.
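To make the two reported metrics concrete, the sketch below computes coverage and a negation violation rate from per-item probability pairs. Two things here are assumptions rather than details from the paper: a pair is counted as a violation when the model affirms both φ and ¬φ (both probabilities above 0.5), and the reported rates are taken over all 204 items. Under that second assumption, 7.4% coverage corresponds to about 15 items, 79.4% to about 162, and a violation rate of 0.025 to about 5. The toy data and function names are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # (p(phi), p(not phi)) for one item


def coverage(pairs: List[Pair], tau: float) -> float:
    """Fraction of items whose commitment score p + q exceeds tau."""
    return sum(1 for p, q in pairs if p + q > tau) / len(pairs)


def negation_violation_rate(pairs: List[Pair]) -> float:
    """Fraction of items where both phi and not-phi are affirmed.

    Assumed definition for illustration: both probabilities above 0.5.
    """
    return sum(1 for p, q in pairs if p > 0.5 and q > 0.5) / len(pairs)


def implied_count(rate: float, n_items: int) -> int:
    """Nearest whole number of items that a reported rate corresponds to."""
    return round(rate * n_items)


if __name__ == "__main__":
    toy = [(0.9, 0.05), (0.1, 0.1), (0.8, 0.7), (0.2, 0.75)]  # made-up values
    print("coverage:", coverage(toy, tau=0.5))
    print("violation rate:", negation_violation_rate(toy))
    print("items behind 7.4% of 204:", implied_count(0.074, 204))
```

Reporting both numbers side by side is what separates a model that is consistent because it reasons well from one that is consistent because it rarely commits.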

Why This Matters for AI Systems and Agents

For practitioners building AI‑driven agents, the CUC framework offers actionable insights:

  • Risk Management: Knowing the exact coverage of decisive answers helps quantify the uncertainty budget of an autonomous system, crucial for safety‑critical deployments.
  • Model Selection: CUC enables a fair comparison of models not just on accuracy but on the willingness to act, guiding choices between a highly cautious model and a more assertive one.
  • Orchestration Strategies: Agents can be designed to fall back to alternative tools (e.g., retrieval‑augmented generation or external knowledge bases) when CUC flags “Uncertain” responses, improving overall reliability.
  • Product Integration: Embedding CUC into evaluation pipelines, for example on the UBOS platform, helps verify that AI components meet both consistency and commitment standards before release.
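The orchestration idea in the list above, falling back to another tool when the label is “Uncertain”, can be sketched as a small routing function. The function name `route` and the stub fallback are hypothetical; in a real agent the fallback would be a retrieval step or an external knowledge-base lookup.

```python
from typing import Callable


def route(question: str, label: str, fallback: Callable[[str], str]) -> str:
    """Return the model's committed label, or defer to a fallback tool."""
    if label in ("True", "False"):
        return label  # the model committed, so use its answer directly
    if label == "Uncertain":
        return fallback(question)  # hand the query to another component
    raise ValueError(f"unexpected label: {label!r}")


def stub_retrieval(question: str) -> str:
    """Placeholder for a retrieval-augmented or knowledge-base lookup."""
    return f"escalated: {question}"


if __name__ == "__main__":
    print(route("Is the contract valid?", "True", stub_retrieval))
    print(route("Is the contract valid?", "Uncertain", stub_retrieval))
```

Keeping the fallback as a plain callable means the same routing logic works regardless of which tool handles the uncertain cases.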

What Comes Next

While CUC marks a significant step forward, several avenues remain open:

  • Threshold Optimization: Adaptive thresholds that consider task difficulty or downstream cost could make the 3‑way decision more context‑aware.
  • Scaling to Larger Models: Extending CUC to multi‑billion‑parameter LLMs will test whether the observed frontier persists at scale.
  • Integration with Agent Frameworks: Coupling CUC with the Workflow automation studio could automate the rerouting of uncertain queries to specialized modules.
  • Human‑in‑the‑Loop Feedback: Leveraging user corrections on “Uncertain” outputs can iteratively refine the commitment threshold and improve model calibration.
  • Broader Benchmarks: Applying CUC to multimodal reasoning tasks or real‑world QA datasets will validate its generality.
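As one simple reading of the threshold-optimization item above, the sketch below picks τ from a fixed grid by minimizing a downstream cost on labeled validation items. This is an illustration of the idea, not a method from the paper: the cost values, the grid, the toy data, and the function names are all hypothetical, and gold labels are restricted to “True” and “False” for brevity.

```python
from typing import List, Sequence, Tuple

Item = Tuple[float, float, str]  # (p(phi), p(not phi), gold label "True" or "False")


def total_cost(items: List[Item], tau: float,
               cost_wrong: float, cost_abstain: float) -> float:
    """Sum the cost of abstentions and of wrong committed answers at threshold tau."""
    cost = 0.0
    for p, q, gold in items:
        if p + q <= tau:
            cost += cost_abstain  # labeled Uncertain
        else:
            predicted = "True" if p >= q else "False"
            if predicted != gold:
                cost += cost_wrong  # committed but wrong
    return cost


def pick_threshold(items: List[Item], taus: Sequence[float],
                   cost_wrong: float = 5.0, cost_abstain: float = 1.0) -> float:
    """Return the grid value with the lowest total cost (first one wins on ties)."""
    return min(taus, key=lambda t: total_cost(items, t, cost_wrong, cost_abstain))


if __name__ == "__main__":
    validation = [  # made-up probabilities and gold labels
        (0.9, 0.05, "True"),
        (0.4, 0.35, "False"),
        (0.1, 0.8, "False"),
        (0.3, 0.2, "True"),
    ]
    print("chosen tau:", pick_threshold(validation, [0.2, 0.6, 0.8, 1.0]))
```

When a wrong answer costs much more than an abstention, the chosen threshold rises and coverage falls; when the two costs are equal, a lower threshold tends to win.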

By addressing these challenges, the community can move toward evaluation regimes that reward genuine reasoning rather than clever abstention. The authors have released an open‑source toolkit to standardize CUC assessments, inviting researchers and engineers to adopt the framework in their own pipelines.

References

Coherence Under Commitment: Probing Generalization and Vacuous Memorization in LLM Logical Reasoning – Noor Islam S. Mohammad & Mahmudul Hasan, arXiv 2026.


