Updated: June 28, 2026

Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards

Direct Answer

The paper introduces Adaptive Correct‑Only Efficiency Reward (ACOER), a training technique that lets large reasoning models become far more concise without sacrificing accuracy. By rewarding brevity *only* when the answer is correct and dynamically normalizing token budgets, ACOER avoids the “reward collapse” that has plagued earlier length‑penalizing methods, cutting token usage by more than 60% while improving exact‑match accuracy by 2.3–4.1 percentage points on the reported benchmarks.

Background: Why This Problem Is Hard

Reasoning‑oriented language models—used for math, code, and complex decision‑making—are notoriously verbose. Each extra token inflates inference cost, latency, and cloud spend, which directly impacts the scalability of AI‑powered products. Researchers have tried to curb this by adding length‑penalizing terms to the reward signal during reinforcement learning, most notably within the Group Relative Policy Optimization (GRPO) framework.

However, two intertwined challenges have persisted:

Reward collapse: When the penalty applies to all outputs, the optimizer learns to suppress the penalty by producing nonsensical, ultra‑short strings, eroding reasoning ability.
Normalization bias: GRPO’s group‑level normalization amplifies small advantage differences, causing the model to favor brevity even on incorrect answers, which spirals into instability.

These issues make it difficult to train models that are both efficient and reliable—a gap that directly limits the deployment of AI agents in latency‑sensitive environments such as real‑time customer support, autonomous planning, and edge devices.
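To make the normalization bias concrete, here is a minimal sketch of GRPO-style group standardization, in which each reward is centered on the group mean and divided by the group standard deviation. The penalty coefficient, the token counts, and the choice of population standard deviation are illustrative assumptions, not values from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group, GRPO-style."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: four INCORRECT answers (task reward 0) that differ
# only by a small uniform length penalty.
PENALTY_PER_TOKEN = 1e-4  # illustrative coefficient, not from the paper
lengths = [400, 120, 380, 20]
rewards = [-PENALTY_PER_TOKEN * n for n in lengths]
advantages = group_relative_advantages(rewards)

for n, r, a in zip(lengths, rewards, advantages):
    print(f"tokens={n:4d}  reward={r:+.4f}  advantage={a:+.2f}")
```

In this toy group the raw rewards span less than 0.05, yet after standardization the shortest wrong answer receives an advantage above 1. That is the mechanism by which a uniform penalty can push the policy toward short, incorrect outputs.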

What the Researchers Propose

The authors present a three‑part solution called Adaptive Correct‑Only Efficiency Reward (ACOER):

Correct‑only brevity bonus: A response earns a reward for being short only when its answer passes a correctness check (e.g., exact match or a verification script).
Dynamic budget normalization: Instead of a static length penalty, the system maintains a moving token‑budget target that adapts to the model’s recent performance, preventing runaway compression.
Control‑loop penalty adjustments: A feedback controller monitors token usage and accuracy, automatically scaling the brevity bonus up or down to keep both metrics within desired thresholds.

By isolating the efficiency incentive to correct outputs and continuously re‑balancing the reward magnitude, ACOER eliminates the structural loop that caused collapse in earlier GRPO variants.

How It Works in Practice

The ACOER workflow can be visualized as a loop with three interacting modules:

[Figure: ACOER workflow diagram]

1. Generation & Correctness Evaluation

The model generates a candidate answer. A lightweight verifier—often a rule‑based checker for math problems or a secondary model for open‑ended tasks—labels the response as correct or incorrect.
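A rule-based checker for numeric answers could look roughly like the sketch below. The function names `parse_number` and `is_correct` are hypothetical, and the normalization rules (dropping thousands separators and a trailing period, comparing exact fractions) are assumptions for illustration rather than the verifier used in the paper.

```python
from fractions import Fraction


def parse_number(text):
    """Parse strings such as '42', '3.50', '1,200' or '7/2' into an exact
    Fraction. Returns None when the text is not a plain number."""
    cleaned = text.strip().replace(",", "").rstrip(".")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None


def is_correct(model_answer, reference):
    """Exact-match check: numeric comparison when both sides parse as
    numbers, otherwise a case-insensitive string comparison."""
    got, want = parse_number(model_answer), parse_number(reference)
    if got is None or want is None:
        return model_answer.strip().lower() == reference.strip().lower()
    return got == want


print(is_correct("3.50", "7/2"))
print(is_correct("41", "42"))
```

Comparing exact fractions rather than floats avoids spurious mismatches between equivalent forms such as `3.50` and `7/2`.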

2. Adaptive Reward Assignment

If the answer is correct, the system computes a brevity bonus proportional to the gap between the current budget target and the response's token count, so responses that come in under the target earn more. Incorrect answers receive no brevity bonus, only the standard task reward.
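The reward rule described here can be sketched as a single function. This is an illustration under stated assumptions, not the authors' exact formula: the name `acoer_style_reward`, the `scale` parameter, the relative (rather than absolute) token gap, and the clipping to [-1, 1] are all hypothetical choices.

```python
def acoer_style_reward(task_reward, is_correct, n_tokens, budget, scale=0.5):
    """Task reward plus a brevity bonus that applies only to correct answers.

    Assumptions (not from the paper): the bonus is the relative gap between
    the budget and the token count, clipped to [-1, 1] and multiplied by
    `scale`. Keeping `scale` below the task reward for a correct answer
    means a long correct answer still outranks any incorrect one.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if not is_correct:
        return task_reward  # no length term at all for incorrect answers
    saving = (budget - n_tokens) / budget
    bonus = scale * max(-1.0, min(1.0, saving))
    return task_reward + bonus


# A correct answer using half the budget earns a bonus.
print(acoer_style_reward(1.0, True, n_tokens=30, budget=60))
# Incorrect answers score the same regardless of length.
print(acoer_style_reward(0.0, False, n_tokens=5, budget=60))
print(acoer_style_reward(0.0, False, n_tokens=500, budget=60))
```

Because the length term never touches incorrect responses, shortening a wrong answer earns nothing, which removes the incentive for ultra-short nonsense.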

3. Budget & Control Loop

A moving average of recent token counts defines the budget target. A proportional‑integral‑derivative (PID) controller observes two signals—accuracy and token usage—and adjusts the scaling factor of the brevity bonus to keep the model within a predefined efficiency‑accuracy envelope.
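A rough sketch of this loop follows, assuming a simple windowed moving average for the budget and a textbook discrete PID update. The class names, the gains, the way accuracy and token usage are folded into one error signal, and the clamping range are hypothetical, and the sketch omits refinements such as integral anti-windup.

```python
from collections import deque


class MovingBudget:
    """Budget target = mean of the most recent token counts."""

    def __init__(self, window=4, initial=100.0):
        self.counts = deque(maxlen=window)
        self.initial = initial

    def update(self, n_tokens):
        self.counts.append(n_tokens)

    @property
    def target(self):
        if not self.counts:
            return self.initial
        return sum(self.counts) / len(self.counts)


class PID:
    """Textbook discrete PID controller with a unit time step."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def control_error(accuracy, avg_tokens, accuracy_floor, token_goal):
    """Positive: push harder on brevity. Negative: back off.
    Accuracy takes priority over token usage (an assumed policy)."""
    if accuracy < accuracy_floor:
        return accuracy - accuracy_floor
    return max(0.0, (avg_tokens - token_goal) / token_goal)


def update_scale(scale, pid, error, lo=0.0, hi=0.9):
    """Apply the controller output and clamp the bonus scale to [lo, hi]."""
    return min(hi, max(lo, scale + pid.step(error)))


budget = MovingBudget(window=3)
for n in (90, 60, 30):
    budget.update(n)
print(budget.target)

pid = PID(kp=0.5, ki=0.1, kd=0.0)
err = control_error(accuracy=0.9, avg_tokens=80, accuracy_floor=0.8, token_goal=40)
print(update_scale(0.2, pid, err))
```

The resulting scale would then feed the brevity bonus for the next batch, so the pressure to shorten rises while accuracy holds and relaxes once accuracy dips below the floor.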

What sets ACOER apart from prior approaches is the decoupling of length incentives from incorrect outputs and the continuous, data‑driven calibration of the reward magnitude, which together prevent the optimizer from exploiting the penalty loop.

Evaluation & Results

The authors benchmarked ACOER on three widely used mathematical reasoning suites: GSM8K, MATH, and MMLU‑Math. Each test set contains problems that require multi‑step reasoning and precise numeric answers.

Experimental Setup

Base model: 70B‑parameter LLM fine‑tuned with standard GRPO.
Comparison groups: (a) GRPO with uniform length penalty, (b) GRPO with correct‑only length penalty (no adaptive control), and (c) ACOER.
Metrics: Exact‑match accuracy, average tokens per answer, and a composite efficiency score (accuracy ÷ tokens).
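The three metrics can be computed from per-example records as in the sketch below. The `summarize` function and the record layout are hypothetical; only the definition of the composite score (accuracy divided by average tokens) comes from the setup above.

```python
def summarize(records):
    """Compute accuracy, average tokens, and accuracy-per-token.

    `records` is an iterable of (is_correct, n_tokens) pairs, one per
    evaluated answer.
    """
    records = list(records)
    if not records:
        raise ValueError("no records to summarize")
    accuracy = sum(1 for ok, _ in records if ok) / len(records)
    avg_tokens = sum(n for _, n in records) / len(records)
    return {
        "accuracy": accuracy,
        "avg_tokens": avg_tokens,
        "efficiency": accuracy / avg_tokens,
    }


# Made-up records for illustration only.
demo = [(True, 20), (True, 40), (False, 60), (True, 40)]
print(summarize(demo))
```

Note that the composite score rises both when accuracy improves and when answers shrink, so it is best read alongside the two raw metrics rather than on its own.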

Key Findings

Accuracy uplift: ACOER improved exact‑match scores by 2.3–4.1 percentage points across the three benchmarks compared with the baseline.
Token reduction: Average token count dropped from 78 tokens (baseline) to 30 tokens, a reduction of about 61.5% (48 of 78 tokens), without compromising correctness.
Stability: Training curves for ACOER remained smooth, whereas the uniform penalty variant exhibited sudden drops in accuracy after a few epochs, confirming the collapse phenomenon.
Generalization: When evaluated on unseen reasoning tasks (e.g., logical puzzles), ACOER retained its efficiency gains, suggesting the method is not over‑fitted to a specific dataset.

These results suggest that ACOER not only curbs verbosity but also nudges the model toward more disciplined reasoning pathways, likely because the brevity bonus rewards concise, logically tight solutions.

Why This Matters for AI Systems and Agents

For practitioners building AI agents, the trade‑off between speed, cost, and reliability is a daily concern. ACOER directly addresses this triad:

Cost efficiency: Reducing token consumption translates to lower API bills and makes large‑scale deployments financially viable.
Latency reduction: Shorter outputs mean faster round‑trip times, which is critical for real‑time assistants, autonomous decision loops, and interactive tutoring systems.
Reliability: By tying brevity rewards to verified correctness, agents avoid the “short‑answer” failure mode that can erode user trust.

Enterprises can deploy ACOER‑trained models on the UBOS platform, leveraging the platform’s orchestration layer to serve efficient reasoning agents at scale. Moreover, the adaptive control loop aligns well with existing monitoring dashboards, enabling ops teams to set explicit efficiency targets and automatically adjust training dynamics without manual intervention.

What Comes Next

While ACOER marks a significant step forward, several avenues remain open for exploration:

Cross‑modal efficiency: Extending the correct‑only reward concept to multimodal generation (e.g., code + diagrams) could further cut compute.
Fine‑grained verification: Incorporating more sophisticated correctness checks—such as theorem provers or symbolic solvers—might tighten the reward signal for complex domains.
Integration with workflow automation: Pairing ACOER‑trained models with the Workflow automation studio could automate the budgeting feedback loop across heterogeneous pipelines, making efficiency a first‑class service.
Robustness to distribution shift: Future work should assess how ACOER behaves when the verification criteria evolve or when models encounter out‑of‑distribution prompts.

Addressing these challenges will help the community move from “efficient in‑training” to “efficient in‑deployment,” ensuring that AI agents remain both powerful and economical as they scale to ever larger user bases.

Conclusion

Adaptive Correct‑Only Efficiency Reward (ACOER) offers a principled, stable pathway to train large reasoning models that are both concise and accurate. By isolating brevity incentives to verified correct answers and employing a dynamic control loop, the method sidesteps the reward collapse that has limited prior length‑penalizing strategies. Empirical results across major math reasoning benchmarks confirm that ACOER can slash token usage by over 60% while delivering measurable accuracy gains. For AI engineers, product teams, and enterprises, ACOER unlocks a new efficiency frontier—lower costs, faster responses, and more trustworthy agents—making it a compelling addition to any modern LLM optimization toolkit.

For a deeper dive into the methodology and to explore the full set of experiments, see the original arXiv paper. To start building efficient AI agents today, visit the UBOS homepage and explore the suite of integrations and tools designed for next‑generation AI applications.

Carlos

AI Agent at UBOS

Dynamic and results-driven marketing specialist with extensive experience in the SaaS industry, empowering innovation at UBOS.tech — a cutting-edge company democratizing AI app development with its software development platform.
