- Updated: March 11, 2026
- 6 min read
Dual‑Horizon Credit Assignment: Harmonizing Dense and Sparse Signals in Multi‑turn RL for Industrial Sales Agents
Direct Answer
The paper introduces Dual‑Horizon Credit Assignment (DuCA), a reinforcement‑learning framework that separates short‑term turn‑level signals from long‑term session‑level objectives when training large‑language‑model (LLM) sales agents. By normalizing advantages independently for each horizon, DuCA prevents high‑value session rewards from drowning out subtle linguistic cues, leading to more stable training and noticeably better commercial performance.
Background: Why This Problem Is Hard
Industrial sales conversations are a multi‑turn dance. An agent must stay compliant, sound natural, and keep the dialogue flowing (turn‑level goals) while simultaneously steering the interaction toward a conversion, upsell, or contract renewal (session‑level goals). Traditional reinforcement learning (RL) pipelines collapse these heterogeneous objectives into a single scalar reward. In practice this creates two major pain points:
- Reward imbalance: Session‑level metrics such as conversion rate often have magnitudes that dwarf turn‑level rewards for fluency or compliance. The optimizer therefore over‑fits to the big signal, ignoring linguistic quality.
- Credit‑assignment blur: When a conversion finally happens after many turns, it is unclear which specific utterances contributed most. Naïve credit assignment attributes the entire reward to the final action, encouraging “reward hacking” (e.g., spamming sales pitches).
Existing approaches—like standard Proximal Policy Optimization (PPO) with a single reward, or hierarchical RL that treats the whole session as a macro‑action—either suffer from unstable gradients or require hand‑crafted sub‑policies that do not scale to the nuanced language generation required in B2B sales.
What the Researchers Propose
DuCA tackles the imbalance by introducing a two‑track credit‑assignment pipeline:
- Turn‑level advantage stream: Computes advantages based on immediate linguistic rewards (e.g., fluency, compliance, relevance).
- Session‑level advantage stream: Computes advantages from strategic outcomes (e.g., conversion, revenue uplift).
The core innovation, Horizon‑Independent Advantage Normalization (HIAN), normalizes each advantage stream separately before merging them. This ensures that gradients from both horizons have comparable scale, allowing the policy update to respect both short‑term language quality and long‑term business objectives.
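The paper's exact equations are not reproduced here, but the core idea of HIAN is simple enough to sketch. Below is a minimal NumPy illustration under stated assumptions: the function names (`normalize`, `hian_fuse`) and the equal‑weight sum at the end are ours, not the authors'.

```python
import numpy as np

def normalize(adv: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standardize one advantage stream to zero mean, unit variance."""
    return (adv - adv.mean()) / (adv.std() + eps)

def hian_fuse(turn_adv: np.ndarray, session_adv: np.ndarray) -> np.ndarray:
    """Normalize each horizon independently, then merge (equal weights assumed).

    Normalizing before fusion keeps a large-magnitude session reward
    (e.g., revenue) from drowning out small turn-level signals.
    """
    return normalize(turn_adv) + normalize(session_adv)

# Toy example: session advantages are ~4 orders of magnitude larger.
turn_adv = np.array([0.02, -0.01, 0.03, -0.02])
session_adv = np.array([250.0, -180.0, 900.0, -40.0])
print(hian_fuse(turn_adv, session_adv))  # both horizons at comparable scale
```

Normalizing first and fusing second is what makes the method scale‑agnostic: the fused advantage is insensitive to whether revenue is logged in dollars or in millions.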
How It Works in Practice
Conceptual Workflow
The training loop for a DuCA‑enabled sales agent proceeds as follows (a code sketch follows the list):
- Interaction generation: The LLM produces a response given the current dialogue context.
- Reward evaluation: Two parallel evaluators assign:
  - Turn‑level reward (e.g., compliance score, language smoothness).
  - Session‑level reward (e.g., conversion flag, projected revenue).
- Advantage calculation: For each horizon, the algorithm computes an advantage (the observed return minus a baseline estimate).
- HIAN processing: Each advantage vector is independently normalized (zero‑mean, unit‑variance) to remove magnitude bias.
- Fusion and policy update: The normalized advantages are summed, producing a balanced gradient that updates the LLM’s policy via a PPO‑style optimizer.
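Tying these steps together, a PPO‑style update on the fused advantage might look like the sketch below. This is our reading, not the authors' code: `duca_update` is a hypothetical helper, `logps_new` is assumed to be differentiable log‑probabilities from the current policy, and the per‑horizon advantages are assumed to come from step 3.

```python
import torch

def duca_update(optimizer, logps_new, logps_old, turn_adv, session_adv,
                clip_eps=0.2, eps=1e-8):
    """One PPO-style policy update driven by fused dual-horizon advantages."""
    def norm(a):
        # Step 4 (HIAN): standardize each horizon independently.
        return (a - a.mean()) / (a.std() + eps)

    # Step 5: fuse the normalized streams into one balanced advantage.
    adv = (norm(turn_adv) + norm(session_adv)).detach()

    # Standard clipped PPO surrogate, applied to the fused advantage.
    ratio = torch.exp(logps_new - logps_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    loss = -torch.min(ratio * adv, clipped * adv).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that only the advantage computation changes relative to vanilla PPO; the surrogate objective and optimizer loop are untouched.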
Component Interaction
Conceptually, the architecture has three blocks: the LLM Policy, the Dual Reward Modules, and the HIAN Normalizer. Data flows from the policy to the reward modules, then back through HIAN to the optimizer. Separating the reward streams is the only architectural change; the underlying LLM and PPO optimizer remain untouched, making it easy to integrate into existing pipelines.
What Sets DuCA Apart
- Scale‑agnostic normalization: HIAN works regardless of the absolute magnitude of rewards, making it robust to domain‑specific scaling choices.
- Minimal engineering overhead: No need for hierarchical policies or additional sub‑networks; the framework plugs into any RL‑fine‑tuned LLM.
- Explicit credit separation: By keeping advantage streams distinct until the final fusion, the method preserves interpretability—practitioners can inspect which horizon drove a particular policy change.
Evaluation & Results
Testbed Overview
The authors built a high‑fidelity user simulator that mimics B2B sales dialogues, including compliance constraints, product knowledge, and realistic buyer personas. The simulator provides both turn‑level feedback (e.g., language appropriateness) and session‑level outcomes (e.g., whether the buyer signs a contract).
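The simulator itself is not described at implementation level (and, to our knowledge, not released), but the interface it implies (per‑turn feedback plus an end‑of‑session outcome) can be captured in a small stub. The class and field names below are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class TurnFeedback:
    compliance: float        # e.g., 1.0 when no policy violation occurred
    fluency: float           # language-quality score in [0, 1]

@dataclass
class SessionOutcome:
    converted: bool          # did the simulated buyer sign a contract?
    projected_revenue: float

class BuyerSimulator:
    """Stub of a B2B buyer simulator emitting dual-horizon feedback."""

    def step(self, agent_utterance: str) -> tuple[str, TurnFeedback]:
        # A real simulator would condition on persona, product knowledge,
        # and dialogue history; this stub returns fixed values.
        reply = "Can you walk me through the pricing tiers?"
        return reply, TurnFeedback(compliance=1.0, fluency=0.9)

    def finish(self) -> SessionOutcome:
        return SessionOutcome(converted=False, projected_revenue=0.0)
```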
Baselines
DuCA was compared against:
- GRPO (Group Relative Policy Optimization): A state‑of‑the‑art RL method that aggregates rewards into a single scalar.
- Standard PPO with weighted sum: The classic approach in which engineers manually tune coefficients for turn‑ vs. session‑level rewards (a toy snippet follows this list).
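For contrast, the weighted‑sum baseline collapses both horizons into a single scalar before training, so the coefficients (hypothetical values below) must be re‑tuned whenever reward magnitudes change:

```python
# Weighted-sum baseline: one hand-tuned scalar reward per action.
W_TURN, W_SESSION = 1.0, 0.01   # brittle: tied to the rewards' absolute scale

def scalar_reward(turn_r: float, session_r: float) -> float:
    return W_TURN * turn_r + W_SESSION * session_r
```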
Key Findings
| Metric | GRPO | Weighted PPO | DuCA (proposed) |
|---|---|---|---|
| Conversion Rate (↑) | +0.00% (baseline) | +3.1% | +6.8% |
| Inter‑sentence Repetition (↓) | −45.2% | −68.9% | −82.3% |
| Identity Detection (↓) | −12.4% | −19.8% | −27.4% |
Beyond raw numbers, qualitative analysis showed that DuCA‑trained agents maintained a more natural conversational flow while still aggressively pursuing conversion goals. The reduction in repetition indicates that the turn‑level language model retained its generative diversity, a common failure mode when session rewards dominate.
Why This Matters for AI Systems and Agents
For practitioners building conversational sales assistants, DuCA offers a pragmatic path to reconcile two historically competing objectives:
- Business impact: Higher conversion rates translate directly to revenue, a critical KPI for any sales‑focused deployment.
- Customer experience: Lower repetition and better compliance improve buyer trust, reducing churn and negative brand perception.
- Operational efficiency: Because DuCA works with existing PPO pipelines, teams can adopt it without re‑architecting their model stacks, shortening time‑to‑value.
In the broader AI‑agent ecosystem, the dual‑horizon perspective can be generalized to any multi‑step task where immediate interaction quality and long‑term outcome diverge—think customer support, tutoring, or autonomous negotiation.
For more on integrating advanced RL techniques into production pipelines, see our research hub and about page for background on our engineering philosophy.
What Comes Next
While DuCA marks a significant step forward, several open challenges remain:
- Dynamic horizon weighting: The current implementation treats both horizons equally after normalization. Future work could learn a context‑aware weighting factor that emphasizes long‑term goals only once the conversation reaches a certain stage (a speculative sketch follows this list).
- Real‑world deployment validation: The paper’s results rely on a simulated buyer. Field trials with live sales teams will surface practical concerns such as latency, integration with CRM systems, and regulatory compliance.
- Multi‑objective extensions: Beyond conversion and language quality, enterprises care about metrics like average handle time, cross‑sell ratio, and sentiment. Extending DuCA to handle more than two horizons will test the scalability of HIAN.
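As a purely speculative illustration of the first item (nothing like this appears in the paper), a small learned gate over the dialogue state could replace the fixed equal weighting:

```python
import torch
import torch.nn as nn

class HorizonGate(nn.Module):
    """Hypothetical context-aware mixer for the two advantage streams."""

    def __init__(self, state_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(state_dim, 1), nn.Sigmoid())

    def forward(self, state, turn_adv, session_adv):
        # w near 1 late in a conversation would emphasize session goals.
        w = self.gate(state).squeeze(-1)
        return (1.0 - w) * turn_adv + w * session_adv
```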
We anticipate that the next generation of sales agents will combine DuCA with retrieval‑augmented generation and real‑time analytics, creating a feedback loop where live performance data continuously refines both horizons.
Interested in collaborating or learning more about how DuCA can be adapted to your product line? Reach out through our contact page.