- Updated: March 11, 2026
- 7 min read
Words & Weights: Streamlining Multi-Turn Interactions via Co-Adaptation

Direct Answer
The paper introduces ROSA2 (Refine‑On‑Semantic‑And‑Adapt‑Weights), a test‑time co‑adaptation framework that simultaneously rewrites ambiguous user prompts and fine‑tunes model parameters during a multi‑turn conversation. By treating words and weights as a coupled optimization problem, ROSA2 reduces the number of interaction turns and the magnitude of weight updates needed to achieve high‑accuracy results, making real‑time LLM alignment more efficient and reliable.
Background: Why This Problem Is Hard
Large language models (LLMs) excel at zero‑shot reasoning, yet they still stumble when users pose vague or evolving queries. In production systems—customer‑support bots, code assistants, or tutoring agents—each turn of a conversation is an opportunity for the model to either clarify intent or demonstrate capability. Two dominant test‑time adaptation strategies have emerged:
- Prompt Engineering: Adjusting the textual context (e.g., adding clarifying instructions) while keeping the model weights frozen.
- Test‑Time Training (TTT): Updating the model’s parameters on‑the‑fly using gradient signals derived from the current interaction.
Both approaches treat the problem along a single axis. Prompt engineering can resolve ambiguity but cannot compensate for a model’s intrinsic knowledge gaps. Conversely, TTT can patch capability deficits but often fails when the input prompt is underspecified, leading to noisy gradients and unstable updates. In real‑world deployments, failures typically arise from a *mix* of semantic ambiguity and capability insufficiency, which single‑axis methods cannot address efficiently.
Moreover, test‑time weight updates are expensive: they consume compute, risk catastrophic forgetting, and may violate latency constraints. Prompt‑only solutions, while cheap, hit a ceiling when the underlying model lacks the requisite skill. The industry therefore needs a method that leverages the strengths of both worlds without their respective drawbacks.
What the Researchers Propose
ROSA2 reframes test‑time adaptation as a joint optimization over two heterogeneous spaces:
- Words – the textual prompt, system messages, and any clarifying follow‑ups that the agent can emit.
- Weights – the model’s internal parameters, which can be nudged using gradient descent during the interaction.
The core insight is that improving semantic clarity preconditions effective weight updates. ROSA2 decomposes the overall error signal into two components:
- A semantic gradient that points to the most informative wording changes needed to disambiguate the user’s intent.
- A capacity gradient that identifies the minimal parameter shift required to close the remaining performance gap.
By iteratively applying these gradients—first refining the prompt, then adjusting weights, and repeating as needed—ROSA2 converges faster and with smaller weight changes than any method that treats words or weights in isolation.
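The coupled view can be written compactly. The following is a hedged sketch, not the paper's exact notation: `p` is the prompt, `θ` the weights, and `y` stands for whatever supervision signal the evaluator provides at test time (for example, a correctness score over recent turns).

```latex
% Hedged sketch of the coupled objective; the paper's exact notation may differ.
\min_{p,\,\theta} \; \mathcal{L}\bigl(f_{\theta}(p),\, y\bigr),
\qquad
\underbrace{p_{t+1} = \operatorname*{arg\,min}_{p \in \mathcal{P}(p_t)} \mathcal{L}\bigl(f_{\theta_t}(p),\, y\bigr)}_{\text{semantic step}},
\qquad
\underbrace{\theta_{t+1} = \theta_t - \eta\, \nabla_{\theta}\, \mathcal{L}\bigl(f_{\theta_t}(p_{t+1}),\, y\bigr)}_{\text{capacity step}}.
```

Alternating the semantic step and the capacity step is what the paper describes as co‑adaptation: each refined prompt changes the loss landscape the weight update sees, and each weight update changes which prompt refinements are still worth making.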
How It Works in Practice
The ROSA2 workflow can be visualized as a loop that runs at each turn of a conversation (a minimal code sketch follows the list):
- Initial Turn: The user submits a query. The system generates a baseline response using the frozen LLM.
- Error Detection: A lightweight evaluator (e.g., a correctness classifier or a confidence estimator) flags the response as unsatisfactory.
- Semantic Refinement: Using the semantic gradient, ROSA2 proposes a revised prompt that adds clarifying questions, re‑phrases ambiguous terms, or injects domain‑specific cues.
- Weight Update: With the refined prompt, the system computes a capacity gradient on a small, on‑device dataset (often the recent interaction history) and performs a constrained parameter update (e.g., low‑rank adaptation or LoRA).
- Re‑generation: The LLM produces a new answer using the updated weights and refined prompt. The loop repeats until the evaluator signals success or a turn limit is reached.
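To make the loop above concrete, here is a minimal Python sketch of a per‑turn co‑adaptation driver. The function names (`generate`, `evaluate`, `refine_prompt`, `update_weights`) and the turn budget are illustrative assumptions, not the paper's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TurnResult:
    prompt: str
    response: str
    satisfied: bool
    turns_used: int

def co_adapt(
    prompt: str,
    generate: Callable[[str], str],              # call into the (possibly adapted) LLM
    evaluate: Callable[[str, str], bool],        # lightweight correctness / confidence check
    refine_prompt: Callable[[str, str], str],    # "semantic gradient": rewrite the ambiguous prompt
    update_weights: Callable[[str, str], None],  # "capacity gradient": constrained (e.g., low-rank) update
    max_turns: int = 4,
) -> TurnResult:
    """Hypothetical ROSA2-style loop: refine words, then weights, until the evaluator is satisfied."""
    response = generate(prompt)
    for turn in range(1, max_turns + 1):
        if evaluate(prompt, response):
            return TurnResult(prompt, response, True, turn)
        # 1) Semantic refinement: clarify intent before touching parameters.
        prompt = refine_prompt(prompt, response)
        # 2) Capacity update: small, constrained parameter shift on recent interaction history.
        update_weights(prompt, response)
        # 3) Re-generate with the refined prompt and updated weights.
        response = generate(prompt)
    return TurnResult(prompt, response, False, max_turns)
```

In a real deployment, `generate` would wrap the inference engine, `evaluate` the correctness classifier from the error‑detection step, and `update_weights` a low‑rank adapter step like the one sketched further below.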
Key differentiators of ROSA2 include:
- Bidirectional Influence: Prompt changes directly affect the gradient landscape for weight updates, and vice versa.
- Parameter‑Shift Minimization: The theoretical analysis guarantees that the required weight shift is strictly smaller than that of a pure TTT approach (stated compactly after this list).
- Turn‑Efficiency: By resolving ambiguity early, ROSA2 often needs fewer interaction cycles to reach a satisfactory answer.
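The parameter‑shift guarantee can be summarized in one line. This is a paraphrase of the claim, not the paper's exact theorem statement: for the same target accuracy,

```latex
\lVert \Delta\theta_{\text{ROSA2}} \rVert_2 \;<\; \lVert \Delta\theta_{\text{TTT}} \rVert_2 .
```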
Implementing ROSA2 in an existing LLM pipeline involves adding two lightweight modules—a semantic‑gradient generator and a capacity‑gradient updater—around the core inference engine. Because the weight updates are low‑rank, they can be executed on commodity GPUs or even edge accelerators without breaking latency Service Level Agreements (SLAs).
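As one concrete (and hypothetical) instantiation of the capacity‑gradient updater, a LoRA‑style adapter keeps the base weight frozen and learns only a low‑rank correction, so each per‑turn update touches a small number of parameters:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank correction: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights stay frozen at test time
        # Standard LoRA init: A small random, B zero, so the adapter starts as an identity correction.
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```

Only `lora_A` and `lora_B` would receive capacity‑gradient updates during the interaction, which is what keeps the parameter shift small and the update cheap enough for commodity hardware.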
Evaluation & Results
The authors evaluate ROSA2 on the paper’s benchmark suite, focusing on the MATH dataset, a collection of multi‑step mathematical reasoning problems that stress both prompt clarity and model capability.
Experimental Setup
- Baselines: Standard Prompt Engineering (PE), pure Test‑Time Training (TTT), and a hybrid that applies PE followed by TTT sequentially.
- Metrics: Accuracy (percentage of fully correct solutions), average number of interaction turns, and average parameter shift measured as the L2 norm of the weight change (see the measurement sketch after this list).
- Compute Budget: All methods were constrained to the same wall‑clock time per query (≈ 500 ms) to ensure a fair latency comparison.
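The parameter‑shift metric is straightforward to reproduce: snapshot the weights before adaptation and take the L2 norm of the difference afterwards. A minimal sketch, assuming a PyTorch model and not the authors' evaluation code:

```python
import torch

def parameter_shift_l2(before: dict[str, torch.Tensor], after: dict[str, torch.Tensor]) -> float:
    """L2 norm of the concatenated weight difference between two state_dict snapshots."""
    total = 0.0
    for name, w_before in before.items():
        w_after = after[name]
        total += torch.sum((w_after.float() - w_before.float()) ** 2).item()
    return total ** 0.5

# Usage (hypothetical `model`):
#   snapshot = {k: v.detach().clone() for k, v in model.state_dict().items()}
#   ... run test-time adaptation ...
#   shift = parameter_shift_l2(snapshot, model.state_dict())
```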
Key Findings
| Method | Accuracy | Avg. Turns | Avg. Parameter Shift (L2) |
|---|---|---|---|
| Prompt Engineering | 62 % | 3.8 | 0.0 |
| Test‑Time Training | 68 % | 4.2 | 0.45 |
| Sequential PE → TTT | 71 % | 3.6 | 0.38 |
| ROSA2 (co‑adaptation) | 92 % | 2.3 | 0.21 |
ROSA2 achieved a 21‑point absolute accuracy gain over the strongest baseline (Sequential PE → TTT) while cutting the average number of turns from 3.6 to 2.3, a reduction of roughly a third. Crucially, the required parameter shift (0.21) was less than half of what pure TTT needed (0.45), consistent with the theoretical claim that semantic refinement reduces the burden on weight updates.
Additional ablation studies showed that removing either the semantic or the capacity component caused performance to drop back to baseline levels, underscoring the necessity of the joint optimization.
Why This Matters for AI Systems and Agents
For practitioners building conversational agents, ROSA2 offers a pragmatic path to higher reliability without sacrificing latency:
- Reduced User Friction: Fewer clarification turns translate directly into smoother user experiences, especially in time‑sensitive domains like finance or healthcare.
- Lower Compute Costs: Smaller weight updates mean less GPU memory churn and lower energy consumption, which is attractive for large‑scale deployments.
- Improved Safety: By grounding weight updates in clarified prompts, the risk of unintended behavior caused by noisy gradients is mitigated.
- Modular Integration: ROSA2 can be layered on top of existing agent orchestration platforms, allowing teams to adopt co‑adaptation without redesigning their entire stack.
In the context of LLM pipeline integration, ROSA2 acts as a plug‑in that intercepts the inference call, performs a quick semantic analysis, and optionally triggers a low‑rank adaptation step. This fits neatly into CI/CD workflows for model updates, enabling continuous improvement based on live user interactions.
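A minimal sketch of how such a plug‑in could wrap an existing inference call follows; the class and method names are illustrative, not a real orchestration API. Unlike the full per‑turn loop above, this wrapper performs at most one refine‑and‑adapt pass per call to stay within latency budgets:

```python
from typing import Callable

class CoAdaptationPlugin:
    """Hypothetical wrapper: intercept inference, check the output, optionally refine and adapt."""
    def __init__(
        self,
        infer: Callable[[str], str],
        needs_adaptation: Callable[[str, str], bool],
        refine_prompt: Callable[[str, str], str],
        apply_lora_step: Callable[[str, str], None],
    ):
        self.infer = infer
        self.needs_adaptation = needs_adaptation
        self.refine_prompt = refine_prompt
        self.apply_lora_step = apply_lora_step

    def __call__(self, prompt: str) -> str:
        response = self.infer(prompt)
        if self.needs_adaptation(prompt, response):
            refined = self.refine_prompt(prompt, response)   # quick semantic analysis / rewrite
            self.apply_lora_step(refined, response)          # optional low-rank adaptation step
            response = self.infer(refined)                   # single re-generation to respect the SLA
        return response
```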
What Comes Next
While ROSA2 marks a significant step forward, several open challenges remain:
- Scalability to Larger Models: The current experiments used a 13‑billion‑parameter backbone. Extending co‑adaptation to 100‑billion‑parameter systems will require more efficient gradient approximation techniques.
- Generalization Across Domains: The MATH benchmark stresses reasoning; other domains (e.g., legal text, code synthesis) may exhibit different ambiguity‑capacity trade‑offs that need domain‑specific semantic gradient designs.
- Robustness to Adversarial Prompts: An attacker could craft inputs that manipulate the semantic refinement loop, potentially steering weight updates toward malicious behavior. Defensive mechanisms are an important research direction.
- Human‑in‑the‑Loop Extensions: Incorporating real user feedback (e.g., thumbs‑up/down) as part of the error signal could further tighten the co‑adaptation loop, but raises privacy and latency considerations.
Future work may explore hierarchical co‑adaptation, where higher‑level policy modules decide when to invoke semantic refinement versus weight adaptation, or meta‑learning approaches that pre‑train the system to predict optimal co‑adaptation schedules.
Overall, ROSA2 demonstrates that treating language and parameters as mutually reinforcing levers unlocks a new efficiency frontier for test‑time LLM alignment. As enterprises continue to embed LLMs into mission‑critical workflows, frameworks that can adapt on the fly while preserving safety and cost‑effectiveness will become indispensable.