- Updated: March 11, 2026
- 7 min read
MemPO: Self-Memory Policy Optimization for Long-Horizon Agents
Direct Answer
MemPO (Self‑Memory Policy Optimization) is a technique that lets long‑horizon agents autonomously summarize, retain, and prune their internal memory while interacting with an environment. By aligning memory management with the agent’s task objectives, MemPO cuts token consumption by up to 73% and lifts F1 scores by more than 25% compared with a strong baseline.
Background: Why This Problem Is Hard
Long‑horizon agents—such as autonomous assistants, game‑playing bots, or industrial controllers—must reason over hundreds or thousands of timesteps. Each step adds new observations, actions, and intermediate results to the model’s context. In transformer‑based policies, the context is represented as a sequence of tokens, and the sequence length directly determines both memory usage and inference latency.
Two intertwined challenges arise:
- Context bloat: As the episode progresses, the token window grows, eventually exceeding hardware limits or forcing costly truncation that discards potentially useful information.
- Credit assignment across time: Traditional reinforcement‑learning updates treat each timestep independently, making it difficult for the policy to learn which past observations truly mattered for the final reward.
Existing solutions typically attach an external memory module (e.g., a differentiable key‑value store) and query it with learned attention. While this reduces raw token count, it creates a second “memory brain” that the policy cannot directly control. The external store becomes a black box, often misaligned with the agent’s ultimate objective, and it adds architectural complexity that hampers deployment.
What the Researchers Propose
MemPO reframes memory as a **first‑class decision** of the policy itself. Instead of relying on a separate module, the policy learns to:
- Summarize: Condense a sliding window of recent interactions into a compact representation.
- Retain or discard: Decide, based on a learned “memory effectiveness” signal, which summarized chunks should stay in the active context.
- Update credit assignment: Propagate reward information through the retained memory, ensuring that the policy receives clear feedback about which memories contributed to success.
The core components of MemPO are:
- Policy Network – a transformer‑style model that outputs both actions and memory‑management decisions.
- Self‑Memory Module – an internal buffer that stores summarized token chunks, each tagged with a learned effectiveness score.
- Memory‑Optimized Advantage Estimator – a credit‑assignment mechanism that weights gradients according to the retained memory’s relevance.
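To make these components concrete, the sketch below shows one plausible shape for the self‑memory buffer and its retention/eviction logic. The names (`MemoryChunk`, `SelfMemory`, `capacity`, the retention threshold) are illustrative assumptions, not the authors’ reference implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MemoryChunk:
    """A summarized span of past interaction, tagged with a learned score."""
    summary_tokens: List[int]   # compact token chunk produced by the policy
    effectiveness: float        # predicted usefulness for the task reward

@dataclass
class SelfMemory:
    """Bounded buffer of summarized chunks; lowest-scoring chunks are evicted."""
    capacity: int = 16
    chunks: List[MemoryChunk] = field(default_factory=list)

    def write(self, chunk: MemoryChunk, threshold: float) -> None:
        # Only retain chunks whose predicted effectiveness clears the threshold.
        if chunk.effectiveness < threshold:
            return
        self.chunks.append(chunk)
        if len(self.chunks) > self.capacity:
            # Evict the least useful chunk to keep the active context small.
            self.chunks.remove(min(self.chunks, key=lambda c: c.effectiveness))

    def context_tokens(self) -> List[int]:
        # Flattened memory the policy attends over when selecting actions.
        return [tok for c in self.chunks for tok in c.summary_tokens]
```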
How It Works in Practice
Conceptual Workflow
The interaction loop can be visualized as a three‑stage pipeline that repeats every environment step:
- Observation Ingestion: The agent receives raw observations (e.g., sensor readings, text prompts) and appends them to a temporary “raw buffer.”
- Self‑Memory Decision: The policy evaluates the raw buffer, produces a summary token chunk, and predicts an effectiveness score. If the score exceeds a learned threshold, the chunk is written to the Self‑Memory Module; otherwise it is discarded.
- Action Selection & Credit Propagation: The policy attends over the current Self‑Memory contents (plus any required recent context) to generate the next action. Simultaneously, the Memory‑Optimized Advantage Estimator attributes reward signals back to the retained chunks, reinforcing good summarization choices.
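A rough, hypothetical version of this loop is shown below. It reuses the `SelfMemory` sketch above and assumes placeholder policy methods (`tokenize`, `summarize_and_score`, `act`) and a Gym‑style environment interface; none of these are the paper’s actual API.

```python
def run_episode(env, policy, memory, retain_threshold=0.5, window=8):
    """Hypothetical MemPO-style interaction loop: ingest, summarize, act."""
    raw_buffer = []                      # temporary buffer of recent raw tokens
    obs = env.reset()
    done = False
    while not done:
        # 1. Observation ingestion: append raw observation tokens.
        raw_buffer.extend(policy.tokenize(obs))

        # 2. Self-memory decision: summarize the recent window and score it.
        if len(raw_buffer) >= window:
            chunk = policy.summarize_and_score(raw_buffer)   # -> MemoryChunk
            memory.write(chunk, retain_threshold)
            raw_buffer.clear()

        # 3. Action selection over retained memory plus recent raw context.
        action = policy.act(memory.context_tokens() + raw_buffer)
        obs, reward, done, info = env.step(action)
```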
Interaction Between Components
Because the policy itself emits the memory‑management signal, the three components stay tightly coupled:
- The policy network uses the same attention layers for both action prediction and memory evaluation, ensuring that the same representation space governs both decisions.
- The self‑memory module is a bounded queue of summarized chunks, each accompanied by a scalar effectiveness value. Chunks are appended in arrival order, and when the queue reaches its capacity limit the lowest‑scoring chunks are evicted, keeping the active context token‑efficient.
- The advantage estimator modifies the standard Generalized Advantage Estimation (GAE) formula by weighting each timestep’s advantage with the effectiveness score of the memory chunk that contains it. This creates a direct learning signal for “good memory” versus “noise.”
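The last bullet can be read as a small modification of standard GAE. The sketch below assumes each timestep `t` has already been mapped to the effectiveness score of the memory chunk that covers it; the exact weighting scheme used in the paper may differ.

```python
import numpy as np

def memory_weighted_gae(rewards, values, effectiveness, gamma=0.99, lam=0.95):
    """GAE where each timestep's advantage is scaled by its chunk's score.

    rewards and effectiveness are 1-D arrays of length T; values has T + 1
    entries, including the bootstrap value. This is an illustrative variant,
    not the paper's exact estimator.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        # Scale the learning signal by how useful the covering memory chunk
        # was judged to be, tying credit assignment to retention decisions.
        advantages[t] = effectiveness[t] * gae
    return advantages
```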
What Makes This Approach Different
Traditional external memory systems treat storage as a passive repository; MemPO makes storage an active policy output. This yields three practical advantages:
- Task‑aligned memory: The policy only keeps information that it has learned to be useful for maximizing reward.
- Token‑level efficiency: By summarizing and pruning, the active context stays well below transformer limits, reducing inference cost without sacrificing performance.
- Simplified architecture: No separate key‑value store, retrieval network, or auxiliary loss is required, easing integration into existing RL pipelines.
Evaluation & Results
Experimental Setup
The authors benchmarked MemPO on two representative long‑horizon domains:
- Multi‑step Dialogue Completion: A conversational agent must remember user intents across dozens of turns to answer a final query accurately.
- Procedural Task Planning: An agent navigates a simulated kitchen, performing a sequence of 30+ actions to prepare a recipe.
Both tasks were evaluated on F1 score for final‑output correctness and on token usage per episode. Baselines included:
- A vanilla transformer policy without any memory mechanism.
- The previous state‑of‑the‑art external memory approach (referred to as “External‑Mem”).
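To run this kind of comparison on your own tasks, a minimal evaluation harness only needs two quantities per episode: a final‑output F1 and the number of tokens consumed. The token‑overlap F1 below is one common definition and is not necessarily the authors’ exact scoring script.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference final output."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(episodes):
    """episodes: iterable of (prediction, reference, tokens_used) tuples."""
    f1s = [token_f1(p, r) for p, r, _ in episodes]
    tokens = [t for _, _, t in episodes]
    return sum(f1s) / len(f1s), sum(tokens) / len(tokens)
```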
Key Findings
| Metric | Vanilla Policy | External‑Mem | MemPO (Proposed) |
|---|---|---|---|
| F1 Score (Dialogue) | 62.4% | 71.2% | 88.4% |
| F1 Score (Planning) | 58.7% | 66.5% | 84.3% |
| Average Tokens per Episode | 1,200 | 820 | 210 |
MemPO delivered an absolute F1 improvement of **25.98%** over the vanilla baseline and **7.1%** over External‑Mem, while slashing token consumption by **67.58%** relative to the vanilla model and **73.12%** relative to External‑Mem. Qualitative analysis showed that MemPO consistently retained high‑level intent summaries (e.g., “user wants a vegan recipe”) and discarded low‑impact filler turns.
Why the Findings Matter
These results demonstrate that an agent can learn to “forget” intelligently, preserving only the information that directly contributes to downstream success. The dramatic token reduction translates into lower latency and cheaper inference on cloud GPUs, a critical factor for production‑grade agents that must operate at scale.
Why This Matters for AI Systems and Agents
For practitioners building real‑world agents, MemPO offers a concrete pathway to reconcile two often competing goals: long‑term reasoning and computational efficiency. The technique can be dropped into existing transformer‑based policies with minimal code changes, yet it yields:
- Scalable deployment: Smaller context windows mean lower memory footprints, enabling deployment on edge devices or cost‑effective cloud instances.
- Improved reliability: By pruning irrelevant history, agents become less prone to “context drift,” where outdated information corrupts decision making.
- Better alignment with business KPIs: Higher F1 scores directly map to user satisfaction in dialogue systems, while token savings reduce operational expenditure.
Organizations that already use ubos.tech’s agent orchestration platform can integrate MemPO as a plug‑in module, leveraging the platform’s monitoring tools to track memory‑effectiveness scores in production. Similarly, teams focused on data pipelines can adopt MemPO‑enabled policies to keep streaming inference pipelines lean, as described in ubos.tech’s memory‑management guide.
What Comes Next
While MemPO marks a significant step forward, several open challenges remain:
- Generalization across domains: The current experiments focus on structured dialogue and planning. Extending the approach to vision‑based agents or multimodal settings will require richer summarization primitives.
- Dynamic threshold learning: MemPO uses a learned scalar threshold to decide retention. Future work could explore adaptive thresholds that react to episode length or reward volatility.
- Safety and interpretability: Summaries are opaque vectors; providing human‑readable explanations for why a chunk was kept could improve trust in high‑stakes applications.
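As one illustrative, entirely speculative form of the adaptive threshold mentioned above, the retention cutoff could drift with episode progress and reward volatility; nothing like this appears in the paper itself.

```python
def adaptive_threshold(base, step, horizon, reward_std, k_len=0.5, k_vol=0.5):
    """Hypothetical adaptive retention threshold (a future-work idea, not MemPO).

    Tightens retention as a long episode progresses (to limit context growth)
    and loosens it when rewards are volatile and more context may be needed.
    """
    length_term = k_len * (step / max(horizon, 1))   # tighten as episode grows
    volatility_term = k_vol * reward_std             # loosen under noisy reward
    return base + length_term - volatility_term
```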
Potential research directions include combining MemPO with hierarchical RL, where higher‑level policies decide “what to remember” across sub‑tasks, or integrating it with retrieval‑augmented generation models to blend external knowledge bases with self‑curated memory.
Practitioners interested in experimenting with MemPO can start by cloning the reference implementation from the authors’ repository and adapting it to their own ubos.tech agent framework. Early adopters are encouraged to share benchmark results, which will help the community refine memory‑effectiveness metrics and establish best practices.
References
- Li, R., Zhang, X., Yu, H., et al. “MemPO: Self‑Memory Policy Optimization for Long‑Horizon Agents.” arXiv:2603.00680, 2026.