- Updated: March 11, 2026
- 7 min read
Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning
Direct Answer
The paper introduces EfficientZero‑Multitask (EZ‑M), a model‑based reinforcement learning (MBRL) algorithm that scales the number of tasks rather than the number of samples per task to achieve rapid, sample‑efficient mastery of whole‑body humanoid control. By sharing a single dynamics model across diverse tasks, EZ‑M dramatically reduces the data required for learning complex motor skills, positioning task scaling as a new lever for building generalist robotic agents.
Background: Why This Problem Is Hard
Robotics researchers have long chased the vision of a single robot that can learn a wide array of motor skills—walking, climbing, object manipulation—without hand‑crafted controllers for each behavior. The difficulty stems from three intertwined challenges:
- Sample inefficiency: Real‑world robots can only collect a few thousand interaction steps before wear‑and‑tear or safety concerns become prohibitive. Model‑free RL methods typically need millions of steps per task, making them impractical to train directly on physical hardware.
- Task interference: When a single neural network is trained on multiple objectives, gradients from conflicting tasks can cancel each other out, leading to unstable learning or sub‑optimal policies.
- Dynamics invariance: Physical laws (gravity, joint limits, contact dynamics) are constant across tasks, yet most existing approaches treat each task in isolation, missing the opportunity to reuse this shared knowledge.
Recent trends in large‑scale AI—scaling model parameters and offline datasets—have shown impressive results in language and vision, but they rely on massive static corpora. In robotics, the data must be generated online, and the cost of scaling samples is far higher. Consequently, the community has been searching for a more efficient axis of scaling that respects the constraints of embodied learning.
What the Researchers Propose
EZ‑M reframes the scaling problem: instead of gathering more experience per task, it expands the *variety* of tasks presented to a single learning system. The core hypothesis is that a shared world model can aggregate experience from many tasks, learning a richer, task‑agnostic representation of the robot’s dynamics. This representation then serves as a stable foundation for multiple downstream policies.
The framework consists of three key components:
- Unified Dynamics Model: A neural network that predicts next‑state distributions given the current state and action, trained on data pooled from all tasks.
- Task‑Specific Planner: For each task, a lightweight planning module (e.g., Monte‑Carlo Tree Search) queries the shared model to generate short‑horizon action sequences that maximize a task‑specific reward.
- Online Data Scheduler: An orchestrator that selects which task to sample next, balancing exploration across tasks to keep the dynamics model well‑conditioned.
By decoupling dynamics learning from policy optimization, EZ‑M sidesteps the gradient interference that plagues model‑free multitask RL. The dynamics model benefits from the diversity of experiences, while each planner remains focused on its own objective.
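To make the decoupling concrete, here is a minimal sketch of the first component in plain Python. All class and method names are invented for exposition, and a least‑squares linear model stands in for the paper's neural dynamics network; the key point it illustrates is that the model is trained on transitions pooled from every task, with no reward signal attached:

```python
import numpy as np

class PooledReplayBuffer:
    """Stores (state, action, next_state) transitions from every task in one buffer."""
    def __init__(self):
        self.data = []

    def add(self, task_id, s, a, s_next):
        self.data.append((task_id, s, a, s_next))

    def all_transitions(self):
        # Dynamics learning ignores task identity: the physics is shared.
        S = np.array([s for _, s, _, _ in self.data])
        A = np.array([a for _, _, a, _ in self.data])
        S_next = np.array([sn for _, _, _, sn in self.data])
        return S, A, S_next

class SharedDynamicsModel:
    """Linear stand-in for the unified dynamics model: predicts the next
    state from (state, action), fit on data pooled across all tasks."""
    def __init__(self, state_dim, action_dim):
        self.W = np.zeros((state_dim + action_dim, state_dim))

    def fit(self, S, A, S_next):
        # Least-squares fit in place of gradient descent on a neural net.
        X = np.hstack([S, A])
        self.W, *_ = np.linalg.lstsq(X, S_next, rcond=None)

    def predict(self, s, a):
        return np.concatenate([s, a]) @ self.W
```

Because `fit` never sees a reward, data from tasks with opposite objectives can share one buffer without conflict; only the per‑task planners consume rewards.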
How It Works in Practice
The operational flow of EZ‑M can be visualized as a loop that repeats until the robot reaches a predefined performance threshold:
- Task Selection: The scheduler picks a task (e.g., “stand up”, “walk forward”, “turn left”) based on a curiosity‑driven metric that favors under‑explored regions of the state‑action space.
- Planning Phase: The task‑specific planner runs a short horizon search using the current dynamics model to propose a candidate action sequence.
- Execution & Data Capture: The robot executes the first action of the sequence, observes the resulting state, and records the transition (state, action, next‑state, reward).
- Model Update: The unified dynamics model is updated with the new transition, using a replay buffer that mixes data from all tasks.
- Policy Refinement: The planner’s internal value estimates are refreshed using the updated model, preparing it for the next planning iteration.
What distinguishes this pipeline from conventional model‑free multitask RL is the *single source of truth* for physics: the dynamics model. Because the model is never tied to a specific reward signal, it can safely ingest data from tasks that demand opposite actions in similar states (e.g., “move left” vs. “move right”). This eliminates the destructive interference that would otherwise degrade learning.
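The loop above can be sketched end to end in a toy one‑dimensional setting. Everything here is an illustrative simplification, not the paper's implementation: a count‑based heuristic stands in for the curiosity‑driven scheduler, random‑shooting search stands in for the tree‑search planner, and a least‑squares refit stands in for neural‑network training. Two tasks with opposite rewards ("walk_forward" and "walk_backward") share one model of the same toy physics:

```python
import numpy as np

rng = np.random.default_rng(0)
visit_counts = {"walk_forward": 0, "walk_backward": 0}
buffer = []            # pooled (s, a, s_next) transitions from all tasks
W = np.zeros(2)        # linear stand-in for the dynamics model: s' = W @ [s, a]

def true_step(s, a):
    # Hidden toy physics, shared by both tasks.
    return 0.9 * s + 0.5 * a

def reward(task, s):
    # Opposite objectives over identical dynamics.
    return s if task == "walk_forward" else -s

def plan(task, s, horizon=3, n_candidates=128):
    """Random-shooting planner that queries only the shared model."""
    best_a, best_ret = 0.0, -np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1, 1, size=horizon)
        sim_s, ret = s, 0.0
        for a in seq:
            sim_s = W @ np.array([sim_s, a])   # imagined rollout step
            ret += reward(task, sim_s)
        if ret > best_ret:
            best_ret, best_a = ret, seq[0]
    return best_a

s = 0.0
for step in range(200):
    # 1. Task selection: least-visited task, a crude proxy for curiosity.
    task = min(visit_counts, key=visit_counts.get)
    visit_counts[task] += 1
    # 2-3. Plan with the shared model, execute the first action, record data.
    a = plan(task, s)
    s_next = true_step(s, a)
    buffer.append((s, a, s_next))
    s = s_next
    # 4. Model update: refit on pooled, reward-free data from both tasks.
    X = np.array([[si, ai] for si, ai, _ in buffer])
    y = np.array([sn for _, _, sn in buffer])
    W, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Even though the two tasks push the planner toward opposite actions, the model update sees only reward‑free transitions, so the fitted `W` converges to the true coefficients `(0.9, 0.5)` regardless of which task generated each transition.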
Evaluation & Results
To validate EZ‑M, the authors introduced HumanoidBench, a suite of whole‑body control tasks built on a high‑fidelity simulated humanoid. The benchmark includes ten distinct objectives ranging from basic standing to complex acrobatic maneuvers, each requiring coordinated control of over 30 joints.
Key experimental findings:
- Sample Efficiency: EZ‑M reached expert‑level performance on 8 out of 10 tasks using fewer than 2 million environment steps, a 4‑fold reduction compared to the strongest model‑free baseline.
- Scalability with Task Count: Adding more tasks consistently improved the dynamics model’s prediction accuracy, even when the total number of samples remained constant.
- Robustness to Conflict: In paired tasks with opposite optimal actions (e.g., “walk forward” vs. “walk backward”), EZ‑M maintained stable learning curves, whereas model‑free multitask agents exhibited severe performance drops.
- Parameter Economy: EZ‑M achieved these gains without resorting to billion‑parameter networks; a modest 120 M‑parameter model sufficed, underscoring that task diversity, not sheer size, drives the improvements.
These results collectively demonstrate that task scaling can serve as a practical lever for accelerating embodied learning, confirming the authors’ hypothesis that shared dynamics are a natural regularizer for MBRL.
Why This Matters for AI Systems and Agents
For practitioners building real‑world robotic platforms, EZ‑M offers a concrete pathway to reduce the costly data collection phase that traditionally dominates development timelines. By leveraging a single dynamics model across many behaviors, engineers can:
- Deploy new skills with minimal additional training, accelerating product iteration cycles.
- Maintain a unified simulation‑to‑real pipeline, since the dynamics model can be fine‑tuned with a few real‑world rollouts and instantly benefit all existing policies.
- Integrate with existing orchestration frameworks that schedule tasks, enabling automated curriculum generation at scale.
In the broader AI ecosystem, the paper’s insights align with emerging trends in modular robotics platforms, where interchangeable skill modules rely on a common perception‑action backbone. EZ‑M’s task‑centric planning also dovetails with multi‑agent coordination strategies, suggesting that a shared world model could serve multiple agents operating in the same environment without redundant learning.
What Comes Next
While EZ‑M marks a significant step forward, several open challenges remain:
- Real‑World Transfer: The current evaluation is confined to simulation. Bridging the sim‑to‑real gap will require robust domain randomization or online adaptation mechanisms.
- Task Definition Granularity: Determining the optimal granularity of tasks (fine‑grained primitives vs. coarse‑grained goals) is still an open research question that impacts both data efficiency and planner complexity.
- Scalable Planning: As the number of tasks grows, the planner’s computational budget may become a bottleneck. Hierarchical or learned planners could alleviate this pressure.
- Safety Guarantees: In safety‑critical deployments, ensuring that the shared dynamics model does not propagate erroneous predictions across tasks is essential.
Future work may explore integrating EZ‑M with multi‑agent orchestration layers that dynamically allocate tasks based on robot health, environmental context, or user intent. Additionally, extending the framework to heterogeneous robot fleets—where each robot shares a common dynamics core but differs in morphology—could unlock unprecedented levels of cross‑platform learning.
References
For a complete technical description, see the original preprint: Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi‑Task Model‑Based Reinforcement Learning.