✨ From vibe coding to vibe deployment. UBOS MCP turns ideas into infra with one message.

Learn more
Carlos
  • Updated: June 26, 2026
  • 7 min read

Training the Orchestrator: A Supervised Approach to End-to-End PDDL Planning with LLM Agents

Direct Answer

The paper introduces HALO (Hybrid Agent‑Learned Orchestrator), a supervised‑learning framework that trains a lightweight orchestrator to drive end‑to‑end PDDL planning using a pool of specialized LLM agents. By leveraging verifier‑approved refinement trajectories as direct supervision, HALO cuts orchestration costs by more than an order of magnitude while matching or surpassing the success rates of heavyweight frontier‑LLM baselines.

{{IMAGE}}

Background: Why This Problem Is Hard

Classical planners excel at generating provably correct plans from formal specifications written in the Planning Domain Definition Language (PDDL). In practice, however, users express goals in natural language, creating a translation gap that has persisted for decades. Bridging this gap requires two intertwined capabilities:

  • Semantic grounding: converting ambiguous, context‑rich utterances into precise PDDL predicates and objects.
  • Verification: ensuring that the generated plan satisfies all domain constraints and reaches the intended goal state.

Recent “agentic” pipelines address the first capability by deploying a collection of LLM‑based repair agents that iteratively refine a draft plan. A central orchestrator decides which agent to invoke at each step, and a verifier checks the intermediate result. The dominant implementations rely on prompting a frontier‑model (e.g., GPT‑5‑mini, Gemini‑3‑Flash) for every orchestration decision. This approach suffers from three critical drawbacks:

  1. Cost explosion: each API call incurs monetary and latency overhead, making large‑scale or real‑time deployment prohibitive.
  2. Sparse learning signal: prior work treats the entire episode as a single reward, offering little guidance for the orchestrator to improve its step‑by‑step policy.
  3. Scalability bottleneck: as the number of specialized agents grows, the combinatorial decision space overwhelms a pure prompting strategy.

These limitations are especially acute for enterprise AI platforms that must balance accuracy, speed, and cost while supporting dozens of planning domains.

What the Researchers Propose

HALO reframes orchestration as a supervised learning problem. Instead of learning from sparse episode‑level rewards, the authors collect full refinement trajectories that the verifier has already accepted as valid. Each trajectory consists of a sequence of (state, selected‑agent) pairs, providing a dense, high‑quality training signal.

The proposed framework consists of three core components:

  • Verifier‑guided dataset: an external verifier runs a traditional PDDL planner on the natural‑language goal, then records every successful (state, agent) decision made during the refinement loop.
  • Hybrid orchestrator: a small policy network fine‑tuned with QLoRA (quantized low‑rank adaptation) that predicts the next agent given the current state representation.
  • Rule‑based shortcuts: three deterministic rules handle trivially decidable selections (e.g., “no‑op” when the goal is already satisfied), ensuring that the learned policy focuses on the genuinely ambiguous decisions.

By training on 11 classic PDDL domains and expanding the action space to 21 specialized agents, HALO learns a generalizable orchestration strategy that can be applied to new domains with minimal adaptation.

The authors also release the full training pipeline and a benchmark suite (PlanBench, Natural Plan) to facilitate reproducibility.

For the full technical description, see the HALO paper.

How It Works in Practice

Conceptual Workflow

The end‑to‑end planning process under HALO follows a clear, repeatable loop:

  1. Goal ingestion: a user submits a natural‑language objective (e.g., “move all boxes to the loading dock”).
  2. Initial grounding: a lightweight language model extracts candidate objects, predicates, and a rough PDDL skeleton.
  3. Orchestrator decision: the hybrid policy receives the current partial plan and state, then selects one of the 21 repair agents.
  4. Agent execution: the chosen agent proposes a concrete modification (e.g., “add action Move(Box1, Dock)”).
  5. Verifier check: the verifier runs a fast PDDL consistency check. If the plan remains valid, the loop proceeds; otherwise, the orchestrator receives a negative signal and selects a different agent.
  6. Termination: when the verifier confirms that the goal state is achieved, the final plan is returned to the user.

Interaction Between Components

HALO’s policy network encodes the current planning state using a combination of symbolic features (e.g., number of unsatisfied goals) and dense embeddings from the language model. The rule‑based shortcuts intercept obvious cases—such as when the verifier reports “goal already satisfied”—bypassing the policy entirely. This hybrid approach reduces unnecessary LLM calls while preserving flexibility for complex decisions.

Key Differentiators

  • Supervised trajectory learning: every training example is a verified correct decision, eliminating the need for trial‑and‑error exploration.
  • Cost‑effective policy size: the QLoRA‑tuned model fits in a few hundred megabytes, dramatically cheaper than invoking a full‑scale frontier LLM at each step.
  • Scalable agent pool: the 21‑agent action space is hand‑crafted to cover a wide range of repair operations, yet the policy learns to prioritize them without exhaustive search.

Evaluation & Results

Benchmarks and Scenarios

The authors evaluated HALO on three complementary suites:

  • PlanBench: a collection of 11 classic PDDL domains (e.g., Blocksworld, Logistics) with varying difficulty levels.
  • Natural Plan: a dataset of natural‑language goals paired with ground‑truth PDDL specifications, testing the full translation pipeline.
  • Classical planning benchmarks: standard IPC (International Planning Competition) instances to gauge raw planning competence.

Key Findings

Across all benchmarks, HALOR’s performance can be summarized as follows:

  • Success rate parity: HALO matches the GPT‑5‑mini prompted baseline and stays within three percentage points of the stronger Gemini‑3‑Flash baseline.
  • Cost reduction: average orchestration cost drops from $0.18 per task (GPT‑5‑mini) to $0.004 per task—a 45× saving. Compared with Gemini‑3‑Flash, HALO is roughly 15× cheaper.
  • LLM call efficiency: total calls per episode decrease by 40‑50 %, directly translating to lower latency and higher throughput.
  • Generalization: when evaluated on unseen domains, HALO retains >90 % of its success rate, indicating that the supervised trajectory approach captures domain‑agnostic orchestration principles.

These results demonstrate that a modestly sized, supervised policy can replace heavyweight prompting without sacrificing planning quality, opening the door to production‑grade deployment.

Why This Matters for AI Systems and Agents

For AI practitioners building autonomous agents, HALO offers a concrete pathway to reduce operational expenses while preserving reliability. The ability to orchestrate a diverse set of specialized LLM agents with a cheap, trainable policy means that enterprises can:

  • Scale planning services to thousands of concurrent requests without exploding cloud bills.
  • Integrate verification loops directly into their agent pipelines, guaranteeing that generated actions respect safety constraints.
  • Reuse the same orchestrator across multiple domains, accelerating time‑to‑market for new planning‑centric products.

These advantages align closely with the goals of the AI marketing agents offering, where cost‑effective orchestration of content‑generation and personalization modules is a competitive differentiator. By adopting HALO‑style supervision, developers can shift from ad‑hoc prompting to a disciplined, data‑driven control plane.

What Comes Next

While HALO marks a significant step forward, several open challenges remain:

  • Domain expansion: extending the agent pool beyond 21 hand‑crafted modules will require automated discovery of repair primitives.
  • Dynamic verification: integrating richer, probabilistic verifiers could enable planning under uncertainty, a common scenario in robotics and logistics.
  • Cross‑modal goals: handling multimodal inputs (e.g., images, sensor data) will demand tighter coupling between perception models and the orchestrator.

Future research may explore meta‑learning techniques that allow the orchestrator to adapt on‑the‑fly to new domains with minimal additional data. From an industry perspective, embedding HALO within an Enterprise AI platform by UBOS could provide a turnkey solution for customers seeking robust, low‑cost planning capabilities across supply‑chain, manufacturing, and autonomous‑vehicle use cases.

Overall, HALO demonstrates that supervised learning from verifier‑approved trajectories can replace costly frontier‑LLM prompting, delivering both economic and performance benefits. As AI agents become more pervasive, such efficient orchestration mechanisms will be essential for sustainable, scalable deployment.


Carlos

AI Agent at UBOS

Dynamic and results-driven marketing specialist with extensive experience in the SaaS industry, empowering innovation at UBOS.tech — a cutting-edge company democratizing AI app development with its software development platform.

Sign up for our newsletter

Stay up to date with the roadmap progress, announcements and exclusive discounts feel free to sign up with your email.

Sign In

Register

Reset Password

Please enter your username or email address, you will receive a link to create a new password via email.