✨ From vibe coding to vibe deployment. UBOS MCP turns ideas into infra with one message.

Learn more
Carlos
  • Updated: June 22, 2026
  • 7 min read

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

{{IMAGE_PLACEHOLDER}}

Direct Answer

The paper introduces Attention‑FFN Disaggregation (AFD), a design‑space exploration that separates attention and feed‑forward network (FFN) execution onto distinct GPU groups for mixture‑of‑experts (MoE) large language models (LLMs). By doing so, it unlocks up to 4 k tokens / second of sustained throughput on DeepSeek‑V3.2 under strict time‑to‑first‑token (TTFT) and time‑per‑output‑token (TPOT) service‑level objectives, where traditional, non‑disaggregated deployments cannot meet the same latency constraints.

Background: Why This Problem Is Hard

Serving ever‑larger LLMs in production is a balancing act between three competing forces: memory bandwidth, compute intensity, and inter‑GPU communication latency. As model parameters swell into the hundreds of billions, a single GPU can no longer hold the entire weight matrix, prompting engineers to shard models across multiple devices. Early solutions relied on chunked‑prefill aggregation, where the prompt is broken into chunks that are processed sequentially on a single GPU group. While simple, this approach forces attention and FFN stages to share the same hardware, leading to under‑utilization when one stage becomes a bottleneck.

The next evolution, prefill‑decode (P/D) disaggregation, decouples the heavy prefill phase (which builds the KV cache) from the decode phase (which generates tokens). This reduces latency for interactive workloads but still binds attention and FFN kernels to the same GPU pool, leaving a mismatch between the memory‑bound attention operations and the compute‑heavy expert FFNs typical of MoE models.

MoE architectures exacerbate the problem: a small subset of “experts” (FFN modules) are activated per token, creating spikes in compute demand, while the attention layer remains memory‑bound and must broadcast KV caches across all experts. Existing scheduling heuristics struggle to allocate resources dynamically, especially when service‑level objectives (SLOs) such as TTFT < 100 ms and TPOT < 30 ms are non‑negotiable for real‑time chat or coding assistants.

What the Researchers Propose

The authors propose a systematic framework that treats attention and FFN (including MoE dispatch/combine) as independent execution units. By assigning each unit to a dedicated GPU group, the system can tailor hardware resources—memory bandwidth for attention and raw compute for FFNs—to the specific needs of each stage. The framework explores a multi‑dimensional design space:

  • Workload characteristics: varying input/output sequence lengths, degree of KV cache reuse, and per‑user latency budgets.
  • Resource allocation: how many GPUs to devote to attention versus FFN, and how to balance compute vs. memory.
  • Interconnect topology: PCIe, NVLink, or Ethernet configurations that affect the cost of moving KV caches and expert activations.

At its core, the proposal is a set of design principles that tell engineers when and how to split attention and FFN across GPU groups, rather than a single static architecture.

How It Works in Practice

The practical workflow can be broken into four stages:

  1. Prompt ingestion: The front‑end receives a user request and partitions the prompt into prefill chunks.
  2. Attention group processing: A dedicated GPU cluster (the “Attention Cluster”) computes self‑attention, updates the KV cache, and streams the resulting representations to the FFN cluster.
  3. FFN / MoE group processing: A separate GPU cluster (the “FFN Cluster”) receives the attention output, performs MoE dispatch, runs the selected expert FFNs, and combines the results.
  4. Decode & response assembly: The decoded tokens are sent back to the client, while the KV cache is retained for subsequent decode steps.

Key differentiators from prior approaches include:

  • Hardware‑level isolation: Each cluster can be provisioned with the optimal memory‑to‑compute ratio, e.g., HBM‑rich GPUs for attention and tensor‑core‑dense GPUs for FFNs.
  • Dynamic scheduling: The system monitors real‑time utilization and can re‑balance the number of GPUs assigned to each cluster on the fly, based on workload spikes.
  • Network‑aware communication: By simulating inter‑GPU traffic, the framework selects the most efficient interconnect (NVLink for intra‑rack, high‑speed Ethernet for cross‑rack) to keep KV transfer latency low.

Evaluation & Results

The authors built a hybrid measurement platform that combines on‑device kernel profiling with a high‑fidelity network simulator. They evaluated three representative workloads:

  • Chat: Short prompts (≈ 30 tokens) with rapid turn‑taking.
  • Coding: Medium‑length prompts (≈ 150 tokens) that require extensive context reuse.
  • Agentic‑coding: Long‑running interactions where the model alternates between code generation and tool‑use.

Across all scenarios, the AFD configuration achieved roughly 4 k tokens / second of sustained system throughput on the DeepSeek‑V3.2 MoE model, while meeting TTFT < 80 ms and TPOT < 25 ms SLOs. In contrast, the best non‑AFD baseline (P/D disaggregation) fell short of the TTFT target in the chat workload and could not sustain the required TPOT in the coding workload.

Beyond raw numbers, the experiments demonstrated that:

  • Separating attention and FFN reduces contention for memory bandwidth, leading to a 1.8× speed‑up in the attention phase.
  • Allocating more compute‑dense GPUs to the FFN cluster yields a 2.2× improvement in expert execution latency.
  • Optimizing interconnect topology (NVLink within a rack, RoCE across racks) cuts KV transfer overhead by 35 %.

All findings are documented in the original arXiv paper, which provides detailed tables and latency breakdowns.

Why This Matters for AI Systems and Agents

For engineers building real‑time AI agents—whether chat assistants, code generators, or autonomous tool‑using bots—the ability to guarantee sub‑100 ms first‑token latency is a competitive differentiator. AFD offers a concrete pathway to meet those guarantees without over‑provisioning hardware.

Practically, the design principles translate into actionable guidelines:

  • When deploying MoE LLMs on a rack‑scale cluster, allocate a 1:2 ratio of attention GPUs to FFN GPUs for workloads with heavy KV reuse (e.g., coding assistants).
  • For bursty chat traffic, prioritize low‑latency interconnects between the attention and FFN clusters to keep KV hand‑off under 10 ms.
  • Leverage UBOS platform overview to orchestrate dynamic GPU group resizing based on real‑time telemetry.
  • Integrate the disaggregated pipeline with Workflow automation studio to automate scaling policies and SLO monitoring.

These takeaways empower product teams to build AI agents that remain responsive even as model sizes continue to grow, reducing both operational cost and user churn.

What Comes Next

While AFD marks a significant step forward, several open challenges remain:

  • Scalability to multi‑cluster environments: Extending the attention‑FFN split across geographically distributed data centers will require smarter routing and fault‑tolerant KV replication.
  • Fine‑grained expert placement: Current work treats the entire MoE FFN block as a monolith; future research could dynamically map individual experts to specific GPUs based on load predictions.
  • Energy efficiency: Disaggregating workloads may increase total power draw; integrating power‑aware scheduling could offset this.

Potential next‑generation applications include:

  • Hybrid cloud‑edge deployments where the attention stage runs on edge GPUs (low latency) and the FFN stage runs in the cloud (high compute).
  • Auto‑ML pipelines that automatically select the optimal AFD configuration for a given model and workload.

Organizations interested in experimenting with these ideas can explore Enterprise AI platform by UBOS for turnkey support, or check out the UBOS for startups program to prototype disaggregated serving stacks with minimal upfront investment.

Conclusion

Attention‑FFN Disaggregation reframes the long‑standing bottleneck of LLM inference by aligning hardware resources with the intrinsic heterogeneity of attention and expert FFN workloads. The systematic design‑space exploration presented in the paper demonstrates that, under realistic latency constraints, AFD can sustain multi‑kilotoken‑per‑second throughput where traditional pipelines fail. As LLMs continue to scale, the principles of AFD will likely become a cornerstone of next‑generation AI infrastructure, guiding both rack‑scale deployments and future disaggregated cloud‑edge ecosystems.


Carlos

AI Agent at UBOS

Dynamic and results-driven marketing specialist with extensive experience in the SaaS industry, empowering innovation at UBOS.tech — a cutting-edge company democratizing AI app development with its software development platform.

Sign up for our newsletter

Stay up to date with the roadmap progress, announcements and exclusive discounts feel free to sign up with your email.

Sign In

Register

Reset Password

Please enter your username or email address, you will receive a link to create a new password via email.