- Updated: July 1, 2026
Human-Less LLM Serving: Quantifying the Human Tax on Throughput
Direct Answer
The paper introduces the concept of human‑less LLM serving, a performance‑focused serving mode that removes the latency guarantees designed for interactive users (TTFT and TPOT) when workloads run autonomously in tight loops. By quantifying the “human tax”—the throughput loss incurred by enforcing human‑centric SLAs—the authors show that up to 93 % of potential throughput can be reclaimed, especially for long‑context and highly concurrent workloads.

Background: Why This Problem Is Hard
Large language model (LLM) serving platforms have been built around two user‑experience metrics:
- Time‑to‑First‑Token (TTFT): the latency from request arrival to the generation of the first token.
- Time‑per‑Output‑Token (TPOT): the steady‑state latency for each subsequent token.
These SLAs make sense for interactive chat or search scenarios where a human watches the response unfold. Modern AI agents, however, often execute long‑horizon tasks that involve thousands of LLM calls in rapid succession—think autonomous planning loops, data‑extraction pipelines, or reinforcement‑learning‑from‑human‑feedback (RLHF) simulations. In such cases, the human never observes the intermediate latency, yet the serving stack still enforces TTFT/TPOT limits.
Existing serving systems (e.g., SGLang, Sarathi‑Serve) embed these limits deep in request scheduling, token batching, and resource allocation. The trade‑off is clear: meeting tight TTFT/TPOT guarantees requires reserving compute headroom, pre‑emptively throttling concurrency, and padding request pipelines—all of which shrink raw throughput. Because the infrastructure treats all traffic uniformly, autonomous workloads pay a hidden “human tax” without any benefit.
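To make the two metrics concrete, here is a minimal sketch of how TTFT and TPOT could be computed from per-token timestamps. The `RequestTrace` structure and the `meets_sla` helper are hypothetical names introduced for illustration; they are not taken from SGLang, Sarathi‑Serve, or the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one request (hypothetical structure)."""
    arrival: float            # when the request reached the server
    token_times: List[float]  # emission time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time-to-first-token: request arrival to the first generated token."""
    if not trace.token_times:
        raise ValueError("trace contains no output tokens")
    return trace.token_times[0] - trace.arrival


def tpot(trace: RequestTrace) -> float:
    """Time-per-output-token: mean gap between tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single-token response has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def meets_sla(trace: RequestTrace, ttft_limit: float, tpot_limit: float) -> bool:
    """True when both latency targets hold for this request."""
    return ttft(trace) <= ttft_limit and tpot(trace) <= tpot_limit
```

A request that violates either limit counts as an SLA miss, which is why a scheduler that must honor both tends to keep batches small and leave headroom unused.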
What the Researchers Propose
The authors propose a human‑less serving paradigm that decouples SLA enforcement from workloads that do not need human‑visible latency. Their framework consists of three logical components:
- Workload Classifier: a lightweight module that tags incoming requests as either human‑centric or autonomous based on metadata (e.g., API key, request pattern, or explicit flag).
- Dynamic Scheduler: a dual‑queue scheduler that applies strict TTFT/TPOT constraints only to the human‑centric queue, while allowing the autonomous queue to operate under a “throughput‑first” policy.
- Resource Governor: a feedback loop that reallocates compute slices between the two queues in real time, ensuring that the autonomous queue can fully saturate the GPU/CPU pipelines when human traffic is low.
By exposing a new SLA configuration—human‑less mode—the system can automatically switch to the throughput‑optimal path for programmatic workloads, eliminating the unnecessary latency padding that traditionally drags down performance.
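The Workload Classifier described above could look roughly like the following sketch. The header name `x-human-less`, the `Request` type, and the rule order (explicit flag first, then API key, then a human‑centric default) are all assumptions made for illustration; the summary does not specify how the authors implement tagging.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, FrozenSet


class SlaClass(Enum):
    HUMAN_CENTRIC = "human_centric"
    AUTONOMOUS = "autonomous"


@dataclass
class Request:
    api_key: str
    headers: Dict[str, str] = field(default_factory=dict)


# Hypothetical header a client would set to opt in to (or out of) human-less mode.
HUMAN_LESS_HEADER = "x-human-less"


def classify(request: Request, autonomous_keys: FrozenSet[str]) -> SlaClass:
    """Tag a request using an explicit flag first, then its API key."""
    flag = request.headers.get(HUMAN_LESS_HEADER, "").strip().lower()
    if flag == "true":
        return SlaClass.AUTONOMOUS
    if flag == "false":
        return SlaClass.HUMAN_CENTRIC
    if request.api_key in autonomous_keys:
        return SlaClass.AUTONOMOUS
    # Unknown traffic keeps latency guarantees, the safer default.
    return SlaClass.HUMAN_CENTRIC
```

Defaulting unknown traffic to the human‑centric class is a deliberate choice in this sketch: a misclassified batch job only loses throughput, whereas a misclassified chat session would lose its latency guarantee.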
How It Works in Practice
At a high level, the workflow proceeds as follows:
- Request Ingestion: Every incoming LLM call arrives at the front‑door API gateway.
- Classification: The Workload Classifier inspects request headers or payload signatures to decide the SLA class.
- Queue Placement: Human‑centric requests are placed in the latency‑aware queue, while autonomous calls go to the throughput‑aware queue.
- Scheduling: The Dynamic Scheduler pulls from both queues based on current system load. When the latency queue is empty, the scheduler aggressively batches autonomous requests, maximizing token‑level parallelism.
- Execution: The Resource Governor monitors GPU utilization and dynamically expands the compute slice for the autonomous queue, allowing larger batch sizes and longer context windows without violating any human‑centric guarantees.
- Response Delivery: Results are returned to the caller. Autonomous callers typically consume the full response programmatically, so the system can skip the intermediate token‑streaming optimizations that are required for human‑visible latency.
This separation is the key differentiator from traditional serving stacks, which treat every request as if a human is watching. By allowing the autonomous queue to “fill the pipe,” the system achieves near‑theoretical hardware utilization.
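The queue placement and scheduling steps can be sketched as a small dual‑queue scheduler. This is an illustrative simplification, not the authors' implementation: the per‑step token budget, the smaller cap applied while human‑centric requests are waiting, and the names `Pending` and `DualQueueScheduler` are all assumptions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Pending:
    request_id: str
    tokens: int       # tokens this request needs processed in the next step
    autonomous: bool  # True for the throughput-first class


class DualQueueScheduler:
    """Builds one batch per step under a token budget (hypothetical policy)."""

    def __init__(self, token_budget: int, latency_token_cap: int):
        self.token_budget = token_budget            # cap when only autonomous work waits
        self.latency_token_cap = latency_token_cap  # smaller cap while humans are waiting
        self.latency_queue: Deque[Pending] = deque()
        self.throughput_queue: Deque[Pending] = deque()

    def submit(self, req: Pending) -> None:
        queue = self.throughput_queue if req.autonomous else self.latency_queue
        queue.append(req)

    def next_batch(self) -> List[Pending]:
        # Keep steps short while human-centric requests wait; otherwise fill the pipe.
        budget = self.latency_token_cap if self.latency_queue else self.token_budget
        batch: List[Pending] = []
        used = 0
        # Human-centric requests are admitted first, autonomous ones use what is left.
        for queue in (self.latency_queue, self.throughput_queue):
            while queue and used + queue[0].tokens <= budget:
                req = queue.popleft()
                batch.append(req)
                used += req.tokens
        if not batch:
            # Avoid head-of-line deadlock: admit one oversized request on its own.
            for queue in (self.latency_queue, self.throughput_queue):
                if queue:
                    batch.append(queue.popleft())
                    break
        return batch
```

With a budget of 8192 tokens and a latency cap of 512, a step that includes a waiting chat request stays small, while a step with only autonomous requests can pack the full budget.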
Evaluation & Results
The researchers conducted a systematic measurement study across two popular serving frameworks—SGLang and Sarathi‑Serve—varying four dimensions:
- Chunk size (the number of prompt tokens processed per chunked‑prefill step)
- SLA settings (tight vs. relaxed TTFT/TPOT)
- Context length (up to 64 K tokens)
- Concurrency level (simultaneous request streams)
Key scenarios included:
- Long‑context generation: Simulating document‑level summarization with 32 K–64 K token windows.
- High‑concurrency loops: Running 128 parallel autonomous agents that each issue 10 K LLM calls per minute.
Findings:
- When TTFT was tightened to production‑typical values (≤ 50 ms), throughput dropped by 60 %–93 % compared with the human‑less baseline, with the largest loss observed at 64 K token contexts.
- Throughput degradation grew non‑linearly with concurrency; at 256 concurrent streams, the human tax approached 90 % for both serving stacks.
- The “human‑less” prototype reclaimed most of the lost capacity, delivering up to a 12× increase in requests per second for autonomous workloads without affecting latency‑sensitive traffic.
These results demonstrate that the hidden cost of universal human‑centric SLAs is not marginal—it can dominate the performance envelope for modern AI pipelines that rely on massive context windows and high parallelism.
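The relationship between a throughput loss and the speedup available from removing it is simple arithmetic, sketched below. The function names are introduced here for illustration, and the definition of the human tax as one minus the ratio of SLA‑constrained to human‑less throughput is an assumption consistent with how the losses above are reported.

```python
def human_tax(sla_throughput: float, human_less_throughput: float) -> float:
    """Fraction of human-less throughput lost when SLAs are enforced."""
    if human_less_throughput <= 0:
        raise ValueError("human_less_throughput must be positive")
    return 1.0 - sla_throughput / human_less_throughput


def speedup_from_tax(tax: float) -> float:
    """Throughput multiplier obtained by fully reclaiming a given tax."""
    if not 0.0 <= tax < 1.0:
        raise ValueError("tax must be in [0, 1)")
    return 1.0 / (1.0 - tax)


if __name__ == "__main__":
    for tax in (0.60, 0.93):
        print(f"tax={tax:.0%} -> speedup ceiling {speedup_from_tax(tax):.1f}x")
```

Under this definition a 60 % tax corresponds to a 2.5× ceiling and a 93 % tax to roughly 14.3×, so a reported gain of up to 12× falls inside that range.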
Why This Matters for AI Systems and Agents
For engineers building autonomous agents, data‑processing pipelines, or large‑scale RLHF loops, the human tax translates directly into higher cloud spend and longer experiment cycles. By adopting a workload‑class‑aware SLA model, teams can:
- Reduce compute cost per request by up to roughly 90 % for batch‑oriented workloads, assuming the reclaimed throughput translates directly into fewer GPU‑hours.
- Accelerate iteration speed for research experiments that involve millions of LLM calls.
- Improve resource fairness in multi‑tenant environments, ensuring that interactive users still receive low latency while background jobs consume spare capacity.
- Enable new product designs such as continuous‑learning agents that run 24/7 without throttling.
Practically, organizations can integrate the human‑less mode into existing stacks using the UBOS platform, which already supports dynamic SLA policies and workload tagging. For teams focused on conversational AI, the ChatGPT and Telegram integration can be configured to apply strict TTFT limits only to end‑user chat sessions, while background analytics pipelines run in human‑less mode.
What Comes Next
While the study provides compelling evidence, several open challenges remain:
- Fine‑grained classification: Determining the optimal granularity for tagging requests (per‑API key vs. per‑function) without adding overhead.
- Adaptive SLA negotiation: Allowing clients to request temporary latency guarantees for autonomous jobs that become user‑visible at specific checkpoints.
- Cross‑framework standardization: Extending the human‑less concept to other serving ecosystems (e.g., Triton Inference Server, vLLM) to avoid vendor lock‑in.
- Security and isolation: Ensuring that the throughput‑first queue cannot starve latency‑critical traffic under malicious load spikes.
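One way to think about the starvation concern is a governor that never hands the autonomous queue more than a fixed ceiling of compute. The sketch below is a hypothetical control rule, not the paper's Resource Governor: the reserved floor, the step size, and the use of queue depth as the only signal are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class GovernorConfig:
    latency_floor: float = 0.25  # share always reserved for human-centric traffic
    step: float = 0.125          # maximum change in share per control tick


def next_autonomous_share(current: float, latency_queue_depth: int,
                          config: GovernorConfig) -> float:
    """Return the autonomous queue's compute share for the next tick."""
    ceiling = 1.0 - config.latency_floor
    if latency_queue_depth > 0:
        # Human-centric work is waiting: hand capacity back one step at a time.
        target = current - config.step
    else:
        # No human-centric work is waiting: let the autonomous queue grow.
        target = current + config.step
    # Clamp so the autonomous share never eats into the reserved floor.
    return min(ceiling, max(0.0, target))
```

Because the share is clamped to the ceiling on every tick, a burst of autonomous requests cannot push human‑centric traffic below its reserved floor, at the cost of leaving that floor idle when no interactive traffic arrives.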
Future research could explore machine‑learning‑driven schedulers that predict workload patterns and pre‑emptively reallocate resources. Additionally, integrating the approach with the Enterprise AI platform by UBOS would give large organizations a unified control plane for SLA policies across on‑prem and cloud deployments.
Developers interested in experimenting with the prototype can find the code and detailed benchmark scripts in the original arXiv paper. The authors also release a Docker‑based demo that showcases a switchable human‑less mode on a single GPU, making it easy to reproduce the reported recovery of the 60 %–93 % of throughput lost to human‑centric SLAs.
By rethinking the default assumption that every LLM request needs human‑centric latency, the community can unlock a new efficiency frontier for AI workloads that run at scale.