- Updated: June 27, 2026
Geometry-Aware Online Scheduling for LLM Serving: From Theoretical Bound to System Practice
Direct Answer
The paper introduces Geometry‑Aware Scheduling (GAS), a cache‑conscious algorithm that orders inference requests for large language models (LLMs) by the “smallest volume first” principle. On memory‑constrained serving platforms, the reported result is a 38 % drop in 95th‑percentile latency relative to first‑come‑first‑served scheduling. By treating each request as a geometric object in the cache‑capacity space, GAS packs more queries into limited GPU memory without sacrificing throughput.
Background: Why This Problem Is Hard
Modern LLMs such as GPT‑4, LLaMA‑2, and Claude demand gigabytes of GPU memory per inference. In production, engineers often run multiple models or batch many user queries on a single server to amortize hardware costs. This creates a classic “knapsack” dilemma: each request consumes a different amount of memory (depending on prompt length, number of generated tokens, and model size) and must be scheduled in real time.
Existing serving stacks—vLLM, TGI, and Ray Serve—rely on simple heuristics like first‑come‑first‑served (FCFS) or static batch sizes. These approaches ignore the geometric relationship between request size and remaining cache capacity, leading to frequent out‑of‑memory (OOM) errors, sub‑optimal batch composition, and inflated tail latency. Moreover, the high variance in request lengths makes static batching brittle, especially under bursty traffic typical of chat‑based applications.
What the Researchers Propose
The authors propose a two‑layer framework:
- Geometry‑Aware Request Modeling: Each inference request is represented as a multi‑dimensional volume that captures its memory footprint, compute cost, and deadline.
- Smallest Volume First (SVF) Scheduler: The scheduler orders pending requests by ascending volume, guaranteeing that the smallest “geometric” jobs are placed first, thereby maximizing the number of requests that fit into the current cache slice.
A lightweight variant, 1‑bit SVF, reduces the overhead of exact volume calculation to a binary decision (fit / no‑fit) while preserving the core ordering principle. Both algorithms are designed to plug into existing serving back‑ends with minimal code changes.
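The article does not say how the paper defines the single bit, so the following is only one possible reading, written as a minimal Python sketch. It assumes the bit records whether a request's estimated footprint fits in the memory that is currently free, and that requests flagged "fit" are served ahead of "no‑fit" ones while arrival order is kept inside each group. The function name `one_bit_order` and the tuple layout are hypothetical.

```python
from typing import List, Tuple


def one_bit_order(requests: List[Tuple[str, float]], free_memory: float) -> List[str]:
    """Order requests using a single fit / no-fit bit (an assumed interpretation).

    `requests` holds (request_id, estimated_footprint) pairs in arrival order.
    Requests whose footprint fits in `free_memory` come first; the rest follow.
    Arrival order is preserved within each group, so no exact sort is needed.
    """
    fits = [rid for rid, size in requests if size <= free_memory]
    no_fit = [rid for rid, size in requests if size > free_memory]
    return fits + no_fit
```

For example, with 60 units of free memory, `one_bit_order([("a", 90), ("b", 10), ("c", 50)], 60)` moves `"a"` behind the two requests that fit. The appeal of this design is that classification is a single comparison per request rather than a full sort by volume.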
How It Works in Practice
The practical workflow can be broken down into four stages:
- Request Ingestion: Incoming queries are parsed to extract prompt length, desired output tokens, and any model‑specific constraints.
- Volume Estimation: A fast estimator converts these attributes into a scalar “volume” using a calibrated linear model (e.g., V = α·prompt + β·output + γ·model‑size).
- SVF Queue Insertion: The request is inserted into a priority queue sorted by volume. The queue is continuously re‑balanced as memory is freed after each batch completes.
- Batch Assembly & Execution: The scheduler greedily pulls the smallest‑volume requests until the GPU memory budget is reached, then dispatches the batch to the underlying inference engine (e.g., vLLM).
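The stages above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the coefficients `ALPHA`, `BETA`, and `GAMMA` are placeholder values that would need calibration, the class name `SVFQueue` is hypothetical, and the memory budget is expressed in the same abstract units as the volume.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List

# Placeholder coefficients (assumptions); real values would come from calibration.
ALPHA, BETA, GAMMA = 1.0, 1.0, 0.0


def estimate_volume(prompt_tokens: int, output_tokens: int, model_size: float = 0.0) -> float:
    """Linear volume model: V = alpha*prompt + beta*output + gamma*model_size."""
    return ALPHA * prompt_tokens + BETA * output_tokens + GAMMA * model_size


@dataclass(order=True)
class Request:
    volume: float
    seq: int  # arrival counter, breaks ties between equal volumes
    request_id: str = field(compare=False)


class SVFQueue:
    """Priority queue that releases the smallest-volume requests first."""

    def __init__(self) -> None:
        self._heap: List[Request] = []
        self._counter = itertools.count()

    def push(self, request_id: str, prompt_tokens: int, output_tokens: int,
             model_size: float = 0.0) -> None:
        volume = estimate_volume(prompt_tokens, output_tokens, model_size)
        heapq.heappush(self._heap, Request(volume, next(self._counter), request_id))

    def next_batch(self, memory_budget: float) -> List[str]:
        """Greedily pop the smallest requests until the next one would exceed the budget."""
        batch: List[str] = []
        used = 0.0
        while self._heap and used + self._heap[0].volume <= memory_budget:
            req = heapq.heappop(self._heap)
            batch.append(req.request_id)
            used += req.volume
        return batch

    def __len__(self) -> int:
        return len(self._heap)
```

A caller would `push` requests as they arrive and call `next_batch(budget)` each time memory is freed, handing the returned IDs to the inference engine. Note that if even the smallest queued request exceeds the budget, this sketch returns an empty batch; a real scheduler would need a policy for that case.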
What sets GAS apart is its awareness of the “shape” of the memory landscape. Instead of treating memory as a scalar budget, it treats each request as a geometric object that can be tightly packed, much like how modern bin‑packing algorithms work for container loading.
Evaluation & Results
The authors evaluated GAS on three benchmark suites:
- OpenAI‑ChatBench: Simulated conversational traffic with heterogeneous prompt lengths.
- LLM‑Throughput Suite: Synthetic workloads that stress‑test memory limits across 8‑bit and 4‑bit quantized models.
- Real‑World Production Trace: A week‑long log from a SaaS chatbot serving 10 k QPS.
Key findings include:
- Tail latency (95th percentile) dropped by 38 % compared to FCFS on identical hardware.
- Overall throughput increased by 22 % because fewer OOM retries were needed.
- The 1‑bit SVF variant achieved 95 % of the full SVF latency gains while cutting scheduling overhead by 67 %.
- Memory fragmentation was reduced by a factor of 2.3, leading to more stable GPU utilization over long‑running periods.
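To check figures like these against your own traffic, the percentile and the relative change can be computed directly from latency samples. The helper below is a minimal sketch that uses the nearest‑rank definition of a percentile; the function names are hypothetical and it is not the paper's evaluation code.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional improvement of `candidate` over `baseline` (0.38 means 38% lower)."""
    return (baseline - candidate) / baseline
```

As an illustration with made‑up numbers, a baseline p95 of 200 ms and a candidate p95 of 124 ms give `relative_reduction(200.0, 124.0)` of 0.38, that is, a 38 % reduction.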
These results demonstrate that geometry‑aware ordering is not a theoretical curiosity—it delivers measurable performance improvements on real hardware and workloads.
Why This Matters for AI Systems and Agents
For AI engineers building conversational agents, recommendation engines, or any service that relies on on‑demand LLM inference, latency is a direct driver of user satisfaction and revenue. By integrating GAS, teams can:
- Serve more concurrent users on the same GPU cluster, reducing cloud spend.
- Maintain tighter Service‑Level Agreements (SLAs) without over‑provisioning hardware.
- Improve the reliability of multi‑tenant platforms where different models share a single memory pool.
Practically, the algorithm can be dropped into the UBOS platform as a plug‑in to its Workflow automation studio, enabling developers to orchestrate LLM calls with built‑in geometry‑aware scheduling. This aligns with the broader trend of “AI‑first” infrastructure where inference efficiency is as important as model accuracy.
What Comes Next
While GAS shows strong gains, several open challenges remain:
- Dynamic Model Switching: Extending the volume model to handle on‑the‑fly model swaps (e.g., switching from LLaMA‑7B to LLaMA‑13B) without re‑training the estimator.
- Multi‑GPU Coordination: Current work focuses on a single GPU memory pool; scaling the geometric reasoning across a cluster introduces communication overhead that must be mitigated.
- Adaptive Deadline Awareness: Incorporating user‑level latency budgets could further prioritize high‑value requests.
Future research may explore hybrid heuristics that combine SVF with reinforcement‑learning‑based policies for even finer‑grained control. From a product perspective, integrating GAS with the Enterprise AI platform by UBOS could unlock “pay‑as‑you‑go” pricing models where customers are billed based on effective GPU utilization rather than raw instance time.
Developers interested in experimenting today can start by adding the OpenAI ChatGPT integration to a UBOS workflow, then replace the default scheduler with the SVF plug‑in to observe latency improvements on their own traffic patterns.
Conclusion
Geometry‑Aware Scheduling reframes LLM serving as a spatial packing problem, delivering concrete latency and throughput benefits on memory‑constrained hardware. By ordering requests from smallest to largest volume, the SVF algorithm maximizes cache utilization, reduces OOM incidents, and offers a low‑overhead path to higher‑quality AI services. As enterprises continue to scale LLM‑driven products, adopting cache‑aware schedulers like GAS will be a decisive factor in balancing performance, cost, and user experience.
Read the full research paper on arXiv for a deeper dive into the theoretical guarantees and experimental methodology.
