- Updated: June 19, 2026
Locality-Aware Redundancy Pruning for LLM Depth Compression
Direct Answer
Locality‑Aware Redundancy Pruning (LoRP) is a training‑free, one‑shot depth‑pruning framework that trims large language models (LLMs) by exploiting how redundant representations are distributed across layers. By measuring inter‑layer similarity with a Representation Locality Score (RLS), LoRP decides which layers can be removed with little loss in perplexity or downstream task performance, delivering faster inference and a smaller memory footprint.
Background: Why This Problem Is Hard
LLM depth pruning promises to cut inference latency, but the underlying challenge is that redundancy is not uniform. Transformers stack dozens of attention and feed‑forward blocks, and many of those blocks learn overlapping features. However, the pattern of overlap varies dramatically between model families (e.g., decoder‑only vs. encoder‑decoder) and even between model sizes within the same family.
Existing pruning pipelines typically rely on one of two assumptions:
- Local importance: Each layer is evaluated in isolation, using metrics such as weight magnitude or activation sparsity. This ignores the fact that a layer’s contribution may be compensated by another layer elsewhere in the network.
- Fixed redundancy distribution: Some methods assume that redundancy is evenly spread, applying a uniform pruning ratio across all depths. When redundancy clusters in a specific region, uniform pruning either over‑prunes useful layers or leaves unnecessary ones untouched.
Both approaches suffer from a lack of global perspective, leading to sub‑optimal compression, degraded language modeling quality, and unpredictable effects on downstream tasks such as question answering or summarization. As enterprises push LLMs into latency‑sensitive products—chatbots, real‑time assistants, and edge deployments—the need for a principled, architecture‑agnostic depth‑pruning technique has become acute.
What the Researchers Propose
The authors introduce Locality‑Aware Redundancy Pruning (LoRP), a framework that first quantifies how “local” or “global” redundancy is within a given model. The core of LoRP is the Representation Locality Score (RLS), a scalar derived from the similarity of hidden states across every pair of layers, computed on a modest calibration dataset.
Key components of LoRP include:
- Global similarity matrix: Pairwise cosine similarity of layer‑wise hidden representations, aggregated over the calibration set.
- Clustering engine: A lightweight unsupervised algorithm (e.g., hierarchical agglomerative clustering) that groups layers with high mutual similarity, revealing redundancy clusters.
- Residual redundancy estimator: Within each cluster, LoRP measures the remaining intra‑cluster variance after hypothetically removing a layer, guiding how many layers to prune from that cluster.
By aligning pruning decisions with the actual distribution of representational overlap, LoRP adapts to both localized redundancy (e.g., a block of consecutive layers that learn similar patterns) and globally distributed redundancy (e.g., scattered similar layers across the depth).
How It Works in Practice
The LoRP workflow can be broken down into four sequential steps, each of which can be executed without any gradient updates or fine‑tuning:
1. Calibration Data Collection
A small, task‑agnostic dataset—often a few hundred sentences—is fed through the intact LLM. The hidden states from every transformer block are cached.
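The caching step can be pictured with a toy stand-in for the model. The sketch below is illustrative only: `collect_hidden_states`, the lambda "blocks", and the two-dimensional inputs are all assumptions made for this example, with plain Python lists standing in for tensors. It shows one forward pass per calibration sample, with every block's output stored.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def collect_hidden_states(
    layers: Sequence[Layer], inputs: Sequence[Vector]
) -> List[List[Vector]]:
    """Return cache[layer][sample]: the output of every layer for every input."""
    cache: List[List[Vector]] = [[] for _ in layers]
    for x in inputs:
        h = list(x)
        for idx, layer in enumerate(layers):
            h = layer(h)                # forward through one block
            cache[idx].append(list(h))  # store a copy of this block's output
    return cache


# Hypothetical toy blocks acting on 2-d vectors (not real transformer layers).
toy_layers = [
    lambda h: [h[0] + 1.0, h[1]],
    lambda h: [h[0], h[1] * 2.0],
    lambda h: [h[0] + h[1], h[1]],
]
calibration = [[1.0, 1.0], [0.0, 2.0]]
cache = collect_hidden_states(toy_layers, calibration)
```

With a real LLM the same idea would typically be implemented with the framework's own mechanism for exposing per-layer hidden states, rather than a manual loop like this one.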
2. Representation Locality Scoring
For each pair of layers (i, j), LoRP computes the average cosine similarity of their hidden states across the calibration set. The resulting similarity matrix S captures how much information is shared between any two depths.
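A minimal sketch of the similarity matrix S, assuming the cached hidden states are organized as `hidden[layer][sample]` with one vector per sample (for example, a pooled hidden state). The function names and that layout are assumptions for illustration; the article does not specify how S is reduced to the scalar RLS, so this sketch stops at S.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(hidden: List[List[Vector]]) -> List[List[float]]:
    """hidden[layer][sample] -> S[i][j], the mean cosine similarity over samples."""
    n_layers = len(hidden)
    n_samples = len(hidden[0])
    S = [[0.0] * n_layers for _ in range(n_layers)]
    for i in range(n_layers):
        for j in range(i, n_layers):
            total = sum(cosine(hidden[i][k], hidden[j][k]) for k in range(n_samples))
            S[i][j] = S[j][i] = total / n_samples  # S is symmetric
    return S


# Made-up hidden states: 3 layers, 2 calibration samples, 2-d vectors.
hidden = [
    [[1.0, 0.0], [0.0, 1.0]],  # layer 0
    [[2.0, 0.0], [0.0, 3.0]],  # layer 1: same directions as layer 0
    [[0.0, 1.0], [1.0, 0.0]],  # layer 2: orthogonal to layer 0
]
S = similarity_matrix(hidden)
```

In this toy input, layers 0 and 1 point in the same directions for every sample, so their entry in S is 1, while layer 2 is orthogonal to both and scores 0.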
3. Redundancy Clustering
The similarity matrix feeds into a clustering routine that groups layers with high mutual similarity. Each cluster represents a region of the network where redundancy is concentrated.
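One way to sketch the clustering step is average-linkage agglomerative clustering applied directly to S. The stopping `threshold` and the choice of average linkage are assumptions for illustration; the article only names hierarchical agglomerative clustering as an example of a suitable routine.

```python
from typing import List, Optional, Tuple


def cluster_layers(S: List[List[float]], threshold: float) -> List[List[int]]:
    """Merge the two most similar clusters until no pair reaches `threshold`."""
    clusters: List[List[int]] = [[i] for i in range(len(S))]

    def linkage(a: List[int], b: List[int]) -> float:
        # Average similarity over all cross-cluster layer pairs.
        return sum(S[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = -float("inf")
        pair: Optional[Tuple[int, int]] = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = linkage(clusters[x], clusters[y])
                if score > best:
                    best, pair = score, (x, y)
        if pair is None or best < threshold:
            break  # remaining clusters are too dissimilar to merge
        x, y = pair
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


# Made-up similarity matrix for a 4-layer model.
S = [
    [1.00, 0.95, 0.20, 0.10],
    [0.95, 1.00, 0.30, 0.20],
    [0.20, 0.30, 1.00, 0.90],
    [0.10, 0.20, 0.90, 1.00],
]
clusters = cluster_layers(S, threshold=0.8)
```

Because the routine works on similarities rather than layer positions, a cluster may contain non-adjacent layers, which is what allows it to capture globally distributed as well as localized redundancy.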
4. Targeted Depth Pruning
Within each cluster, LoRP evaluates the impact of removing each layer by measuring the residual variance of the cluster’s representations. Layers that cause the smallest increase in variance are marked for removal. The final pruned architecture is assembled by stitching together the retained layers, preserving the original input‑output interface.
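The selection and stitching steps can be sketched as follows. Note the substitution: the article's residual-variance estimator is not specified in enough detail to reproduce, so this example ranks layers by their mean similarity to the other cluster members as a simple stand-in. `pick_layers_to_prune`, `stitch`, `keep_per_cluster`, and the toy layers are all hypothetical names introduced here.

```python
from typing import Callable, List, Sequence, Set

Vector = List[float]
Layer = Callable[[Vector], Vector]


def mean_similarity(S: List[List[float]], i: int, members: List[int]) -> float:
    """Average similarity of layer i to the other layers in its cluster."""
    others = [j for j in members if j != i]
    return sum(S[i][j] for j in others) / len(others)


def pick_layers_to_prune(
    S: List[List[float]], clusters: List[List[int]], keep_per_cluster: int = 1
) -> Set[int]:
    """Stand-in criterion (NOT the article's variance estimator):
    drop the most redundant layers in each cluster, keeping `keep_per_cluster`."""
    removed: Set[int] = set()
    for members in clusters:
        if len(members) <= keep_per_cluster:
            continue
        ranked = sorted(members, key=lambda i: mean_similarity(S, i, members), reverse=True)
        removed.update(ranked[: len(members) - keep_per_cluster])
    return removed


def stitch(layers: Sequence[Layer], removed: Set[int]) -> Callable[[Vector], Vector]:
    """Chain the retained layers in their original order."""
    kept = [layer for i, layer in enumerate(layers) if i not in removed]

    def forward(x: Vector) -> Vector:
        h = list(x)
        for layer in kept:
            h = layer(h)
        return h

    return forward


# Made-up similarities for one 3-layer cluster; layer 1 is the most redundant.
S = [
    [1.0, 0.9, 0.6],
    [0.9, 1.0, 0.8],
    [0.6, 0.8, 1.0],
]
toy_layers = [
    lambda h: [v + 1.0 for v in h],
    lambda h: [v + 0.0 for v in h],  # near-identity block
    lambda h: [v * 2.0 for v in h],
]
removed = pick_layers_to_prune(S, clusters=[[0, 1, 2]], keep_per_cluster=2)
pruned_model = stitch(toy_layers, removed)
```

Stitching works here because every block maps a vector to a vector of the same size, so dropping one leaves the input and output interface unchanged. Transformer blocks share that property, since each reads and writes hidden states of the same width.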
What sets LoRP apart from prior methods is its global awareness. Instead of applying a blanket pruning ratio, LoRP tailors the pruning intensity to the model's actual redundancy landscape, which reduces the risk of discarding a critical transformation.
Evaluation & Results
The authors benchmark LoRP on three representative LLM families:
- Decoder‑only models (e.g., a 7B‑parameter GPT‑style architecture)
- Encoder‑decoder models (e.g., a 13B‑parameter T5 variant)
- Mixture‑of‑Experts (MoE) models with sparsely activated layers
Each model is evaluated on two fronts:
Language Modeling Perplexity
After pruning, the perplexity on a held‑out validation set rises by less than 2 % on average, even when up to 30 % of the depth is removed. In contrast, uniform one‑shot pruning of the same magnitude typically incurs a 5–8 % perplexity increase.
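For reference, perplexity is the exponential of the mean negative log-likelihood per token, and the percentages above are relative changes in that quantity. The sketch below uses made-up token log-probabilities purely to show the arithmetic; none of the numbers come from the paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_increase(baseline: float, pruned: float) -> float:
    """Fractional change of the pruned model's perplexity over the baseline."""
    return (pruned - baseline) / baseline


# Illustrative values only: two tokens scored by each model.
base_ppl = perplexity([-2.0, -2.0])      # mean NLL = 2.00
pruned_ppl = perplexity([-2.0, -2.04])   # mean NLL = 2.02
change = relative_increase(base_ppl, pruned_ppl)
```

Here a 0.02 rise in mean negative log-likelihood corresponds to a relative perplexity increase of exp(0.02) - 1, roughly 2 %.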
Downstream Task Accuracy
Zero‑shot and few‑shot performance on benchmarks such as SuperGLUE, MMLU, and open‑domain QA is preserved within 1 % of the baseline. Notably, for the encoder‑decoder family, LoRP even improves factual consistency on summarization tasks, suggesting that removing redundant layers can act as a regularizer.
Across all experiments, inference latency drops by 20–35 % on a single GPU, and memory consumption is reduced proportionally to the number of pruned layers. These gains are achieved without any additional fine‑tuning, highlighting LoRP’s practicality for production pipelines.
Why This Matters for AI Systems and Agents
Depth‑pruned LLMs directly translate into faster response times and lower operational costs—critical factors for AI agents that must operate under strict latency budgets (e.g., conversational assistants, real‑time recommendation engines, or autonomous decision‑making loops). By preserving model quality while shedding unnecessary depth, LoRP enables:
- Scalable agent orchestration: Multiple agents can share a single pruned backbone, freeing GPU memory for additional context windows or parallel inference.
- Edge deployment: Smaller, faster models fit within the compute envelope of on‑device inference chips, opening pathways for privacy‑preserving AI assistants.
- Cost‑effective experimentation: Teams can iterate on prompts, tool integrations, and workflow automations without incurring the expense of full‑scale model hosting.
Practitioners can apply LoRP to existing models hosted on the UBOS platform (see the UBOS platform overview), then combine the compressed model with the Workflow automation studio to build end‑to‑end AI pipelines. For conversational products, the ChatGPT and Telegram integration can run on a leaner backbone, delivering smoother user experiences under the same hardware budget.
What Comes Next
While LoRP demonstrates strong compression without fine‑tuning, several avenues remain open for further research:
- Dynamic pruning at inference time: Extending LoRP to decide on‑the‑fly which layers to skip based on input complexity could yield additional latency reductions.
- Cross‑modal redundancy analysis: Applying representation locality concepts to multimodal models (vision‑language, audio‑text) may uncover new compression opportunities.
- Integration with quantization: Combining depth pruning with weight quantization could push memory savings beyond the 30 % depth reduction demonstrated.
- Automated calibration set selection: Researching how to choose the most informative calibration sentences could make LoRP even more robust across domains.
From a product perspective, developers can explore LoRP‑compressed models within the Enterprise AI platform by UBOS to power large‑scale customer‑facing agents. Startups may find that the UBOS for startups offering provides a quick path to prototyping AI services that run efficiently on modest cloud instances.
Finally, the open‑source community is invited to contribute clustering heuristics and variance estimators, fostering a collaborative ecosystem around locality‑aware compression.
References
For the full technical details, see the original arXiv paper.
Published on the UBOS blog.