- Updated: June 30, 2026
Less is More: Lightweight Prompt Compression for Question Answering Applications on Edge Devices
Direct Answer
The paper introduces CORE (Context‑Optimized Retrieval‑enhanced compression), a two‑stage, sentence‑level prompt compression technique that trims retrieval‑augmented generation (RAG) prompts without relying on auxiliary small language models. By cutting redundant context, CORE boosts answer accuracy on edge devices while slashing memory use, latency, and energy consumption.
Background: Why This Problem Is Hard
Agent‑driven question answering (QA) systems increasingly pair large language models (LLMs) with a retrieval layer that fetches relevant documents from a knowledge base. This retrieval‑augmented generation pipeline improves factuality, but it also introduces new bottlenecks:
- Noise and redundancy: Document‑level retrieval often returns whole passages that contain overlapping facts, filler sentences, or unrelated material.
- Token budget limits: LLM APIs charge per token, and both APIs and on‑device inference engines impose hard context limits (e.g., 2 000 tokens) that force developers to truncate prompts, risking loss of critical information.
- Edge constraints: Edge hardware such as NVIDIA Jetson AGX Orin or consumer smartphones has limited RAM, compute, and battery life, making large prompts prohibitively expensive.
Existing prompt‑compression solutions typically train or invoke a small language model (SLM) to score each sentence’s importance. While effective in cloud environments, those SLMs add memory footprints (often > 200 MB) and extra inference steps, negating the very efficiency gains they aim to provide. Consequently, developers lack a lightweight, on‑device method to prune context without sacrificing answer quality.
What the Researchers Propose
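CORE tackles the problem by replacing the SLM‑based scoring stage with a rule‑driven, two‑phase pipeline that operates entirely on the retrieved sentences. The first phase builds the answer and clue sets; the second refines them through orthogonal retrieval and proximity filtering: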
- Answer Set Construction: Named Entity Recognition (NER) extracts entities (people, dates, locations, etc.) that are likely to appear in the final answer. Sentences containing these entities form an initial “answer set.”
- Clue Set Construction: A lightweight semantic matcher (e.g., cosine similarity over sentence embeddings) identifies sentences that are semantically close to the user query, creating a “clue set.”
- Orthogonal Residual Retrieval: CORE removes any overlap between the answer and clue sets, then retrieves additional sentences that are orthogonal—i.e., they provide complementary information rather than duplicate content.
- Spatial Proximity Filtering: Using a custom metric that measures positional distance between entities and clue sentences within the original documents, CORE discards sentences that are far apart, preserving only tightly coupled context.
The final compressed prompt is the union of the refined answer set and the filtered clue set, guaranteeing that the most answer‑relevant and query‑relevant sentences survive while redundant filler is eliminated—all without a secondary language model.
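The first phase can be sketched with the standard library alone. This is a minimal illustration, not the authors' implementation: `toy_entities` is a hypothetical regex stand-in for a real NER model, bag‑of‑words cosine similarity stands in for sentence embeddings, and `build_sets` and `top_n` are names chosen for this example. The orthogonal retrieval and proximity steps are not covered here.

```python
import math
import re
from collections import Counter


def toy_entities(sentence):
    """Stand-in for NER (assumption): non-initial capitalized words and 4-digit numbers."""
    words = re.findall(r"[A-Za-z]+|\d+", sentence)
    return {
        w for pos, w in enumerate(words)
        if (pos > 0 and w[0].isupper()) or (w.isdigit() and len(w) == 4)
    }


def bag_of_words(text):
    """Lowercased token counts; a crude substitute for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_sets(query, sentences, top_n=2):
    """Return (answer_set, clue_set) as sets of sentence indices."""
    # Answer set: sentences that contain at least one (toy) entity.
    answer_set = {i for i, s in enumerate(sentences) if toy_entities(s)}
    # Clue set: the top_n sentences most similar to the query.
    q_vec = bag_of_words(query)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(q_vec, bag_of_words(sentences[i])),
        reverse=True,
    )
    return answer_set, set(ranked[:top_n])


sentences = [
    "The tower was finished in 1889.",
    "It is made of wrought iron.",
    "Many visitors ask when the tower was completed.",
    "Tickets can be bought online.",
]
answer_set, clue_set = build_sets("When was the tower completed?", sentences)
# The compressed context is the union of both sets, kept in document order.
compressed = [sentences[i] for i in sorted(answer_set | clue_set)]
```

In a real deployment the two stand‑ins would be swapped for an actual NER model and an embedding model; the set logic around them would stay the same.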
How It Works in Practice
The practical workflow can be visualized as a pipeline that sits between the retrieval engine and the LLM inference module:
- Query Reception: An end‑user submits a question to the QA agent.
- Document Retrieval: A standard BM25 or dense vector retriever returns the top‑k documents (often 5–10) based on relevance.
- Sentence Segmentation: Each document is split into individual sentences, forming a candidate pool.
- CORE Stage 1 – Answer & Clue Sets:
  - Run a fast NER model (e.g., spaCy) on the query and candidate sentences to collect entities.
  - Mark sentences that contain any of those entities as part of the answer set.
  - Compute sentence embeddings (e.g., MiniLM) and measure cosine similarity to the query; top‑n similar sentences become the clue set.
- CORE Stage 2 – Orthogonal Retrieval & Proximity Filtering:
  - Subtract the intersection of answer and clue sets, then retrieve orthogonal sentences that add new entities or facts.
  - Apply the spatial proximity metric: for each candidate, calculate the distance between its position and the nearest answer‑set sentence within the same source document; discard those beyond a configurable threshold.
- Prompt Assembly: Concatenate the filtered answer and clue sentences, prepend the original user query, and enforce the 2 000‑token budget.
- LLM Generation: Feed the compressed prompt to the on‑device LLM (e.g., LLaMA‑2‑7B) for answer generation.
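The proximity filter and the budgeted prompt assembly from the steps above might look like the following sketch. Several details are assumptions made for illustration: sentence positions are plain indices within a document, distance is the absolute index difference (the article only says the metric is custom), tokens are counted by whitespace splitting, and `proximity_filter`, `assemble_prompt`, and `max_gap` are hypothetical names.

```python
def proximity_filter(clues, answers, max_gap=2):
    """Keep clue sentences that sit near an answer-set sentence in the same document.

    clues and answers are lists of (doc_id, sentence_index) pairs.
    Distance is the absolute index difference (an assumed metric).
    """
    anchors_by_doc = {}
    for doc_id, idx in answers:
        anchors_by_doc.setdefault(doc_id, []).append(idx)

    kept = []
    for doc_id, idx in clues:
        anchors = anchors_by_doc.get(doc_id)
        # Clues from documents with no answer-set sentence are dropped.
        if anchors and min(abs(idx - a) for a in anchors) <= max_gap:
            kept.append((doc_id, idx))
    return kept


def assemble_prompt(query, sentences, budget=2000):
    """Prepend the query, then append sentences in order while the budget allows.

    Token cost is approximated by whitespace word count (an assumption);
    a real system would count with the target LLM's tokenizer.
    """
    parts = [query]
    used = len(query.split())
    for sentence in sentences:
        cost = len(sentence.split())
        if used + cost > budget:
            break
        parts.append(sentence)
        used += cost
    return "\n".join(parts)


answers = [("d1", 3)]
clues = [("d1", 4), ("d1", 9), ("d2", 3)]
near = proximity_filter(clues, answers, max_gap=2)
prompt = assemble_prompt("q one", ["a b c", "d e f g"], budget=5)
```

Stopping at the first sentence that does not fit keeps the order intact; a variant could skip oversized sentences and keep trying shorter ones.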
What sets CORE apart is that its scoring needs no auxiliary language model. By leveraging deterministic NLP primitives (NER, embedding similarity, positional heuristics), the method runs in under 30 ms on a Jetson AGX Orin and consumes less than 150 MB of RAM, well within the constraints of most edge platforms.
Evaluation & Results
The authors benchmarked CORE on two representative edge devices: an NVIDIA Jetson AGX Orin (GPU‑accelerated) and a Huawei Nova smartphone (CPU‑only). They compared CORE against three baselines:
- Raw RAG (no compression)
- LLMLingua2 (state‑of‑the‑art SLM‑based compression)
- Simple truncation (first‑N‑tokens)
Key findings, expressed relative to LLMLingua2 as the reference point:
| Metric | CORE | LLMLingua2 | Raw RAG |
|---|---|---|---|
| Exact‑match accuracy (within 2 000‑token budget) | +30.19 % over LLMLingua2 | Baseline | −12.4 % (due to token overflow) |
| Peak memory usage | ≈ 48 % of LLMLingua2 | 100 % | 115 % |
| Inference latency | 1.94 × faster than LLMLingua2 | Baseline | Similar to LLMLingua2 |
| Energy consumption (smartphone) | 95.74 % lower than LLMLingua2 | Baseline | Higher due to longer inference time |
Beyond raw numbers, the experiments demonstrate that CORE preserves critical answer cues while discarding noise, leading to more reliable generation even when the token budget is tight. The energy savings are especially compelling for battery‑powered devices, where every milliwatt counts.
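Because the table mixes speedup factors and percentage reductions, it can help to convert both into a fraction of the LLMLingua2 baseline. The short calculation below does only that arithmetic on the figures quoted above; the function names are chosen for this example.

```python
def speedup_to_fraction(speedup):
    """A k-times speedup means the new latency is 1/k of the baseline."""
    return 1.0 / speedup


def reduction_to_fraction(percent_lower):
    """A p percent reduction leaves (100 - p) percent of the baseline."""
    return 1.0 - percent_lower / 100.0


latency_fraction = speedup_to_fraction(1.94)      # roughly 0.515 of baseline latency
energy_fraction = reduction_to_fraction(95.74)    # roughly 0.043 of baseline energy
memory_fraction = 0.48                            # stated directly in the table
```

Read this way, CORE takes about half the time and memory of LLMLingua2 and a little over 4 % of its energy on the smartphone.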
Why This Matters for AI Systems and Agents
For practitioners building AI agents that must run locally—think voice assistants, on‑device customer support bots, or industrial IoT diagnostics—CORE offers a pragmatic path to high‑quality QA without cloud dependence. The method aligns with three strategic priorities:
- Scalability on constrained hardware: By eliminating the need for an auxiliary SLM, developers can deploy on devices with as little as 2 GB RAM.
- Cost reduction: Fewer tokens mean lower API bills for hybrid cloud‑edge solutions, and the reduced compute translates to longer battery life.
- Privacy compliance: Keeping the entire pipeline on‑device ensures that user queries never leave the hardware, simplifying GDPR and CCPA adherence.
Integrating CORE into an existing workflow is straightforward. For example, a developer using the UBOS platform can plug the CORE module into the Workflow automation studio as a “Prompt Optimizer” step before the LLM node. This enables rapid prototyping of edge‑ready agents without rewriting retrieval logic.
What Comes Next
While CORE marks a significant advance, several avenues remain open for exploration:
- Adaptive thresholds: Current proximity and similarity cut‑offs are static; learning device‑specific thresholds could further improve the accuracy‑efficiency trade‑off.
- Multilingual extensions: The present implementation relies on English‑centric NER; extending to multilingual entity extractors would broaden applicability.
- Hybrid retrieval strategies: Combining sentence‑level retrieval with keyword‑level filters might capture rare facts that embeddings miss.
- Integration with voice pipelines: Pairing CORE with the ElevenLabs AI voice integration could enable low‑latency spoken QA on wearables.
Developers interested in experimenting with CORE can start by cloning the open‑source reference implementation (linked in the arXiv paper) and deploying it on the Enterprise AI platform by UBOS. The modular design encourages plug‑and‑play with existing retrieval stacks, making it a viable component for next‑generation edge AI agents.
