- Updated: January 30, 2026
- 7 min read
CHIME: Chiplet-based Heterogeneous Near-Memory Acceleration for Edge Multimodal LLM Inference
Direct Answer
The CHIME paper introduces a chiplet-based heterogeneous memory architecture that couples 3-D DRAM (M3D) with resistive RAM (RRAM) to enable near-memory execution of large-scale multimodal language models on edge devices. By moving compute close to the data and exploiting the complementary strengths of volatile and non-volatile memories, CHIME reports multi-fold gains in latency and energy efficiency (roughly 3x to 7x in the evaluated workloads), making high-performance LLM inference feasible on power-constrained platforms.
Background: Why This Problem Is Hard
Edge AI workloads—especially multimodal large language models (LLMs) that process text, images, and audio—are hitting a performance wall for three intertwined reasons:
- Memory bandwidth bottleneck: Modern LLMs must stream gigabytes of weight data for every generated token, which adds up to terabytes over a long inference session. Conventional edge SoCs funnel all of that traffic through a narrow off-chip memory interface, causing severe contention and throttling.
- Energy constraints: Battery‑operated devices cannot sustain the power draw of high‑throughput DRAM accesses combined with GPU‑class compute.
- Form‑factor limits: Adding more DRAM chips or larger GPUs quickly exceeds the thermal and area budgets of edge form factors such as autonomous drones, AR glasses, or industrial sensors.
Existing approaches try to mitigate these issues by:
- Compressing models (quantization, pruning) – which typically sacrifices accuracy.
- Offloading inference to the cloud – which introduces latency, privacy, and connectivity concerns.
- Using monolithic accelerators with on‑chip SRAM – whose capacity falls far short of modern LLM parameter footprints.
None of these strategies simultaneously address bandwidth, energy, and capacity at the edge, leaving a gap for a hardware‑software co‑design that can keep data close to compute while scaling memory density.
What the Researchers Propose
CHIME proposes a modular architecture that stitches together two complementary memory technologies:
- M3D DRAM chiplets: Stacked 3‑D DRAM provides high bandwidth and large capacity for bulk weight storage.
- RRAM chiplets: Resistive RAM offers ultra‑low‑energy, non‑volatile storage that can be programmed to act as compute primitives (e.g., vector‑dot products) directly inside the memory array.
The key insight is to treat the memory system as a heterogeneous compute fabric rather than a passive data reservoir. CHIME introduces a mapping framework that automatically partitions a given LLM graph into three zones:
- Hot kernels (e.g., attention matrix multiplications) are offloaded to RRAM where they execute as in‑memory analog operations.
- Cold kernels (e.g., control flow, token sampling) remain on a lightweight edge processor.
- Intermediate buffers reside in M3D DRAM, feeding data to the RRAM compute units with minimal latency.
This division lets CHIME exploit the high‑throughput path of DRAM for bulk movement while leveraging the energy‑proportional nature of RRAM for the most compute‑intensive kernels.
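The paper's mapping framework is not something we can reproduce here, but the zone assignment it describes can be sketched in a few lines. The Python below is a minimal illustration only: the op names, the arithmetic-intensity threshold, and the `classify` heuristic are assumptions made for this example, not the authors' implementation.

```python
# Minimal sketch of CHIME-style three-zone partitioning (illustrative, not the paper's code).
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    kind: str          # e.g. "matmul", "layernorm", "sample"
    flops: int         # compute work of the op
    bytes_moved: int   # weight + activation traffic

def classify(op: Op, hot_intensity: float = 50.0) -> str:
    """Assign each graph op to a zone: RRAM (hot), CPU (cold), or DRAM (buffers)."""
    if op.kind == "matmul" and op.flops / max(op.bytes_moved, 1) >= hot_intensity:
        return "RRAM"   # compute-heavy kernels become in-memory analog operations
    if op.kind in ("sample", "control"):
        return "CPU"    # control flow and token sampling stay on the edge processor
    return "DRAM"       # everything else is buffered in M3D DRAM

graph = [
    Op("attn_qk", "matmul", flops=2_000_000_000, bytes_moved=8_000_000),
    Op("token_sample", "sample", flops=10_000, bytes_moved=4_000),
    Op("layernorm_1", "layernorm", flops=1_000_000, bytes_moved=2_000_000),
]
print({op.name: classify(op) for op in graph})
# -> {'attn_qk': 'RRAM', 'token_sample': 'CPU', 'layernorm_1': 'DRAM'}
```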
How It Works in Practice
Conceptual Workflow
- Model Partitioning: The CHIME compiler analyzes the LLM’s dataflow graph and tags each operation with its memory‑access pattern and compute intensity.
- Chiplet Allocation: Tagged operations are mapped to either an M3D DRAM tile (for streaming reads/writes) or an RRAM tile (for in‑memory matrix‑vector products).
- Data Placement: Model weights are stored across the DRAM and RRAM chiplets according to the allocation map. Frequently accessed weight slices are duplicated in RRAM to avoid cross‑chip traffic.
- Execution Loop: During inference, the edge processor issues a high‑level schedule. DRAM streams input tokens to RRAM, RRAM performs analog dot‑products, and results are accumulated back in DRAM before being post‑processed by the processor.
- Dynamic Re‑mapping (optional): For workloads with shifting hot spots, CHIME can re‑program RRAM cells on‑the‑fly, effectively moving compute kernels without hardware redesign.
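To make the execution loop concrete, here is a schematic Python model of a single step: NumPy arrays stand in for weight tiles stored in RRAM, and an ordinary matrix-vector product stands in for the analog computation inside the array. The row-wise tiling and tile count are assumptions chosen for illustration.

```python
# Schematic model of one CHIME inference step (illustrative only, not the real runtime).
import numpy as np

def rram_analog_matvec(weight_tile: np.ndarray, activation: np.ndarray) -> np.ndarray:
    """Stand-in for an analog in-memory matrix-vector product on one RRAM tile."""
    return weight_tile @ activation   # the hardware performs this inside the memory array

def infer_step(weight_tiles, activation):
    """DRAM streams the activation to each RRAM tile; tiles return partial sums,
    which are gathered and handed back for post-processing on the edge processor."""
    partials = [rram_analog_matvec(tile, activation) for tile in weight_tiles]
    return np.concatenate(partials)

# Toy example: a 16x8 weight matrix split row-wise across two "RRAM tiles".
rng = np.random.default_rng(0)
weights = rng.standard_normal((16, 8)).astype(np.float32)
tiles = np.split(weights, 2, axis=0)
x = rng.standard_normal(8).astype(np.float32)
assert np.allclose(infer_step(tiles, x), weights @ x)
```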
Component Interaction
| Component | Role | Interaction Pattern |
|---|---|---|
| Edge Processor (CPU/Small GPU) | Orchestrates control flow, tokenization, and post‑processing. | Issues commands to DRAM controller; receives aggregated results. |
| M3D DRAM Chiplet | High‑capacity, high‑bandwidth buffer for model weights and intermediate activations. | Streams data to/from RRAM via a low‑latency interposer. |
| RRAM Chiplet | Performs analog in‑memory compute (vector‑dot, MAC) on stored weights. | Receives activation vectors from DRAM, returns partial sums. |
| CHIME Mapping Framework | Software stack that partitions the graph and programs the chiplets. | Generates micro‑code for RRAM, configures DRAM channels, and schedules tasks. |
What Sets CHIME Apart
- Chiplet Modularity: Designers can mix‑and‑match DRAM and RRAM dies from different vendors, scaling capacity without redesigning the entire SoC.
- Near‑Memory Execution: By executing the most expensive kernels inside RRAM, data movement is reduced by up to 90 % compared with traditional CPU‑GPU pipelines.
- Energy‑Proportional Compute: Analog RRAM operations consume picojoules per MAC, dramatically lowering the energy per token.
- Scalable Mapping: The framework automatically adapts to new model sizes, making CHIME future‑proof for upcoming multimodal LLMs.
Evaluation & Results
Testbed and Benchmarks
The authors evaluated CHIME on two representative edge workloads:
- Multimodal LLM inference: A 7B-parameter vision-language model processing image-caption pairs.
- Token-level language generation: A 13B-parameter decoder-only model generating English text.
Baseline platforms included:
- NVIDIA Jetson Orin NX (GPU‑centric edge accelerator).
- FACIL, a state‑of‑the‑art FPGA‑based inference engine.
Key Findings
- Latency Reduction: CHIME achieved a 4.8× lower end‑to‑end latency for image‑caption generation compared with Jetson Orin NX, and a 3.2× improvement over FACIL.
- Energy Efficiency: Measured energy per token dropped from 1.2 mJ (Jetson) to 0.18 mJ on CHIME—a 6.7× gain.
- Throughput Scaling: By adding additional RRAM chiplets, throughput scaled linearly up to 12 TOPS (tera‑operations per second) without exceeding a 5 W power envelope.
- Model Fidelity: Quantizing weights to 4 bits for RRAM storage kept the BLEU score loss under 1 % on the captioning task, confirming that analog compute does not compromise accuracy.
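As a quick sanity check, the arithmetic below uses only the figures quoted above: it reproduces the reported energy gain per token and derives the average energy per operation implied by sustaining 12 TOPS inside a 5 W envelope, which is consistent with the picojoule-per-MAC claim made earlier.

```python
# Back-of-envelope check of the reported figures (all inputs taken from the results above).
energy_jetson_mj, energy_chime_mj = 1.2, 0.18
print(f"Energy gain per token: {energy_jetson_mj / energy_chime_mj:.1f}x")       # ~6.7x

power_w, throughput_ops_per_s = 5.0, 12e12   # 5 W envelope at 12 TOPS
print(f"Implied energy per op: {power_w / throughput_ops_per_s * 1e12:.2f} pJ")  # ~0.42 pJ
```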
Why the Results Matter
These numbers demonstrate that a heterogeneous memory fabric can close the performance gap that has traditionally forced edge developers to either downsize models or rely on cloud inference. CHIME’s ability to keep large LLMs on‑device while staying within strict power budgets opens new product categories—real‑time translation glasses, autonomous inspection drones, and privacy‑preserving personal assistants.
Why This Matters for AI Systems and Agents
From a system‑builder’s perspective, CHIME reshapes three core assumptions about edge AI deployment:
- Memory is Compute: Treating memory as an active compute substrate reduces the need for separate accelerators, simplifying board layouts and lowering BOM costs.
- Dynamic Workload Placement: The mapping framework can re‑allocate hot kernels at runtime, enabling adaptive agents that shift processing between RRAM and CPU based on workload characteristics.
- Privacy‑First Inference: With full model residency on‑device, sensitive data never leaves the hardware perimeter, aligning with emerging regulations on data sovereignty.
Practically, developers can integrate CHIME into existing pipelines by swapping the GPU backend for the CHIME runtime library, preserving most of their software stack while gaining a roughly five-fold boost in tokens-per-second throughput.
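The paper does not describe a public runtime API, so the snippet below is purely hypothetical: the `ChimeRuntime` class and its methods are invented stand-ins meant only to show the shape of such a backend swap, not a real library.

```python
# Hypothetical sketch of a CHIME backend swap; class and method names are invented.
class ChimeRuntime:
    """Stand-in for a CHIME runtime library replacing a GPU backend."""

    def load_model(self, name: str, weight_bits: int = 4) -> dict:
        # A real runtime would place weights across M3D DRAM and RRAM chiplets here.
        return {"name": name, "weight_bits": weight_bits}

    def generate(self, model: dict, prompt_tokens: list, max_new_tokens: int = 32) -> list:
        # A real runtime would drive the near-memory execution loop; this stub just pads.
        return list(prompt_tokens) + [0] * max_new_tokens

runtime = ChimeRuntime()                                 # instead of, e.g., model.to("cuda")
model = runtime.load_model("vlm-7b", weight_bits=4)      # model name is illustrative
tokens = runtime.generate(model, prompt_tokens=[1, 2, 3])
print(len(tokens))   # 35
```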
For organizations building autonomous agents, the reduced latency translates directly into faster perception‑action loops, which is critical for safety‑critical domains such as robotics and vehicular AI.
Explore more about building edge AI pipelines on ubos.tech’s edge AI platform.
What Comes Next
While CHIME marks a significant step forward, several open challenges remain:
- Process Variability in RRAM: Analog compute is sensitive to device‑level variations; robust calibration techniques are needed for large‑scale production.
- Toolchain Maturity: The current mapping framework is prototype‑level; integrating with mainstream ML compilers (TVM, ONNX Runtime) will accelerate adoption.
- Security of In‑Memory Compute: Protecting RRAM‑resident weights from side‑channel attacks requires novel encryption‑aware compute primitives.
- Scalability to >100 B Parameters: Future multimodal models will dwarf current sizes; extending CHIME’s hierarchical chiplet interconnects will be essential.
Future research directions include:
- Co‑design of RRAM analog kernels for transformer attention patterns.
- Hybrid training pipelines that leverage CHIME’s near‑memory compute for on‑device fine‑tuning.
- Standardized APIs that expose heterogeneous memory as a first‑class resource to AI frameworks.
Potential applications span from real‑time robotics perception to low‑latency AR/VR assistants, where every millisecond counts.
Read the full CHIME paper on arXiv for a deeper technical dive.
Ready to experiment with heterogeneous memory acceleration? Visit our blog for tutorials, code samples, and hardware design guides.