- Updated: March 19, 2024
- 2 min read
Extreme Compression of Large Language Models via Additive Quantization
Introduction
In the rapidly advancing field of machine learning, large language models (LLMs) have become central. They power numerous applications, from chatbots to content generation, and their importance is only growing. However, their sheer size makes them difficult to deploy on end-user devices, which has spurred a race to develop effective quantization techniques for these models.
Summary of the Paper
A recent paper titled “Extreme Compression of Large Language Models via Additive Quantization” by Vage Egiazarian and his team addresses this challenge head-on. It revisits the problem of “extreme” LLM compression, defined as targeting very low bit counts such as 2 to 3 bits per parameter, from the perspective of classic methods in Multi-Codebook Quantization (MCQ).
Explanation of the Paper
The team’s work builds on Additive Quantization, a classic algorithm from the MCQ family, and adapts it to the quantization of language models. The resulting algorithm advances the state of the art in LLM compression, outperforming all recently proposed techniques in accuracy at a given compression budget. For instance, when compressing Llama 2 models to 2 bits per parameter, their method reaches WikiText2 perplexities of 6.93 for the 7B model, 5.70 for the 13B model, and 3.94 for the 70B model.
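To make the core idea concrete, here is a minimal toy sketch of multi-codebook (additive) quantization: each small group of weights is encoded as the sum of one codeword from each of several codebooks, so only the code indices need to be stored. The shapes, the random codebooks, and the greedy nearest-codeword encoder below are illustrative assumptions; the actual AQLM algorithm learns the codebooks and codes jointly against a calibration objective, which this sketch does not do.

```python
import numpy as np

rng = np.random.default_rng(0)

group_dim = 8        # weights are quantized in groups of 8 values (assumed)
num_groups = 1024    # number of weight groups in this toy layer
num_codebooks = 2    # M codebooks; one codeword is chosen from each
codebook_size = 256  # K = 256 codewords -> 8 bits per code index

weights = rng.normal(size=(num_groups, group_dim)).astype(np.float32)
codebooks = rng.normal(size=(num_codebooks, codebook_size, group_dim)).astype(np.float32)

# Greedy residual encoding: pick the closest codeword from each codebook in turn.
codes = np.zeros((num_groups, num_codebooks), dtype=np.int64)
residual = weights.copy()
for m in range(num_codebooks):
    # Squared distance from each residual group to every codeword in codebook m.
    dists = ((residual[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1)
    codes[:, m] = dists.argmin(axis=1)
    residual -= codebooks[m][codes[:, m]]

# Reconstruction: each weight group is the *sum* of its selected codewords.
reconstructed = sum(codebooks[m][codes[:, m]] for m in range(num_codebooks))

# Storage cost: num_codebooks * log2(codebook_size) bits per group of group_dim
# weights, i.e. 2 * 8 / 8 = 2 bits per parameter here (ignoring codebook overhead).
error = np.mean((weights - reconstructed) ** 2)
print(f"mean squared reconstruction error: {error:.4f}")
```

With 2 codebooks of 256 entries over groups of 8 weights, the index storage works out to 2 bits per parameter, which is exactly the budget the paper targets in its extreme-compression experiments.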
Conclusion
The team has released their implementation of Additive Quantization for Language Models (AQLM) as a baseline to facilitate future research in LLM quantization. This is a significant contribution to the community, as it provides a solid foundation for further work on LLM compression.
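AQLM-quantized checkpoints have since been integrated into the Hugging Face ecosystem, so they can be loaded through the standard transformers API. The sketch below shows what that might look like; the model identifier is an illustrative assumption (not taken from the paper), and running it is assumed to require the aqlm extension package alongside transformers.

```python
# Hedged sketch: loading an AQLM-quantized checkpoint via the standard
# Hugging Face `transformers` API. The model id below is an assumed,
# illustrative Hub name, not a detail from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the dtype stored in the checkpoint
    device_map="auto",    # place layers on the available GPU(s)/CPU
)

inputs = tokenizer("Additive quantization compresses LLMs by", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```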
For a detailed account of the methodology, see the full paper at arXiv:2401.06118v2.