A Latent Variable Approach to Quantization for Large Language Models
General Exam presented by Phillip H. Keung

Large language models (LLMs) have been widely adopted for natural language processing (NLP) tasks, and many commercially available LLMs are competitive with task-specific models in multilingual settings, even when those models were hand-tuned by human experts. However, deploying state-of-the-art LLMs is prohibitively expensive due to their hundreds of billions of parameters. Model compression techniques (e.g., quantization, weight pruning) have been developed to reduce hardware requirements and increase token throughput at inference time. In this talk, we present a new latent variable formulation for post-training quantization, which allows the joint learning of the quantization grid, the scaling constants, and the grid assignments for the elements of the parameter matrices. Prior methods fixed the values of the quantization grid and scaling constants in advance, which may limit how effectively they can compress models. We show that our approach outperforms common quantization techniques on publicly available LLMs, and we discuss planned experiments and possible extensions to efficient model adaptation.
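To make the idea concrete, the following is a minimal sketch (not the talk's actual method) of jointly fitting a shared quantization grid, per-row scaling constants, and hard grid assignments for a weight matrix. It treats the assignments as latent variables and alternates between assigning weights to grid points and refitting the grid and scales, in an EM / k-means style; the function name, the NumPy setup, and the per-row scaling scheme are illustrative assumptions.

```python
import numpy as np

def joint_quantize(W, K=4, iters=20, seed=0):
    """Sketch: jointly learn a shared K-point grid g, per-row scales s,
    and latent assignments a such that W is approximated by s * g[a]."""
    rng = np.random.default_rng(seed)
    n, d = W.shape
    s = np.abs(W).max(axis=1, keepdims=True) + 1e-8   # init per-row scales
    g = np.sort(rng.standard_normal(K))               # init shared grid values
    for _ in range(iters):
        # E-step: assign each normalized weight to its nearest grid point.
        Z = W / s                                     # (n, d)
        a = np.argmin(np.abs(Z[..., None] - g), axis=-1)
        G = g[a]                                      # quantized codes, (n, d)
        # M-step 1: refit each grid value as the mean of its assigned weights.
        for k in range(K):
            mask = (a == k)
            if mask.any():
                g[k] = Z[mask].mean()
        # M-step 2: refit per-row scales by least squares for W ~ s * G.
        num = (W * G).sum(axis=1, keepdims=True)
        den = (G * G).sum(axis=1, keepdims=True) + 1e-8
        s = num / den
    # Final assignment with the converged grid and scales.
    a = np.argmin(np.abs((W / s)[..., None] - g), axis=-1)
    return s * g[a], g, s, a

W = np.random.default_rng(1).standard_normal((8, 64))
W_hat, g, s, a = joint_quantize(W, K=4)
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
```

With K = 4 this corresponds to 2-bit codes plus per-row scales; fixing g to a uniform grid and learning only s recovers the kind of fixed-grid baseline the abstract contrasts against.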