Language Modeling Is Compression

Abstract Commentary & Rating

Prof. Otto Nomos · May 25, 2024 · 2 min read

Published on Sep 19

Authors: Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, Joel Veness

Abstract

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.


Commentary

The paper "Language Modeling Is Compression" revisits the well-established concept that predictive models can also be leveraged as lossless compressors and assesses how this idea can be applied to modern, large-scale language models.

Key Takeaways:

  1. Compression as Prediction: The paper advocates for using predictive models (like modern LLMs) as lossless compressors. Given their strong predictive capabilities, these models can compress a wide array of data types; a minimal sketch of this prediction-compression link follows this list.

  2. General-Purpose Compressors: The research indicates that large language models, even though trained primarily on text, can efficiently compress non-textual data. For instance, Chinchilla 70B compresses ImageNet image patches to 43.4% and LibriSpeech audio samples to 16.4% of their raw size, beating the domain-specific compressors PNG (58.5%) and FLAC (30.3%).

  3. Gaining Insights: Viewing prediction from a compression perspective can provide insights into various aspects of machine learning, such as scaling laws, tokenization, and in-context learning.

  4. Generative Models from Compressors: The prediction-compression equivalence also runs the other way: any compressor, even an off-the-shelf one like gzip, can be turned into a conditional generative model (see the second sketch below).
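
To make the first takeaway concrete, here is a minimal sketch of the prediction-compression link. A toy character-level bigram model stands in for an LLM, and the script simply reports the ideal code length (the model's negative log-likelihood in bits) that an arithmetic coder driven by the model would approach. All function names and the example string are illustrative, not from the paper.

```python
# Toy illustration of the prediction-compression equivalence: the ideal code
# length of a lossless compressor built on a predictive model equals the
# model's negative log-likelihood in bits. A character-level bigram model
# stands in for an LLM; a real system would feed its conditional
# probabilities into an arithmetic coder.
import math
from collections import Counter, defaultdict


def train_bigram(text, alphabet):
    """Return p(next | prev) with Laplace smoothing over the given alphabet."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1

    def prob(nxt, prev):
        total = sum(counts[prev].values())
        return (counts[prev][nxt] + 1) / (total + len(alphabet))

    return prob


def ideal_code_length_bits(text, prob):
    # Sum of -log2 p(x_i | x_{i-1}): the length an arithmetic coder driven by
    # this model would approach, up to a couple of bits of overhead.
    return sum(-math.log2(prob(nxt, prev)) for prev, nxt in zip(text, text[1:]))


data = "abracadabra " * 50
alphabet = sorted(set(data))
prob = train_bigram(data, alphabet)
bits = ideal_code_length_bits(data, prob)
print(f"model-based code length: {bits:.0f} bits vs. raw size: {8 * len(data)} bits")
```

The point of the sketch is the accounting: whatever lowers the model's log-loss lowers the achievable compressed size by the same number of bits, which is why a stronger predictor is automatically a stronger compressor.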
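
And a hedged sketch of the fourth takeaway, the reverse direction: a generic compressor (zlib's DEFLATE, standing in for gzip) used as a greedy conditional generator by picking the next character the compressor encodes most cheaply given the context. The paper builds a generative model from the distribution a compressor's code lengths induce; the greedy selection and all names below are simplifications for illustration.

```python
# Toy illustration of using an off-the-shelf compressor as a greedy
# conditional generator: score each candidate next character by the
# compressed length of (context + character) and pick the cheapest one.
import zlib


def next_char(context: str, candidates: str) -> str:
    def compressed_len(ch: str) -> int:
        # The context is fixed, so minimizing the absolute compressed length
        # is equivalent to minimizing the extra length the character adds.
        return len(zlib.compress((context + ch).encode("utf-8"), 9))

    return min(candidates, key=compressed_len)


context = "the quick brown fox jumps over the lazy dog. the quick brown "
alphabet = "abcdefghijklmnopqrstuvwxyz ."
generated = context
for _ in range(20):
    generated += next_char(generated, alphabet)
print(generated[len(context):])
```

Because zlib reports lengths in whole bytes, the scores are coarse and this toy generator mostly continues repeated patterns in the context; a model with finer-grained code lengths, such as an LLM under arithmetic coding, yields a much sharper conditional distribution.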

Potential Real-World Impact:

  • Data Storage and Transfer: If LLMs can be effectively used as compressors, they may revolutionize data storage and transmission, particularly for rich media like images and audio.

  • Beyond Text: Demonstrating that a text-trained model can compress non-textual data opens doors to multi-modal applications and shows the generalization capacity of modern LLMs.

  • Better Understanding of LLMs: The compression viewpoint can provide deeper insights into the functioning and potential applications of large language models.

  • Generative Applications: The ability to transform any compressor into a conditional generative model can have wide-ranging implications in data generation, synthesis, and augmentation tasks.

Challenges:

  • Computational Resources: Using large language models as compressors may be computationally expensive, making them less accessible for real-time applications or for users with limited resources.

  • Domain Expertise: For some specific domains, specialized compressors might still be preferred due to domain-specific constraints and requirements.

Given the potential for breakthroughs in data storage, transmission, and the broader understanding of LLMs:

I'd rate the real-world impact of this paper as a 9 out of 10.

The bridge between prediction and compression is not entirely new, but the paper's application to modern LLMs and the results it achieves are notable. If these findings can be efficiently implemented, it might pave the way for novel applications and a deeper understanding of language models.
