Compression via Pre-trained Transformers: A Study on Byte-Level Multimodal Data

Authors: David Heurtel-Depeiges, Anian Ruoss, Joel Veness, Tim Genewein

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a large-scale empirical study to find a sweet spot where pre-trained vanilla transformers can achieve competitive compression ratios. To this end, we train models on 165GB of raw byte sequences of either text, image, or audio data (and all possible combinations of the three) and then compress 1GB of out-of-distribution (OOD) data from each modality. We find that relatively small models (millions of parameters) can outperform standard general-purpose compression algorithms (gzip, LZMA2) and even domain-specific compressors (PNG, JPEG-XL, FLAC), even when accounting for parameter size. We achieve, e.g., the lowest compression ratio of 0.49 on OOD audio data (vs. 0.54 for FLAC). We conduct extensive ablations and hyperparameter sweeps to study the impact of model and dataset scale, and we investigate the effect of unimodal versus multimodal training.
Researcher Affiliation | Collaboration | David Heurtel-Depeiges *1, Anian Ruoss *2, Joel Veness 2, Tim Genewein 2. 1Chandar Research Lab, Mila (Quebec AI Institute), Polytechnique Montréal; 2Google DeepMind.
Pseudocode | No | The paper describes methods and procedures but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We source all of our data from the following open-source TensorFlow datasets (Pot et al., 2019): Text: Since most of TensorFlow's text datasets are quite small, we concatenate the following five datasets into a single collection of 165GB: (i) Wikipedia (Wikimedia, 2023), ...; (ii) PG-19 (Rae et al., 2020), ...; (iii) BigPatent (Sharma et al., 2019), ...; (iv) Scientific Papers (Cohan et al., 2018), ...; and (v) Natural Instructions (Mishra et al., 2022; Wang et al., 2022), .... Image: We collect a subset of 165GB of the ImageNet dataset (Russakovsky et al., 2015), .... Audio: We create a subset of 165GB from the Common Voice dataset (Ardila et al., 2020), .... For OOD evaluation: We use a 1GB subset of Reddit (Völske et al., 2017), ...; we create a 1GB subset of the CelebA-HQ dataset (Liu et al., 2015), ...; and we use 1GB from the LibriSpeech (Panayotov et al., 2015) dataset.
Dataset Splits | Yes | We always evaluate on 1GB of out-of-distribution data, i.e., |uncompressed data| = 1GB. Dataset sizes were 20%, 40%, 60%, 80%, and 100% of the full 165GB for each training set mixture (uni- and multimodal).
Hardware Specification | Yes | We trained every model on 16 NVIDIA A100 GPUs from our internal cluster. We ran Bellard's code on an NVIDIA GeForce RTX 4090 GPU with a 24-core Intel i9-13900KF CPU @ 3 GHz.
Software Dependencies | No | The paper mentions software like TensorFlow Datasets and the Adam optimizer, but it does not specify any version numbers for these or other software components.
Experiment Setup | Yes | We focus on decoder-only transformers (Vaswani et al., 2017) with SwiGLU activations (Shazeer, 2020) and post-layer normalization. Unless otherwise noted, we use 8 heads, an embedding dimension of 64, a context size of 4096 (bytes), and sliding windows without overlap or memory (full details in Appendix B.3). We use the Adam optimizer (Kingma & Ba, 2015) for 2.5 million steps with a batch size of 32... The learning rate was 1e-4, and a sinusoidal positional encoding was used. The number of layers was either 2, 4, 6, 8, or 10. For our sweep we used the same model parameters as in the previous paragraph (the training data size was always 100%) and swept over the following four context sizes (training batch size in brackets): 1024 (128), 2048 (64), 4096 (32), 8192 (16).
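The headline numbers above (e.g., 0.49 vs. 0.54 for FLAC on OOD audio) are compression ratios, i.e., compressed size divided by raw size, with lower being better. As a minimal sketch of how such a ratio is computed, and how a model's parameter count can be charged against it: the paper says it accounts "for parameter size" but the excerpt does not specify the parameter encoding, so the 2-bytes-per-parameter (float16) choice below is an assumption for illustration only.

```python
GB = 10**9  # the paper's sizes are given in GB of raw bytes


def compression_ratio(compressed_bytes: int, raw_bytes: int) -> float:
    """Plain compression ratio: compressed size / raw size (lower is better)."""
    return compressed_bytes / raw_bytes


def adjusted_compression_ratio(
    compressed_bytes: int,
    raw_bytes: int,
    n_params: int,
    bytes_per_param: int = 2,  # assumption: float16 storage, not from the paper
) -> float:
    """Ratio that also charges the stored model parameters against the output,
    so a huge model cannot 'hide' information in its weights for free."""
    model_bytes = n_params * bytes_per_param
    return (compressed_bytes + model_bytes) / raw_bytes


# Example: 0.49 GB of compressed output for 1 GB of raw audio -> ratio 0.49;
# a hypothetical 10M-parameter model adds 20 MB, nudging the ratio to 0.51.
print(compression_ratio(int(0.49 * GB), GB))
print(adjusted_compression_ratio(int(0.49 * GB), GB, n_params=10_000_000))
```

Note that for "millions of parameters" models, the parameter overhead on 1GB of data is only a few percent, which is why such small models can beat FLAC even under the adjusted accounting.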
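The setup row describes evaluation with "sliding windows without overlap or memory" over raw byte sequences at a context size of 4096. A minimal sketch of that chunking step, under the assumption that it simply partitions the byte stream into consecutive context-sized pieces (the function name `byte_windows` is ours, not from the paper):

```python
def byte_windows(data: bytes, context_size: int = 4096) -> list[bytes]:
    """Partition a raw byte sequence into non-overlapping context windows.

    Each window is compressed independently: no overlap between windows
    and no memory carried across window boundaries. The final window may
    be shorter than context_size if the data length is not a multiple.
    """
    return [data[i:i + context_size] for i in range(0, len(data), context_size)]


# Example: 10,000 bytes split at the default context size of 4096.
chunks = byte_windows(b"\x00" * 10_000)
print([len(c) for c in chunks])  # [4096, 4096, 1808]
```

With this scheme the model's effective receptive field is capped at one window, which is why the sweep over context sizes (1024 to 8192) trades window length against training batch size.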