The KoLMogorov Test: Compression by Code Generation
Authors: Ori Yoran, Kunhao Zheng, Fabian Gloeckle, Jonas Gehring, Gabriel Synnaeve, Taco Cohen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate current models, we use audio, text, and DNA data, as well as sequences produced by random synthetic programs. Current flagship models perform poorly: both GPT4-O and LLAMA-3.1-405B struggle on our natural and synthetic sequences. On our synthetic distribution, we are able to train code generation models with lower compression rates than previous approaches. |
| Researcher Affiliation | Collaboration | Ori Yoran¹,², Kunhao Zheng¹, Fabian Gloeckle¹, Jonas Gehring¹, Gabriel Synnaeve¹, Taco Cohen¹ — ¹Meta AI (FAIR), ²Tel Aviv University |
| Pseudocode | Yes | Alg. 2 presents a formal definition of the KOLMOGOROV-TEST. CODELMS that generate shorter programs have better compression rates (see §4 for our metrics and experimental settings). Algorithm 1: Encoding a program under the uniform prior. |
| Open Source Code | Yes | To support future progress, we release our code, data, and a public leaderboard. ... we will release training code for SEQCODER-8B models on our synthetic pairs, enabling reproduction of our results and simplify training new SEQCODER models. |
| Open Datasets | Yes | Our datasets will be fully released, including our DSL and synthetic data generation framework. We will also release larger 1GB variants to facilitate future research. ... Audio. We randomly sample audio snippets from the LibriSpeech development and test sets (Panayotov et al., 2015). ... Text. Following recent work (Deletang et al., 2024), we use the enwik9 Wikipedia corpus (Hutter, 2009). ... DNA. We use Genome assembly GRCh38, which contains 3.1GB of human DNA in FASTA format (NCBI, 2023). |
| Dataset Splits | Yes | As in the natural data domains, we generate 1MB of evaluation data. For training, we sample 1M pairs from the same distribution. ... We use a sequence length of 1024 bytes for our LMIC baseline ... For natural data, we use a sequence length of 128 bytes ... To better test if improvements on synthetic data generalize to real data, we evaluate SEQCODER models on 10,000 shorter Audio-MFCC sequences of lengths 16-64 ... Tab. 2 presents results for GZIP and LLAMA-3.1-405B on a random sample of 2,000 Audio-8-bit sequences of lengths 16-128. |
| Hardware Specification | No | The paper mentions model sizes like "LLAMA-3.1-405B" and "1.5-billion parameter LLM" but does not specify the particular GPU or CPU hardware used for training or inference. |
| Software Dependencies | No | The paper mentions software such as a "python implementation of GZIP", Librosa, and vLLM (for setting up inference servers), but it does not specify version numbers for any of these components. |
| Experiment Setup | Yes | We further train our models on 10K, 100K or 1M unique program-sequence pairs sampled from our data generator (§3.3)... We use a sequence length of 1024 bytes for our LMIC baseline... For natural data, we use a sequence length of 128 bytes... SEQCODER-1.5B and SEQCODER-8B models are trained for 20K and 10K steps, respectively... The number of initiators is uniformly sampled in the range {1, ..., 5}, and the length of each fixed sequence is sampled uniformly in the range {5, ..., 25}. The probability of applying any non-mathematical modifier (e.g., repeat or substitute) is set to 0.4, and the probability for each specific modifier is set to 0.1. |
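The Pseudocode row references Algorithm 1 ("Encoding a program under the uniform prior") and the compression-rate metric that CODELMS are scored on. A minimal sketch of how such a metric could be computed, assuming each program token is drawn uniformly from a fixed vocabulary so that every token costs log2(|V|) bits; the function names and the default vocabulary size are illustrative, not taken from the paper:

```python
import math

def uniform_prior_codelength(program: str, vocab_size: int) -> float:
    """Bits needed to encode `program` when every token is equally likely.

    Under a uniform prior over a vocabulary of `vocab_size` tokens, each
    of the len(program) tokens costs log2(vocab_size) bits.
    """
    return len(program) * math.log2(vocab_size)

def compression_rate(program: str, sequence: bytes, vocab_size: int = 256) -> float:
    """Compressed bits per uncompressed bit: lower is better.

    The raw sequence costs 8 bits per byte; the program that reproduces
    it is charged its uniform-prior codelength.
    """
    return uniform_prior_codelength(program, vocab_size) / (8 * len(sequence))
```

A program that is much shorter than the sequence it reproduces yields a rate well below 1.0, which is the sense in which "CODELMS that generate shorter programs have better compression rates."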
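The Software Dependencies row notes that the paper uses a Python implementation of GZIP as a baseline compressor (Tab. 2 compares GZIP against LLAMA-3.1-405B on Audio-8-bit sequences). A hedged sketch of how such a baseline rate could be measured with the standard-library `gzip` module; the exact settings used in the paper are not specified:

```python
import gzip

def gzip_compression_rate(data: bytes) -> float:
    """Ratio of gzip-compressed size to raw size (lower is better).

    Uses the Python standard library's gzip at maximum compression;
    the paper does not state which compression level it used.
    """
    compressed = gzip.compress(data, compresslevel=9)
    return len(compressed) / len(data)
```

Note that on very short sequences (e.g. the 16-128 byte Audio-8-bit samples mentioned above), gzip's fixed header overhead can push the rate close to or above 1.0, which is one reason short-sequence evaluation is a meaningful stress test for compressors.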