The KoLMogorov Test: Compression by Code Generation
Authors: Ori Yoran, Kunhao Zheng, Fabian Gloeckle, Jonas Gehring, Gabriel Synnaeve, Taco Cohen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate current models, we use audio, text, and DNA data, as well as sequences produced by random synthetic programs. Current flagship models perform poorly: both GPT4-O and LLAMA-3.1-405B struggle on our natural and synthetic sequences. On our synthetic distribution, we are able to train code generation models with lower compression rates than previous approaches. |
| Researcher Affiliation | Collaboration | Ori Yoran¹,², Kunhao Zheng¹, Fabian Gloeckle¹, Jonas Gehring¹, Gabriel Synnaeve¹, Taco Cohen¹ — ¹Meta AI (FAIR), ²Tel Aviv University |
| Pseudocode | Yes | Alg. 2 presents a formal definition of the KOLMOGOROV-TEST. CODELMS that generate shorter programs have better compression rates (see §4 for our metrics and experimental settings). Algorithm 1: Encoding a program under the uniform prior. |
| Open Source Code | Yes | To support future progress, we release our code, data, and a public leaderboard. ... we will release training code for SEQCODER-8B models on our synthetic pairs, enabling reproduction of our results and simplify training new SEQCODER models. |
| Open Datasets | Yes | Our datasets will be fully released, including our DSL and synthetic data generation framework. We will also release larger 1GB variants to facilitate future research. ... Audio. We randomly sample audio snippets from the LibriSpeech development and test sets (Panayotov et al., 2015). ... Text. Following recent work (Deletang et al., 2024), we use the enwik9 Wikipedia corpus (Hutter, 2009). ... DNA. We use Genome assembly GRCh38, which contains 3.1GB of human DNA in FASTA format (NCBI, 2023). |
| Dataset Splits | Yes | As in the natural data domains, we generate 1MB of evaluation data. For training, we sample 1M pairs from the same distribution. ... We use a sequence length of 1024 bytes for our LMIC baseline ... For natural data, we use a sequence length of 128 bytes ... To better test if improvements on synthetic data generalize to real data, we evaluate SEQCODER models on 10,000 shorter Audio-MFCC sequences of lengths 16-64 ... Tab. 2 presents results for GZIP and LLAMA-3.1-405B on a random sample of 2,000 Audio-8-bit sequences of lengths 16-128. |
| Hardware Specification | No | The paper mentions model sizes like "LLAMA-3.1-405B" and "1.5-billion parameter LLM" but does not specify the particular GPU or CPU hardware used for training or inference. |
| Software Dependencies | No | The paper mentions software such as a "python implementation of GZIP", Librosa, and vLLM (for setting up inference servers), but it does not specify version numbers for any of these components. |
| Experiment Setup | Yes | We further train our models on 10K, 100K or 1M unique program-sequence pairs sampled from our data generator (§3.3)... We use a sequence length of 1024 bytes for our LMIC baseline... For natural data, we use a sequence length of 128 bytes... SEQCODER-1.5B and SEQCODER-8B models are trained for 20K and 10K steps, respectively... The number of initiators is uniformly sampled in the range {1, ..., 5}, and the length of each fixed sequence is sampled uniformly in the range {5, ..., 25}. The probability of applying any non-mathematical modifier (e.g., repeat or substitute) is set to 0.4, and the probability for each specific modifier is set to 0.1. |
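The Pseudocode row references Algorithm 1 ("Encoding a program under the uniform prior") and the compression-rate metric that CODELMS are scored on. A minimal sketch of how such a metric could be computed, assuming each program token is drawn uniformly from a fixed vocabulary so that every token costs log2(|V|) bits; the function names and the default vocabulary size are illustrative, not taken from the paper:

```python
import math

def uniform_prior_codelength(program: str, vocab_size: int) -> float:
    """Bits needed to encode `program` when every token is equally likely.

    Under a uniform prior over a vocabulary of `vocab_size` tokens, each
    of the len(program) tokens costs log2(vocab_size) bits.
    """
    return len(program) * math.log2(vocab_size)

def compression_rate(program: str, sequence: bytes, vocab_size: int = 256) -> float:
    """Compressed bits per uncompressed bit: lower is better.

    The raw sequence costs 8 bits per byte; the program that reproduces
    it is charged its uniform-prior codelength.
    """
    return uniform_prior_codelength(program, vocab_size) / (8 * len(sequence))
```

A program that is much shorter than the sequence it reproduces yields a rate well below 1.0, which is the sense in which "CODELMS that generate shorter programs have better compression rates."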
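The Software Dependencies row notes that the paper uses a Python implementation of GZIP as a baseline compressor (Tab. 2 compares GZIP against LLAMA-3.1-405B on Audio-8-bit sequences). A hedged sketch of how such a baseline rate could be measured with the standard-library `gzip` module; the exact settings used in the paper are not specified:

```python
import gzip

def gzip_compression_rate(data: bytes) -> float:
    """Ratio of gzip-compressed size to raw size (lower is better).

    Uses the Python standard library's gzip at maximum compression;
    the paper does not state which compression level it used.
    """
    compressed = gzip.compress(data, compresslevel=9)
    return len(compressed) / len(data)
```

Note that on very short sequences (e.g. the 16-128 byte Audio-8-bit samples mentioned above), gzip's fixed header overhead can push the rate close to or above 1.0, which is one reason short-sequence evaluation is a meaningful stress test for compressors.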