EvoPress: Accurate Dynamic Model Compression via Evolutionary Search
Authors: Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Via EvoPress, we achieve state-of-the-art performance for dynamic compression of Llama, Mistral, and Phi models, setting new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths. Our code is available at https://github.com/IST-DASLab/EvoPress. |
| Researcher Affiliation | Collaboration | 1ETH Zürich 2Yandex Research 3IST Austria 4Red Hat AI. Correspondence to: Dan Alistarh <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: EvoPress: A (1 + λ)-Evolutionary Algorithm with Level-Switch Mutation and Multi-Step Selection for Maximizing f : [m]^n → ℝ. |
| Open Source Code | Yes | Our code is available at https://github.com/IST-DASLab/EvoPress. |
| Open Datasets | Yes | We follow a standard evaluation protocol (Frantar et al., 2022), measuring perplexity on the WikiText-2 (Merity et al., 2016) and C4 (Raffel et al., 2019) datasets for language performance and accuracy on zero-shot evaluations on standard benchmarks: WinoGrande (Sakaguchi et al., 2021), PiQA (Tata & Patel, 2003), HellaSwag (Zellers et al., 2019), ARC-easy and ARC-challenge (Clark et al., 2018) via the LM Eval Harness (Gao et al., 2021). For this purpose, we use Fineweb-Edu (Penedo et al., 2024) as a source of clean and diverse calibration data. |
| Dataset Splits | Yes | We follow a standard evaluation protocol (Frantar et al., 2022), measuring perplexity on the WikiText-2 (Merity et al., 2016) and C4 (Raffel et al., 2019) datasets for language performance and accuracy on zero-shot evaluations on standard benchmarks: WinoGrande (Sakaguchi et al., 2021), PiQA (Tata & Patel, 2003), HellaSwag (Zellers et al., 2019), ARC-easy and ARC-challenge (Clark et al., 2018) via the LM Eval Harness (Gao et al., 2021). Following Egiazarian et al. (2024), we fix the total number of calibration tokens to 8 million (8M). |
| Hardware Specification | Yes | The full version of EvoPress, applied at high compression granularity, will converge in a few hours on a single RTX 3090 GPU, and we also present a lightweight version which utilizes fewer samples and converges in 1 hour in the same setting, on an 8B-parameter model. We report the runtime on a single NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using specific methodologies like "SparseGPT (Frantar & Alistarh, 2023)" and "GPTQ (Frantar et al., 2022)", but does not provide specific version numbers for any software libraries or tools. |
| Experiment Setup | Yes | Here, we provide an overview of the hyperparameters used in our experiments. As shown in Table 8, we employed different choices for the number of tokens, offspring, and generations for different applications to account for the size of the respective search space. For example, for Unstructured Sparsity, it specifies 400 generations and 64 offspring, with a three-stage selection schedule: 8 survivors evaluated on 2048 tokens, then 2 survivors on 16384 tokens, then 1 survivor on 65536 tokens. |
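The Algorithm 1 and Experiment Setup rows together describe a (1 + λ)-evolutionary loop with level-switch mutation and multi-step selection. The sketch below is an illustrative reconstruction, not the authors' implementation: the toy fitness function, the function and variable names (`level_switch_mutation`, `multistep_select`, `evopress`), and the schedule values are assumptions made for demonstration. Level-switch mutation moves one unit of compression level from one position to another, so the total budget is preserved; multi-step selection re-ranks a shrinking pool of survivors under progressively larger evaluation budgets.

```python
import random


def level_switch_mutation(parent, num_levels, rng):
    """Move one compression level from position j to position i,
    keeping the total (and hence the compression budget) fixed."""
    child = list(parent)
    up = [i for i, lvl in enumerate(child) if lvl < num_levels - 1]
    if not up:
        return child
    i = rng.choice(up)
    down = [j for j, lvl in enumerate(child) if lvl > 0 and j != i]
    if not down:
        return child
    j = rng.choice(down)
    child[i] += 1
    child[j] -= 1
    return child


def multistep_select(candidates, fitness, schedule):
    """Multi-step selection: at each (survivors, tokens) stage, keep only
    the top candidates, re-evaluated with a larger token budget."""
    pool = candidates
    for num_survivors, tokens in schedule:
        pool = sorted(pool, key=lambda c: fitness(c, tokens), reverse=True)
        pool = pool[:num_survivors]
    return pool[0]


def evopress(fitness, n, num_levels, generations, offspring, schedule, seed=0):
    """(1 + λ)-EA: the parent competes with its λ mutated offspring,
    so fitness is monotonically non-decreasing across generations."""
    rng = random.Random(seed)
    parent = [num_levels // 2] * n  # uniform level assignment as the start
    for _ in range(generations):
        kids = [level_switch_mutation(parent, num_levels, rng)
                for _ in range(offspring)]
        parent = multistep_select(kids + [parent], fitness, schedule)
    return parent
```

With a toy fitness that rewards concentrating levels on a subset of positions (ignoring the token budget), the loop redistributes the fixed total toward those positions while the overall sum stays constant, mirroring how EvoPress searches per-layer compression levels under a global budget.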