EvoPress: Accurate Dynamic Model Compression via Evolutionary Search
Authors: Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Via EvoPress, we achieve state-of-the-art performance for dynamic compression of Llama, Mistral, and Phi models, setting new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths. Our code is available at https://github.com/IST-DASLab/EvoPress. |
| Researcher Affiliation | Collaboration | 1ETH Zürich 2Yandex Research 3IST Austria 4Red Hat AI. Correspondence to: Dan Alistarh <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: EvoPress: A (1 + λ)-Evolutionary Algorithm with Level-Switch Mutation and Multi-Step Selection for Maximizing f : [m]^n → ℝ. |
| Open Source Code | Yes | Our code is available at https://github.com/IST-DASLab/EvoPress. |
| Open Datasets | Yes | We follow a standard evaluation protocol (Frantar et al., 2022), measuring perplexity on the WikiText-2 (Merity et al., 2016) and C4 (Raffel et al., 2019) datasets for language performance and accuracy on zero-shot evaluations on standard benchmarks: WinoGrande (Sakaguchi et al., 2021), PiQA (Tata & Patel, 2003), HellaSwag (Zellers et al., 2019), ARC-easy and ARC-challenge (Clark et al., 2018) via the LM Eval Harness (Gao et al., 2021). For this purpose, we use Fineweb-Edu (Penedo et al., 2024) as a source of clean and diverse calibration data. |
| Dataset Splits | Yes | We follow a standard evaluation protocol (Frantar et al., 2022), measuring perplexity on the WikiText-2 (Merity et al., 2016) and C4 (Raffel et al., 2019) datasets for language performance and accuracy on zero-shot evaluations on standard benchmarks: WinoGrande (Sakaguchi et al., 2021), PiQA (Tata & Patel, 2003), HellaSwag (Zellers et al., 2019), ARC-easy and ARC-challenge (Clark et al., 2018) via the LM Eval Harness (Gao et al., 2021). Following Egiazarian et al. (2024), we fix the total number of calibration tokens to 8 million (8M). |
| Hardware Specification | Yes | The full version of EvoPress, applied at high compression granularity, will converge in a few hours on a single RTX 3090 GPU, and we also present a lightweight version which utilizes fewer samples and converges in 1 hour in the same setting, on an 8B-parameter model. We report the runtime on a single NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using specific methodologies like "SparseGPT (Frantar & Alistarh, 2023)" and "GPTQ (Frantar et al., 2022)", but does not provide specific version numbers for any software libraries or tools. |
| Experiment Setup | Yes | Here, we provide an overview of the hyperparameters used in our experiments. As shown in Table 8, we employed different choices for the number of tokens, offspring, and generations for different applications to account for the size of the respective search space. For example, for Unstructured Sparsity, it specifies 400 generations and 64 offspring, with a three-stage selection schedule: 8 survivors evaluated on 2048 tokens, then 2 survivors on 16384 tokens, then 1 survivor on 65536 tokens. |
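The Algorithm 1 and Experiment Setup rows together describe a (1 + λ)-evolutionary loop with level-switch mutation and multi-step selection. The sketch below is an illustrative reconstruction, not the authors' implementation: the toy fitness function, the function and variable names (`level_switch_mutation`, `multistep_select`, `evopress`), and the schedule values are assumptions made for demonstration. Level-switch mutation moves one unit of compression level from one position to another, so the total budget is preserved; multi-step selection re-ranks a shrinking pool of survivors under progressively larger evaluation budgets.

```python
import random


def level_switch_mutation(parent, num_levels, rng):
    """Move one compression level from position j to position i,
    keeping the total (and hence the compression budget) fixed."""
    child = list(parent)
    up = [i for i, lvl in enumerate(child) if lvl < num_levels - 1]
    if not up:
        return child
    i = rng.choice(up)
    down = [j for j, lvl in enumerate(child) if lvl > 0 and j != i]
    if not down:
        return child
    j = rng.choice(down)
    child[i] += 1
    child[j] -= 1
    return child


def multistep_select(candidates, fitness, schedule):
    """Multi-step selection: at each (survivors, tokens) stage, keep only
    the top candidates, re-evaluated with a larger token budget."""
    pool = candidates
    for num_survivors, tokens in schedule:
        pool = sorted(pool, key=lambda c: fitness(c, tokens), reverse=True)
        pool = pool[:num_survivors]
    return pool[0]


def evopress(fitness, n, num_levels, generations, offspring, schedule, seed=0):
    """(1 + λ)-EA: the parent competes with its λ mutated offspring,
    so fitness is monotonically non-decreasing across generations."""
    rng = random.Random(seed)
    parent = [num_levels // 2] * n  # uniform level assignment as the start
    for _ in range(generations):
        kids = [level_switch_mutation(parent, num_levels, rng)
                for _ in range(offspring)]
        parent = multistep_select(kids + [parent], fitness, schedule)
    return parent
```

With a toy fitness that rewards concentrating levels on a subset of positions (ignoring the token budget), the loop redistributes the fixed total toward those positions while the overall sum stays constant, mirroring how EvoPress searches per-layer compression levels under a global budget.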