MERGE³: Efficient Evolutionary Merging on Consumer-grade GPUs

Authors: Tommaso Mencattini, Robert Adrian Minut, Donato Crisostomi, Andrea Santilli, Emanuele Rodolà

ICML 2025

Reproducibility assessment — each item below lists the variable checked, the result, and the supporting excerpt (LLM response):
Research Type: Experimental. "Experimental results show that MERGE3 effectively transfers mathematical skills by merging a strong math model with three language-specific models, achieving 10–20% higher accuracy than standard merging baselines in each language. Building on this, we evolve a single multilingual model by merging Italian, English, German, and Dutch models, outperforming individually fine-tuned models by up to 19% on ARC (Clark et al., 2018), a widely used benchmark for reasoning. Furthermore, MERGE3 achieves competitive accuracy on Japanese GSM8K (Cobbe et al., 2021), matching models evolved on full datasets while maintaining high efficiency, demonstrating that our evolutionary strategy preserves performance while drastically reducing computational costs." (Section 4: Experiments)
Researcher Affiliation: Academia. "1) School of Computer and Communication Science, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; 2) Department of Computer Science, Sapienza University of Rome, Rome, Italy. Correspondence to: Tommaso Mencattini <EMAIL>."
Pseudocode: Yes. "Our method MERGE3 speeds up evolutionary model merging by reducing the computational cost of fitness evaluation. It achieves this by shrinking the fitness evaluation dataset and using IRT-based performance estimators to maintain full-dataset accuracy from subset evaluations. Figure 2 shows an overview of our method, while we present below the pseudo-code for the end-to-end MERGE3 algorithm. Algorithm 1: The full MERGE3 algorithm."
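The description quoted above (evolve merge weights, score each candidate on a small fitness subset, map the subset score to a full-dataset estimate) can be sketched as a toy loop. Everything here is illustrative: the function names, the tournament-selection/blend-crossover/Gaussian-mutation stand-ins for the paper's genetic operators, and the `estimate_full_score` callback (standing in for the IRT-based estimator) are assumptions, not the paper's implementation.

```python
import random

def merge3_evolve(base_models, fitness_subset, estimate_full_score,
                  pop_size=25, n_gen=7, seed=0):
    """Toy sketch of an evolutionary merging loop in the style of MERGE3:
    evolve merge-weight vectors, scoring each candidate on a small fitness
    subset and mapping the subset score to a full-dataset estimate
    (the paper uses IRT-based estimators for this last step)."""
    rng = random.Random(seed)
    n = len(base_models)  # one merge weight per base model
    # Initial population of merge-weight vectors in [0, 1).
    pop = [[rng.random() for _ in range(n)] for _ in range(pop_size)]

    def fitness(w):
        total = sum(w)
        w = [x / total for x in w]  # normalize weights to sum to 1
        # Average score over the small fitness subset only.
        subset_score = sum(item(w) for item in fitness_subset) / len(fitness_subset)
        # Map subset score to an estimated full-dataset score.
        return estimate_full_score(subset_score)

    best = max(pop, key=fitness)
    for _ in range(n_gen):
        # Elitism: carry the best candidate over unchanged.
        new_pop = [best]
        while len(new_pop) < pop_size:
            # Tournament selection of two parents, then blend crossover
            # plus a small Gaussian mutation (simplified stand-ins for
            # the SBX / polynomial-mutation operators used in the paper).
            p1, p2 = (max(rng.sample(pop, 3), key=fitness) for _ in range(2))
            child = [(a + b) / 2 + rng.gauss(0, 0.05) for a, b in zip(p1, p2)]
            new_pop.append([min(1.0, max(0.0, x)) for x in child])
        pop = new_pop
        best = max(pop, key=fitness)
    return best
```

A usage example with a toy "benchmark" whose best merge puts 70% of the weight on the first model: `merge3_evolve(["model_a", "model_b"], [lambda w: 1.0 - abs(w[0] - 0.7)], lambda s: s)`.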
Open Source Code: Yes. "We provide theoretical guarantees and an open-source library, democratizing high-quality model merging." (github.com/tommasomncttn/merge3) "Each experiment was run using a library developed specifically for this paper, which will be released as open-source software, called Mergenetic (Minut et al., 2025)."
Open Datasets: Yes. "We first validate the proposed ability and performance estimators, assessing their accuracy in approximating full-dataset evaluations by comparing them against standard P-IRT and GP-IRT estimators (Polo et al., 2024) across five benchmark datasets: GSM8K (Cobbe et al., 2021), Winogrande (Sakaguchi et al., 2021), TruthfulQA (Lin et al., 2022), HellaSwag (Zellers et al., 2019), and ARC (Clark et al., 2018)."
Dataset Splits: Yes. "The fitness dataset was extracted from the test set of GSM8K, and we used the remaining, non-overlapping samples as the test set for evaluating the model. To get the language-specific versions of GSM8K, we used Unbabel/TowerInstruct-7B-v0.2 (Alves et al., 2024) to translate the datasets. In each experiment, the population size was fixed to 25 and the number of iterations to 7. ... We deployed four different sizes of the fitness dataset for Japanese, namely 20, 30, 50, and 100, in order to obtain a more detailed analysis of the method for comparison with the work of Akiba et al. (2025). On the other hand, we kept the fitness dataset size fixed to 20 for all other aforementioned experiments."
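The split described above (a small fitness subset carved from a benchmark's test set, with the non-overlapping remainder held out for final evaluation) can be sketched as follows. The function name and default subset size are illustrative, not taken from the paper's codebase.

```python
import random

def split_fitness_set(test_set, fitness_size=20, seed=0):
    """Carve a small fitness-evaluation subset out of a benchmark test set.

    The remaining, non-overlapping samples are kept for the final
    evaluation of the merged model, mirroring the GSM8K split described
    in the paper.
    """
    rng = random.Random(seed)
    indices = list(range(len(test_set)))
    rng.shuffle(indices)
    fitness_idx = set(indices[:fitness_size])
    # Fitness subset: used repeatedly inside the evolutionary loop.
    fitness_set = [test_set[i] for i in sorted(fitness_idx)]
    # Held-out remainder: used once, to evaluate the final merged model.
    eval_set = [test_set[i] for i in range(len(test_set)) if i not in fitness_idx]
    return fitness_set, eval_set
```

Because the two lists are built from disjoint index sets, no sample used to drive the search leaks into the final evaluation.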
Hardware Specification: Yes. "In this paper, we address this challenge by introducing MERGE3, an evolutionary merging framework that runs on a single consumer GPU with competitive results (see fig. 1). Unlike the competing approach, MERGE3 operates with just 0.077 × 10^6 TFLOPs, namely a 50-fold reduction. This drastic decrease in computational cost makes it feasible on consumer hardware, freeing up FLOPs for further optimization or additional tasks. All the merging experiments were performed with our custom-made library Mergenetic (see Appendix A) on an RTX 4090 GPU featuring 24 GB of VRAM, while employing a batch size of 8, 4-bit quantization, and models comprising 7 billion parameters (see Appendix B). ... To compare the efficiency of different model evaluation strategies, we measured the time required to evolve merged LLMs using a single NVIDIA RTX 4090 with 24 GB of VRAM, and report the throughput R in table 9. We also benchmark evaluation and merging times across three GPU models (RTX 3090, RTX 4090, V100) to illustrate practical runtimes for MERGE3 on both modern and older hardware. We report the results in table 10."
Software Dependencies: No. "The implementation relies on Mergekit (Goddard et al., 2024) for merging the models, Pymoo (Blank & Deb, 2020) for optimizing the objective function through evolutionary algorithms, and Lm-Evaluation-Harness (Gao et al., 2024) for implementing some of the fitness functions. We used the implementation from Polo et al. (2024) and adopted their configuration settings. Specifically, we used γ_m ∼ N(μ_γ 1_d, (1/u_γ) I_d), α_i ∼ N(μ_α 1_d, (1/u_α) I_d), and β_i ∼ N(μ_β, 1/u_β). Following Polo et al. (2024), we also applied (hyper)priors to the prior parameters using the software for fitting hierarchical Bayesian models (Lalor & Rodriguez, 2023)."
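For concreteness, the Gaussian priors quoted above can be sampled as below, reading each 1/u as an isotropic per-coordinate variance (covariance (1/u)·I_d). This is a minimal sketch: the function name and all hyperparameter defaults are placeholders, not the configuration of Polo et al. (2024).

```python
import math
import random

def sample_irt_priors(d=15, mu_gamma=0.0, u_gamma=1.0,
                      mu_alpha=0.0, u_alpha=1.0,
                      mu_beta=0.0, u_beta=1.0, seed=0):
    """Draw one sample from the Gaussian priors described above.

    gamma_m ~ N(mu_gamma * 1_d, (1/u_gamma) * I_d)  -- d-dim ability vector
    alpha_i ~ N(mu_alpha * 1_d, (1/u_alpha) * I_d)  -- d-dim discrimination
    beta_i  ~ N(mu_beta, 1/u_beta)                  -- scalar difficulty

    Isotropic covariance means each coordinate is an independent draw
    with standard deviation sqrt(1/u).
    """
    rng = random.Random(seed)
    gamma = [rng.gauss(mu_gamma, math.sqrt(1.0 / u_gamma)) for _ in range(d)]
    alpha = [rng.gauss(mu_alpha, math.sqrt(1.0 / u_alpha)) for _ in range(d)]
    beta = rng.gauss(mu_beta, math.sqrt(1.0 / u_beta))
    return gamma, alpha, beta
```

The default `d=15` echoes the γ model dimensionality reported in the experiment setup; in the paper the hyperpriors themselves are fit with hierarchical Bayesian software rather than fixed by hand.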
Experiment Setup: Yes. "All the merging experiments were performed with our custom-made library Mergenetic (see Appendix A) on an RTX 4090 GPU featuring 24 GB of VRAM, while employing a batch size of 8, 4-bit quantization, and models comprising 7 billion parameters (see Appendix B). In each of these experiments, we deployed an ad-hoc genetic algorithm for single-objective optimization. We employed the Simulated Binary Crossover (SBX; Deb et al., 2007) operator to generate offspring solutions by combining parent solutions. To maintain diversity and explore the search space, we applied Polynomial Mutation (PM; Deb et al., 2007), which introduces small perturbations to offspring solutions and enhances the algorithm's ability to escape local optima. This combination of SBX and PM effectively balances exploration and exploitation, facilitating efficient convergence toward optimal solutions. ... We deployed four different sizes of the fitness dataset for Japanese, namely 20, 30, 50, and 100. ... The population size was fixed to 25 and the number of iterations to 7. ... The γ model dimensionality is set to 15, following the parameter choice suggested by Polo et al. (2024). ... In the experiments reported in the main paper (section 4.1) and above (appendix C.2.2), we used as a heuristic c = 1/2."
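The two operators named above have standard textbook forms; the following is a minimal pure-Python sketch of them on single real-valued genes, not the Pymoo-backed implementation the paper actually uses. The distribution indices `eta` are illustrative defaults.

```python
import random

def sbx_crossover(p1, p2, eta=15.0, rng=random):
    """Simulated Binary Crossover on two real-valued genes.

    Children are spread around the parents with a polynomial
    distribution controlled by eta (larger eta keeps children closer
    to the parents); the children's mean equals the parents' mean.
    """
    u = rng.random()
    if u <= 0.5:
        beta = (2.0 * u) ** (1.0 / (eta + 1.0))
    else:
        beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
    c1 = 0.5 * ((1.0 + beta) * p1 + (1.0 - beta) * p2)
    c2 = 0.5 * ((1.0 - beta) * p1 + (1.0 + beta) * p2)
    return c1, c2

def polynomial_mutation(x, lower=0.0, upper=1.0, eta=20.0, rng=random):
    """Polynomial Mutation: a small, bounded perturbation of x.

    The perturbation size follows a polynomial distribution controlled
    by eta, and the result is clamped to [lower, upper].
    """
    u = rng.random()
    if u < 0.5:
        delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0
    else:
        delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))
    return min(upper, max(lower, x + delta * (upper - lower)))
```

Note the balance the excerpt describes: SBX recombines parent information (exploitation, mean-preserving), while PM injects small bounded perturbations (exploration).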