MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation

Authors: Lu Li, Tianyu Zhang, Zhiqi Bu, Suyuchen Wang, Huan He, Jie Fu, Yonghui Wu, Jiang Bian, Yong Chen, Yoshua Bengio

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on vision and natural language processing tasks demonstrate that MAP can accurately identify the Pareto front, providing practitioners with flexible solutions to balance competing task objectives. We validate our method across a diverse set of tasks, spanning from vision to natural language processing, and demonstrate its applicability to a variety of architectures, including ResNets (He et al., 2016), ViT (Dosovitskiy et al., 2020), and large language models (Brown et al., 2020; Rozière et al., 2023; Touvron et al., 2023; Jiang et al., 2024). Our results confirm that this novel approach supports the seamless integration of diverse model capabilities and aligns more closely with various real-world preferences by providing a set of optimal fronts across the tasks.
Researcher Affiliation | Collaboration | Lu Li1*, Tianyu Zhang2,3*, Zhiqi Bu4*, Suyuchen Wang5, Huan He1, Jie Fu6, Jiang Bian7, Yonghui Wu7, Yong Chen1, Yoshua Bengio2 — 1 University of Pennsylvania, 2 MILA, 3 ServiceNow, 4 Amazon AGI, 5 Université de Montréal, 6 HKUST, 7 University of Florida
Pseudocode | Yes | Algorithm 1: MAP; Algorithm 2: Nested-merging MAP; Algorithm 3: Bayesian Adaptive of Surrogate Model
Open Source Code | Yes | Our code is available at https://github.com/luli-git/MAP.
Open Datasets | Yes | We evaluate multi-task model merging on eight zero-shot image classification datasets following (Ilharco et al., 2022): SUN397 (Xiao et al., 2016), Cars (Krause et al., 2013), GTSRB (Stallkamp et al., 2011), MNIST (LeCun, 1998), EuroSAT (Helber et al., 2019), SVHN (Netzer et al., 2011), DTD (Cimpoi et al., 2014), and RESISC45 (Cheng et al., 2017). We use the ViT-B/32 architecture in CLIP (Radford et al., 2021) as the pre-trained model for the experiments on vision tasks discussed in the main text. We show the results of these experiments in the main pages. In addition to natural images, we used another dataset consisting of over 112,000 chest X-rays and 30,000 unique patients (National Institutes of Health et al., 2017). We performed additional experiments on ResNet18 (He et al., 2016) by merging two models fine-tuned on CIFAR10 (Krizhevsky et al., 2009) and Flowers102 (Nilsback & Zisserman, 2008).
Dataset Splits | No | The paper evaluates pre-trained and fine-tuned models in a zero-shot or model-merging setting rather than training new models, so explicit dataset splits are largely outside its scope. Although it mentions splitting the NIH ChestX-ray dataset into two task groups, it does not specify training/validation/test splits for any of the datasets used in its own experiments.
Hardware Specification | No | The paper does not specify the hardware used for its experiments (e.g., GPU models, CPU types, memory amounts); it only discusses computational cost and efficiency in general terms.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions) needed to replicate the experiments.
Experiment Setup | Yes | Algorithm 1 (MAP) — Input: pretrained model θ_pre, fine-tuned models {θ_ft^n}, n = 1, …, N. Compute task vectors {v_n = θ_ft^n − θ_pre | n ∈ {1, …, N}}. Sample K vectors c ∈ R^N; denote the set as Ω. Algorithm 3 (Bayesian Adaptive of Surrogate Model) — Input: number of iterations J, buffer B, pretrained model θ_pre, task vectors v_n, evaluators M_n(·) for the N tasks, discretization bin number K, per-iteration sample sizes n_j for j = 0 to J, bootstrap dropping rate α = 20%, bootstrap sampling number Q = 30.
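The task-vector computation and scaling-coefficient sampling quoted above can be sketched as follows. This is a minimal illustration using NumPy arrays in place of model parameter tensors; the function names (`task_vectors`, `merge`) and the uniform sampling of Ω are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def task_vectors(theta_pre, finetuned):
    """Compute v_n = theta_ft^n - theta_pre for each fine-tuned model,
    parameter tensor by parameter tensor (models are dicts of arrays)."""
    return [{k: theta_ft[k] - theta_pre[k] for k in theta_pre}
            for theta_ft in finetuned]

def merge(theta_pre, vectors, c):
    """Form the merged model theta_pre + sum_n c_n * v_n for one
    scaling vector c in R^N."""
    merged = {k: v.copy() for k, v in theta_pre.items()}
    for c_n, v_n in zip(c, vectors):
        for k in merged:
            merged[k] += c_n * v_n[k]
    return merged

# Sample K scaling vectors c in R^N to form the candidate set Omega;
# the uniform [0, 1] range here is an assumption for illustration.
rng = np.random.default_rng(0)
K, N = 8, 2
Omega = rng.uniform(0.0, 1.0, size=(K, N))
```

Each c in Ω yields one merged model, whose per-task performance MAP approximates with a quadratic surrogate instead of evaluating every candidate directly.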