Linear Mode Connectivity between Multiple Models modulo Permutation Symmetries
Authors: Akira Ito, Masanori Yamada, Atsutoshi Kumagai
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we show that existing permutation search methods designed for two models can fail to transfer multiple models into the same convex low-loss basin. Next, we propose a permutation search method using a straight-through estimator for multiple models (STE-MM). We then experimentally demonstrate that even when multiple models are given, the test loss of the merged model remains nearly the same as the losses of the original models when using STE-MM, and the loss barriers between all permuted model pairs are also small. Additionally, from the perspective of the trace of the Hessian matrix, we show that the loss sharpness around the merged model decreases as the number of models increases with STE-MM, indicating that LMC for multiple models is more likely to hold. The source code implementing our method is available at https://github.com/e5-a/STE-MM. |
| Researcher Affiliation | Industry | 1NTT Social Informatics Laboratories 2NTT Computer and Data Science Laboratories. Correspondence to: Akira Ito <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 STE-MM Algorithm 2 WM Algorithm 3 Accelerated WM |
| Open Source Code | Yes | The source code implementing our method is available at https://github.com/e5-a/STE-MM. |
| Open Datasets | Yes | Three datasets were used in this study: MNIST (Lecun et al., 1998), Fashion-MNIST (FMNIST) (Xiao et al., 2017), and CIFAR10 (Krizhevsky et al., 2009). |
| Dataset Splits | No | The paper uses well-known datasets (MNIST, FMNIST, CIFAR10) that have standard splits, and it mentions using the training dataset to repair Batch Norm layers. However, it does not explicitly state the percentages or counts of the training, validation, and test splits, nor does it cite a specific paper or resource defining the exact splits used for its experiments, which is required for reproducibility. |
| Hardware Specification | Yes | All experiments were conducted on a Linux workstation with two AMD EPYC 7543 32-Core processors, eight NVIDIA A30 GPUs, and 512 GB of memory. |
| Software Dependencies | Yes | The PyTorch 2.5.1, PyTorch Lightning 2.4.0, and torchvision 0.20.1 libraries were used for model training and evaluation. |
| Experiment Setup | Yes | For training on the MNIST and FMNIST datasets, the Adam optimizer was used with a learning rate of 1×10⁻³. The batch size was fixed at 512, training was conducted for a maximum of 100 epochs, and no learning rate scheduler was used. For CIFAR10, optimization was conducted using SGD with a learning rate of 0.4 and a weight decay of 5×10⁻⁴; the batch size and maximum number of epochs were set to 500 and 100, respectively. For STE and STE-MM, the learning rate, number of epochs, and batch size were set to 0.001, 10, and 256, respectively. |
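The central quantity quoted from the abstract, the loss barrier between pairs of (permuted) models along a linear interpolation path, can be illustrated with a minimal sketch. This is not the authors' implementation: in real use `toy_loss` would be a network's test loss evaluated at interpolated parameters, and `loss_barrier` is a hypothetical helper name. Here a toy double-well loss with minima at ±1 stands in for a pair of symmetry-related solutions.

```python
# Minimal sketch (not the STE-MM implementation) of a loss-barrier check.
# Linear mode connectivity (LMC) holds when the loss along the linear
# interpolation between two parameter vectors stays close to the endpoint
# losses, i.e. the barrier below is near zero.

def toy_loss(theta):
    # Placeholder double-well loss with minima at theta_i = +1 and -1,
    # mimicking two symmetry-related solutions of a real network.
    return sum((t * t - 1.0) ** 2 for t in theta)

def interpolate(a, b, lam):
    # theta(lam) = (1 - lam) * a + lam * b
    return [(1 - lam) * x + lam * y for x, y in zip(a, b)]

def loss_barrier(a, b, steps=11):
    # Barrier = max loss on the path minus the mean of the endpoint losses.
    endpoint_mean = 0.5 * (toy_loss(a) + toy_loss(b))
    path = [toy_loss(interpolate(a, b, i / (steps - 1))) for i in range(steps)]
    return max(path) - endpoint_mean

# Same basin: zero barrier.  Opposite wells: the path crosses the hump at 0.
print(loss_barrier([1.0], [1.0]))   # 0.0
print(loss_barrier([1.0], [-1.0]))  # 1.0
```

In this toy picture, a permutation search method plays the role of mapping the second model into the same well as the first before merging, which is what drives the barrier toward zero.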
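For convenience, the hyperparameters reported in the experiment setup can be collected into a single configuration sketch. The dictionary layout and key names are illustrative rather than taken from the paper's code, and attributing the SGD settings to CIFAR10 is an assumption inferred from context (the Adam settings are stated to cover MNIST and FMNIST).

```python
# Hyperparameters transcribed from the paper's reported setup.
# NOTE: structure and key names are illustrative; mapping the SGD settings
# to CIFAR10 is an assumption inferred from the surrounding text.
TRAIN_CONFIG = {
    "mnist_fmnist": {"optimizer": "Adam", "lr": 1e-3, "batch_size": 512,
                     "max_epochs": 100, "lr_scheduler": None},
    "cifar10":      {"optimizer": "SGD", "lr": 0.4, "weight_decay": 5e-4,
                     "batch_size": 500, "max_epochs": 100},
    # Settings for the STE / STE-MM permutation search itself.
    "ste_mm":       {"lr": 1e-3, "epochs": 10, "batch_size": 256},
}

print(TRAIN_CONFIG["ste_mm"]["batch_size"])  # 256
```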