Linear Mode Connectivity between Multiple Models modulo Permutation Symmetries
Authors: Akira Ito, Masanori Yamada, Atsutoshi Kumagai
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we show that existing permutation search methods designed for two models can fail to transfer multiple models into the same convex low-loss basin. Next, we propose a permutation search method using a straight-through estimator for multiple models (STE-MM). We then experimentally demonstrate that even when multiple models are given, the test loss of the merged model remains nearly the same as the losses of the original models when using STE-MM, and the loss barriers between all permuted model pairs are also small. Additionally, from the perspective of the trace of the Hessian matrix, we show that the loss sharpness around the merged model decreases as the number of models increases with STE-MM, indicating that LMC for multiple models is more likely to hold. The source code implementing our method is available at https://github.com/e5-a/STE-MM. |
| Researcher Affiliation | Industry | 1NTT Social Informatics Laboratories 2NTT Computer and Data Science Laboratories. Correspondence to: Akira Ito <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 STE-MM Algorithm 2 WM Algorithm 3 Accelerated WM |
| Open Source Code | Yes | The source code implementing our method is available at https://github.com/e5-a/STE-MM. |
| Open Datasets | Yes | Three datasets were used in this study: MNIST (Lecun et al., 1998), Fashion-MNIST (FMNIST) (Xiao et al., 2017), and CIFAR10 (Krizhevsky et al., 2009). |
| Dataset Splits | No | The paper uses well-known datasets (MNIST, FMNIST, CIFAR10) that have standard splits, and it mentions using the training dataset to repair Batch Norm layers. However, it does not explicitly state the percentages or counts of the training, validation, and test splits, nor does it cite a specific paper or resource defining the exact splits used for its experiments, which is required for reproducibility. |
| Hardware Specification | Yes | All experiments were conducted on a Linux workstation with two AMD EPYC 7543 32-Core processors, eight NVIDIA A30 GPUs, and 512 GB of memory. |
| Software Dependencies | Yes | The PyTorch 2.5.1, PyTorch Lightning 2.4.0, and torchvision 0.20.1 libraries were used for model training and evaluation. |
| Experiment Setup | Yes | For training on the MNIST and FMNIST datasets, the Adam optimizer was used with a learning rate of 1×10⁻³. The batch size was fixed at 512, training was conducted for a maximum of 100 epochs, and no learning rate scheduler was used. For CIFAR10, optimization was conducted using SGD with a learning rate of 0.4 and a weight decay of 5×10⁻⁴; the batch size and maximum number of epochs were set to 500 and 100, respectively. For STE and STE-MM, the learning rate, number of epochs, and batch size were set to 0.001, 10, and 256, respectively. |
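The central quantity quoted from the abstract, the loss barrier between pairs of (permuted) models along a linear interpolation path, can be illustrated with a minimal sketch. This is not the authors' implementation: in real use `toy_loss` would be a network's test loss evaluated at interpolated parameters, and `loss_barrier` is a hypothetical helper name. Here a toy double-well loss with minima at ±1 stands in for a pair of symmetry-related solutions.

```python
# Minimal sketch (not the STE-MM implementation) of a loss-barrier check.
# Linear mode connectivity (LMC) holds when the loss along the linear
# interpolation between two parameter vectors stays close to the endpoint
# losses, i.e. the barrier below is near zero.

def toy_loss(theta):
    # Placeholder double-well loss with minima at theta_i = +1 and -1,
    # mimicking two symmetry-related solutions of a real network.
    return sum((t * t - 1.0) ** 2 for t in theta)

def interpolate(a, b, lam):
    # theta(lam) = (1 - lam) * a + lam * b
    return [(1 - lam) * x + lam * y for x, y in zip(a, b)]

def loss_barrier(a, b, steps=11):
    # Barrier = max loss on the path minus the mean of the endpoint losses.
    endpoint_mean = 0.5 * (toy_loss(a) + toy_loss(b))
    path = [toy_loss(interpolate(a, b, i / (steps - 1))) for i in range(steps)]
    return max(path) - endpoint_mean

# Same basin: zero barrier.  Opposite wells: the path crosses the hump at 0.
print(loss_barrier([1.0], [1.0]))   # 0.0
print(loss_barrier([1.0], [-1.0]))  # 1.0
```

In this toy picture, a permutation search method plays the role of mapping the second model into the same well as the first before merging, which is what drives the barrier toward zero.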
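For convenience, the hyperparameters reported in the experiment setup can be collected into a single configuration sketch. The dictionary layout and key names are illustrative rather than taken from the paper's code, and attributing the SGD settings to CIFAR10 is an assumption inferred from context (the Adam settings are stated to cover MNIST and FMNIST).

```python
# Hyperparameters transcribed from the paper's reported setup.
# NOTE: structure and key names are illustrative; mapping the SGD settings
# to CIFAR10 is an assumption inferred from the surrounding text.
TRAIN_CONFIG = {
    "mnist_fmnist": {"optimizer": "Adam", "lr": 1e-3, "batch_size": 512,
                     "max_epochs": 100, "lr_scheduler": None},
    "cifar10":      {"optimizer": "SGD", "lr": 0.4, "weight_decay": 5e-4,
                     "batch_size": 500, "max_epochs": 100},
    # Settings for the STE / STE-MM permutation search itself.
    "ste_mm":       {"lr": 1e-3, "epochs": 10, "batch_size": 256},
}

print(TRAIN_CONFIG["ste_mm"]["batch_size"])  # 256
```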