Do Deep Neural Network Solutions Form a Star Domain?
Authors: Ankit Sonthalia, Alexander Rubinstein, Ehsan Abbasnejad, Seong Joon Oh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We substantiate our star domain conjecture with empirical evidence by introducing the Starlight algorithm to identify a candidate star model for a given learning task. We demonstrate that these star model candidates have low loss barriers with an arbitrary set of solutions that were not used in constructing the star model candidates. This provides strong evidence that there exist star models that are linearly connected with other solutions. In Figure 2, we show loss barriers between two independently trained solutions (blue regular-regular curves). We observe that the loss increases and accuracy drops significantly at around t = 0.5, even after applying the algorithm (Ainsworth et al., 2022) to find the winning permutation. We present another piece of evidence that the convexity conjecture does not hold for thin ResNets, reconfirming the findings of Ainsworth et al. (2022). Our experimental results confirm existing reports that the convexity conjecture requires very wide networks to hold, and has otherwise several failure cases for which we propose a relaxed version, viz., the star domain conjecture. We obtain strong empirical evidence that the star model found through Starlight is likely to be a true star model. |
| Researcher Affiliation | Academia | Ankit Sonthalia¹, Alexander Rubinstein¹, Ehsan Abbasnejad²,³, Seong Joon Oh¹ — ¹Tübingen AI Center, Universität Tübingen; ²University of Adelaide; ³Monash University |
| Pseudocode | Yes | Algorithm 1 Starlight: Training a Star Model. Input: dataset D = {(x_i, y_i)}_{i=1}^{I}, source models Z = {θ_1, θ_2, ..., θ_N}, initial model θ_0, learning rate λ, number of batches m, number of steps K. Set θ ← θ_0. |
| Open Source Code | Yes | Our code is available at https://github.com/aktsonthalia/starlight. |
| Open Datasets | Yes | We describe our main findings with reference to ResNet-18 (He et al., 2016) models trained on CIFAR (Krizhevsky et al., 2012) using SGD... a large-scale dataset (ImageNet-1k (Deng et al., 2009))... We use a minimal ViT model with the MNIST dataset. |
| Dataset Splits | No | The paper mentions data augmentation techniques for CIFAR and ImageNet but does not explicitly state the training/validation/test splits (e.g., percentages or counts) for the datasets used in the main experiments. It describes how images are padded, cropped, or resized for training but not the overall partitioning of the dataset. |
| Hardware Specification | Yes | We used NVIDIA A100 GPUs for most of our experiments. All experiments were performed on single GPUs. |
| Software Dependencies | No | We use the ResNet18 implementation included in PyTorch (Paszke et al., 2019). We leverage the open-source library FFCV (Leclerc et al., 2023) to speed up our experiments. Our implementation leverages an open-source Python package called rebasin: https://pypi.org/project/rebasin/. The paper mentions several software components like PyTorch, FFCV, and rebasin, but it does not specify their version numbers. |
| Experiment Setup | Yes | Our model training hyperparameters largely reflect standard practices, but we describe them here for completeness. ResNet18 on CIFAR: For ResNet18 models trained on CIFAR-10 and CIFAR-100, we use a batch size of 128. We normalize the data using ImageNet statistics. For data augmentation, we apply padding to the image or its horizontal mirror, and then randomly crop out a 32×32 region. We train for 200 epochs using SGD with momentum 0.9 and a weight decay of 5e-4. The initial learning rate is 0.1 and follows a cosine decay schedule to reach 0 by the end of training. DenseNet-40-12 on CIFAR: DenseNet uses a batch size of 64. The weight decay factor is 1e-4, and the models are trained for 300 epochs. The learning rate, initially 0.1, is multiplied by 0.1 at epochs 150 and 225. VGGs on CIFAR: The initial learning rate is set to 0.05 and is multiplied by 0.1 at epochs 100 and 150. ResNets on ImageNet: For ImageNet, we use a batch size of 256. Models are trained for 100 epochs, using SGD with a learning rate of 0.1 which is multiplied by 0.1 at epochs 30, 60, 90. The weight decay factor is 1e-4. |
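The Starlight pseudocode quoted in the table (sample a source model θ_n and an interpolation weight t, then take an SGD step on the loss of the interpolated model) can be sketched as a training loop. This is a minimal toy illustration under our own assumptions, not the paper's implementation: the "model" is a single scalar, each source model's loss is a quadratic stand-in for the real network loss, the gradient is computed analytically, and the name `starlight` is ours.

```python
import random

def starlight(source_models, theta0, lr=0.1, steps=3000, seed=0):
    """Toy sketch of Algorithm 1 (Starlight): find a star model theta
    whose linear interpolations toward every source model stay low-loss.

    Each "model" here is one scalar, and the loss toward source c is
    (theta_interp - c)^2 -- a stand-in for the true network loss."""
    rng = random.Random(seed)
    theta = theta0
    for k in range(steps):
        c = rng.choice(source_models)       # sample a source model theta_n
        t = rng.random()                    # sample interpolation weight t ~ U[0, 1]
        interp = (1 - t) * theta + t * c    # point on the segment theta -> theta_n
        # Analytic gradient of (interp - c)^2 w.r.t. theta,
        # since d(interp)/d(theta) = (1 - t):
        grad = 2 * (interp - c) * (1 - t)
        step_lr = lr / (1 + 0.005 * k)      # decaying step size to damp SGD noise
        theta -= step_lr * grad             # SGD step on the interpolated loss
    return theta

star = starlight(source_models=[-1.0, 0.0, 2.0], theta0=5.0)
```

With these quadratic stand-in losses the expected objective is minimized at the mean of the source models, so `star` settles near 1/3; for real networks the paper instead minimizes the training loss of interpolated networks over mini-batches.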
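The loss-barrier evaluation described in the first row (interpolate between two solutions and inspect the loss at intermediate t, e.g. the spike at t = 0.5) can likewise be sketched. A minimal illustration with a hypothetical two-basin loss of our own construction; `loss_barrier` and `toy_loss` are stand-ins, not the paper's networks or data.

```python
def interpolate(theta_a, theta_b, t):
    """Elementwise linear interpolation (1 - t) * theta_a + t * theta_b."""
    return [(1 - t) * a + t * b for a, b in zip(theta_a, theta_b)]

def loss_barrier(loss, theta_a, theta_b, num_points=21):
    """Barrier = max over t of loss(theta(t)) minus the linear baseline
    (1 - t) * loss(theta_a) + t * loss(theta_b)."""
    la, lb = loss(theta_a), loss(theta_b)
    barrier = 0.0
    for i in range(num_points):
        t = i / (num_points - 1)
        gap = loss(interpolate(theta_a, theta_b, t)) - ((1 - t) * la + t * lb)
        barrier = max(barrier, gap)
    return barrier

# Toy loss with two separate minima at (0, 0) and (4, 0): the straight
# path between them crosses a high-loss ridge at the midpoint.
def toy_loss(theta):
    x, y = theta
    return min(x * x + y * y, (x - 4) ** 2 + y * y)

b = loss_barrier(toy_loss, [0.0, 0.0], [4.0, 0.0])
# Both endpoints have loss 0 while the midpoint (2, 0) has loss 4,
# so the barrier evaluates to 4.0 here.
```

A zero (or near-zero) barrier between a candidate star model and an arbitrary independently trained solution is exactly the evidence the paper uses to support the star domain conjecture.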