Do Deep Neural Network Solutions Form a Star Domain?

Authors: Ankit Sonthalia, Alexander Rubinstein, Ehsan Abbasnejad, Seong Joon Oh

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We substantiate our star domain conjecture with empirical evidence by introducing the Starlight algorithm to identify a candidate star model for a given learning task. We demonstrate that these star model candidates have low loss barriers with an arbitrary set of solutions that were not used in constructing the star model candidates. This provides strong evidence that there exist star models that are linearly connected with other solutions. In Figure 2, we show loss barriers between two independently trained solutions (blue regular-regular curves). We observe that the loss increases and accuracy drops significantly around t = 0.5, even after applying the algorithm of Ainsworth et al. (2022) to find the winning permutation. We present further evidence that the convexity conjecture does not hold for thin ResNets, reconfirming the findings of Ainsworth et al. (2022). Our experimental results confirm existing reports that the convexity conjecture requires very wide networks to hold, and otherwise has several failure cases, for which we propose a relaxed version, viz., the star domain conjecture. We obtain strong empirical evidence that the star model found through Starlight is likely to be a true star model.
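The loss barrier cited above is measured along the linear interpolation path between two sets of trained weights: evaluate the loss at θ(t) = (1 − t)·θ_a + t·θ_b for t ∈ [0, 1] and compare the peak against the endpoints. A minimal sketch of that computation (the function name, parameter-dict representation, and toy loss below are our own illustration, not the paper's implementation):

```python
import numpy as np

def loss_barrier(theta_a, theta_b, loss_fn, steps=11):
    """Evaluate loss_fn along the linear path (1-t)*theta_a + t*theta_b.

    theta_a, theta_b: dicts mapping parameter names to arrays.
    Returns (barrier, losses), where the barrier is the peak loss on the
    path minus the mean of the two endpoint losses.
    """
    losses = []
    for k in range(steps):
        t = k / (steps - 1)
        # Interpolate every parameter tensor at the same t.
        theta_t = {name: (1 - t) * theta_a[name] + t * theta_b[name]
                   for name in theta_a}
        losses.append(loss_fn(theta_t))
    barrier = max(losses) - 0.5 * (losses[0] + losses[-1])
    return barrier, losses

# Toy double-well loss (w^2 - 1)^2: the minima w = -1 and w = +1 are not
# linearly connected, so the midpoint w = 0 produces a barrier of 1.
b, losses = loss_barrier({"w": np.array(-1.0)},
                         {"w": np.array(1.0)},
                         lambda th: float((th["w"] ** 2 - 1) ** 2),
                         steps=5)
```

A low barrier between a Starlight candidate and held-out solutions is exactly the evidence the row above describes; a large barrier, as in the double-well toy, is the failure mode of the convexity conjecture.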
Researcher Affiliation Academia Ankit Sonthalia (1), Alexander Rubinstein (1), Ehsan Abbasnejad (2,3), Seong Joon Oh (1). (1) Tübingen AI Center, Universität Tübingen; (2) University of Adelaide; (3) Monash University.
Pseudocode Yes Algorithm 1 Starlight: Training a Star Model. Input: dataset D = {(x_i, y_i)}_{i=1}^{I}, source models Z = {θ_1, θ_2, . . . , θ_N}, initial model θ_0, learning rate λ, number of batches m, number of steps K. Set θ ← θ_0.
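The loop body of such an algorithm can be sketched in a toy setting: optimize a star-model candidate θ so that interpolations between θ and each source model keep the loss low. Everything below is our illustrative rendition under stated assumptions, not the paper's code: the task is a quadratic loss with an analytic gradient, and the per-step sampling of one source model θ_n and one interpolation coefficient t is our assumption about how the loop uses its inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, c):
    """Toy quadratic task loss centred at the optimum c."""
    return float(np.sum((w - c) ** 2))

def grad(w, c):
    """Analytic gradient of the quadratic loss."""
    return 2.0 * (w - c)

# Hypothetical setup: source models are noisy solutions near the optimum.
c = np.array([1.0, -2.0])
sources = [c + 0.5 * rng.standard_normal(2) for _ in range(4)]

theta = rng.standard_normal(2)   # initial star-model candidate (θ_0)
lr = 0.1                          # learning rate (λ)
for step in range(500):           # number of steps (K)
    theta_n = sources[rng.integers(len(sources))]  # sample a source model
    t = rng.uniform()                              # point on the segment
    w = (1 - t) * theta + t * theta_n              # interpolated weights
    # Chain rule: d/dθ loss((1-t)θ + t·θ_n) = (1-t) · grad(w).
    theta -= lr * (1 - t) * grad(w, c)
```

Because the gradient is taken only with respect to θ while the source models stay frozen, θ is pushed toward a point whose segments to every source model pass through low-loss territory, which is the defining property of a star model.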
Open Source Code Yes Our code is available at https://github.com/aktsonthalia/starlight.
Open Datasets Yes We describe our main findings with reference to ResNet-18 (He et al., 2016) models trained on CIFAR (Krizhevsky et al., 2012) using SGD... a large-scale dataset (ImageNet-1k (Deng et al., 2009))... We use a minimal ViT model with the MNIST dataset.
Dataset Splits No The paper mentions data augmentation techniques for CIFAR and ImageNet but does not explicitly state the training/validation/test splits (e.g., percentages or counts) for the datasets used in the main experiments. It describes how images are padded, cropped, or resized for training but not the overall partitioning of the dataset.
Hardware Specification Yes We used NVIDIA A100 GPUs for most of our experiments. All experiments were performed on single GPUs.
Software Dependencies No We use the ResNet18 implementation included in PyTorch (Paszke et al., 2019). We leverage the open-source library FFCV (Leclerc et al., 2023) to speed up our experiments. Our implementation leverages an open-source Python package called rebasin: https://pypi.org/project/rebasin/. The paper mentions several software components like PyTorch, FFCV, and rebasin, but it does not specify their version numbers.
Experiment Setup Yes Our model training hyperparameters largely reflect standard practices, but we describe them here for completeness. ResNet18 on CIFAR. For ResNet18 models trained on CIFAR-10 and CIFAR-100, we use a batch size of 128. We normalize the data using ImageNet statistics. For data augmentation, we apply padding to the image or its horizontal mirror, and then randomly crop out a 32×32 region. We train for 200 epochs using SGD with momentum 0.9 and a weight decay of 5e-4. The initial learning rate is 0.1 and follows a cosine decay schedule to reach 0 by the end of training. DenseNet-40-12 on CIFAR. DenseNet uses a batch size of 64. The weight decay factor is 1e-4, and the models are trained for 300 epochs. The learning rate, initially 0.1, is multiplied by 0.1 at epochs 150 and 225. VGGs on CIFAR. The initial learning rate is set to 0.05 and is multiplied by 0.1 at epochs 100 and 150. ResNets on ImageNet. For ImageNet, we use a batch size of 256. Models are trained for 100 epochs, using SGD with a learning rate of 0.1 which is multiplied by 0.1 at epochs 30, 60, and 90. The weight decay factor is 1e-4.
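For quick reference, the two learning-rate schedules quoted above can be written down directly. This is a sketch of the schedules as described (cosine decay from 0.1 to 0 over 200 epochs for CIFAR; ×0.1 steps at epochs 30/60/90 for ImageNet); the function names and defaults are our own, and the paper's exact per-iteration (vs per-epoch) granularity is not specified.

```python
import math

def cosine_lr(epoch, total_epochs=200, lr0=0.1):
    """Cosine-decayed learning rate: lr0 at epoch 0, reaching ~0 at the
    end of training (the CIFAR ResNet18 schedule described above)."""
    return 0.5 * lr0 * (1 + math.cos(math.pi * epoch / total_epochs))

def step_lr(epoch, lr0=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone epoch
    (the ImageNet ResNet schedule described above)."""
    return lr0 * gamma ** sum(epoch >= m for m in milestones)
```

In PyTorch these correspond to the standard `CosineAnnealingLR` and `MultiStepLR` schedulers, which is one way a reproduction could wire the hyperparameters in this row into a training script.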