Sparse Training from Random Initialization: Aligning Lottery Ticket Masks using Weight Symmetry
Authors: Mohammed Adnan, Rohan Jain, Ekansh Sharma, Rahul Krishnan, Yani Ioannou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show a significant increase in generalization when sparse training from random initialization with the permuted mask as compared to using the non-permuted LTH mask, on multiple datasets (CIFAR-10/100 & ImageNet) and models (VGG11 & ResNet20/50). Our codebase for reproducing the results is publicly available at here. To empirically validate our hypothesis, we obtain a sparse mask using Iterative Magnitude Pruning (IMP) (Renda et al., 2020; Han et al., 2015) on model A (from Figure 1) and show that given a permutation that aligns the optimization basin of model A and a new random initialization, the mask can be reused. |
| Researcher Affiliation | Academia | 1 Schulich School of Engineering, University of Calgary; 2 Vector Institute for AI; 3 Dept. of Computer Science, University of Toronto. Correspondence to: Mohammed Adnan <EMAIL>, Yani Ioannou <EMAIL>. |
| Pseudocode | No | The paper describes the pruning steps in Appendix A.3, but these are high-level descriptions rather than structured, code-like pseudocode or algorithm blocks. For example, '1. In an unstructured, global manner, we identify and mask (set to zero) the smallest 20% of unpruned weights based on their magnitude. 2. This process is repeated for s rounds to achieve the target sparsity S, with each subsequent round pruning 20% of the remaining weights. 3. During each round, the model is trained for train_epochs_per_prune epochs.' is a descriptive list, not pseudocode. |
| Open Source Code | No | Our codebase for reproducing the results is publicly available at here. The term "here" is not a working hyperlink in the provided text; thus, access to the source code is not concretely provided within the text itself. |
| Open Datasets | Yes | We empirically show a significant increase in generalization when sparse training from random initialization with the permuted mask as compared to using the non-permuted LTH mask, on multiple datasets (CIFAR-10/100 & ImageNet) and models (VGG11 & ResNet20/50). For our set of experiments we used the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). We also validated our hypothesis on the ILSVRC 2012 (ImageNet) dataset... (Deng et al., 2009). |
| Dataset Splits | Yes | For our set of experiments we used the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). We apply the following standard data augmentation techniques to the training set: Random Horizontal Flip: Randomly flips the image horizontally with a given probability (by default, 50%). Random Crop: Randomly crops the image to a size of 32 × 32 pixels, with a padding of 4 pixels around the image. For our set of experiments we used the ImageNet dataset (Deng et al., 2009). We apply the following standard data augmentation techniques to the training set: Random Horizontal Flip: Randomly flips the image horizontally with a given probability (by default, 50%). Random Resized Crop: Randomly crops a region from the image and resizes it to 224 × 224 pixels. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments. It generally mentions 'reducing the computational cost' and 'resource-constrained devices' but gives no concrete specifications for its own experimental setup. |
| Software Dependencies | No | The paper mentions using 'PyTorch's torch.nn.utils.prune library (Paganini & Forde, 2020)' and 'PyTorch's distributed data parallel codebase for training models on ImageNet (Paszke et al., 2019)'. While PyTorch is mentioned, specific version numbers for PyTorch or other critical libraries are not provided. |
| Experiment Setup | Yes | We use the following hyperparameters for ResNet20 and VGG11 trained on CIFAR-10/100, as outlined in Table 2: Optimizer SGD; Momentum 0.9; Dense Learning Rate 0.08; Sparse Learning Rate 0.02; Weight Decay 5e-4; Batch Size 128; Epochs (T) 200. We use the following hyperparameters for ResNet50 trained on ImageNet, as outlined in Table 3: Optimizer SGD; Momentum 0.9; Dense Learning Rate 0.4; Sparse Learning Rate 0.4; Weight Decay 1e-4; Batch Size 1024; Epochs (T) 80. IMP settings (Table 4): train_epochs_per_prune 50 (ResNet20/VGG11) and 20 (ResNet50); Learning Rate 0.01 and 0.04, respectively. |
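The Appendix A.3 procedure quoted in the Pseudocode row (global, unstructured pruning of the smallest 20% of unpruned weights per round) can be sketched as follows. This is an illustrative NumPy reconstruction, not the authors' code; the retraining step between rounds is elided, and the function names are ours:

```python
import numpy as np

def global_magnitude_prune(weights, frac=0.2):
    """One IMP round: zero out the smallest `frac` of currently
    unpruned weights, globally across all layers (unstructured)."""
    flat = np.concatenate([w.ravel() for w in weights])
    alive = flat[flat != 0]                       # only unpruned weights
    k = int(len(alive) * frac)
    if k == 0:
        return weights
    # Magnitude threshold below which alive weights are pruned.
    threshold = np.sort(np.abs(alive))[k - 1]
    return [np.where(np.abs(w) <= threshold, 0.0, w) for w in weights]

def iterative_prune(weights, rounds):
    """Repeat for `rounds` rounds; sparsity approaches 1 - 0.8**rounds.
    (In the paper, the model is retrained between rounds.)"""
    for _ in range(rounds):
        weights = global_magnitude_prune(weights)
    return weights
```

Because pruning is global, layers with many small-magnitude weights lose proportionally more parameters than others, matching the "unstructured, global manner" described in the quote.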
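As a worked check of that schedule: pruning 20% of the remaining weights each round leaves 0.8^s of the weights after s rounds, so the number of rounds needed to reach a target sparsity S is the smallest s with 1 - 0.8^s >= S. A small helper (the name `rounds_for_sparsity` is ours, not the paper's):

```python
import math

def rounds_for_sparsity(target, frac=0.2):
    """Smallest number of prune rounds s such that
    1 - (1 - frac)**s >= target overall sparsity."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - frac))
```

For example, reaching 90% sparsity at 20% per round takes 11 rounds, since 1 - 0.8^10 ≈ 0.893 falls just short.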
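The augmentation steps quoted in the Dataset Splits row map directly onto standard torchvision transforms. A minimal config sketch (torchvision is assumed here; the paper does not give library versions, and transform ordering is our choice):

```python
from torchvision import transforms

# CIFAR-10/100 training pipeline, per the quoted description.
cifar_train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),      # flips with p=0.5 by default
    transforms.RandomCrop(32, padding=4),   # 32x32 crop with 4px padding
    transforms.ToTensor(),
])

# ImageNet training pipeline, per the quoted description.
imagenet_train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random region resized to 224x224
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```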