How Much Pre-training Is Enough to Discover a Good Subnetwork?

Authors: Cameron R. Wolfe, Fangshuo Liao, Qihan Wang, Junhyung Lyle Kim, Anastasios Kyrillidis

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Lastly, we empirically validate our theoretical results on multi-layer perceptrons and residual-based convolutional networks trained on MNIST, CIFAR, and ImageNet datasets."
Researcher Affiliation | Academia | Cameron R. Wolfe, Fangshuo Liao, Qihan Wang, Junhyung Lyle Kim, and Anastasios Kyrillidis: Department of Computer Science, Rice University.
Pseudocode | Yes | Algorithm 1: Greedy Forward Selection; Algorithm 2: Greedy Forward Selection for Deep CNNs; Algorithm 3: Distributed Greedy Forward Selection for Two-Layer Networks.
Open Source Code | No | The paper mentions using a "public implementation of greedy forward selection (Ye, 2021)" but does not state that the authors release their own code or the specific adaptations used in this work; the reference is to a third party's public code.
Open Datasets | Yes | "Lastly, we empirically validate our theoretical results on multi-layer perceptrons and residual-based convolutional networks trained on MNIST, CIFAR, and ImageNet datasets. We perform structured pruning experiments with two-layer networks on MNIST (Deng, 2012) by pruning hidden neurons via greedy forward selection. We perform structured pruning experiments (i.e., channel-based pruning) using ResNet34 (He et al., 2015) and MobileNetV2 (Sandler et al., 2018) architectures on CIFAR10 and ImageNet (Krizhevsky et al., 2009; Deng et al., 2009)."
Dataset Splits | Yes | "To study how dataset size affects subnetwork performance, we construct sub-datasets of sizes 1K to 50K (i.e., in increments of 5K) from the original MNIST dataset by uniformly sampling examples from the ten original classes. Three CIFAR10 sub-datasets of size 10K, 30K, and 50K (i.e., full dataset) are created using uniform sampling across classes. This grid search is performed using a validation set on CIFAR10, constructed using a random 80-20 split on the training dataset."
Hardware Specification | Yes | "Experiments are run on an internal cluster with two Nvidia RTX 3090 GPUs using the public implementation of greedy forward selection (Ye, 2021)."
Software Dependencies | No | The paper mentions using a "public implementation of greedy forward selection (Ye, 2021)" and adopts the settings of a widely used, open-source repository (pytorch-cifar, https://github.com/kuangliu/pytorch-cifar, 2017), but it does not specify version numbers for any software components such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | "The two-layer network is pre-trained for 8K iterations in total and pruned every 1K iterations to a size of 200 hidden nodes. Pre-training is conducted for 80K iterations using SGD with momentum and a cosine learning rate decay schedule starting at 0.1. We use a batch size of 128 and weight decay of 5e-4. The dense model is independently pruned every 20K iterations, and subnetworks are fine-tuned for 2500 iterations with an initial learning rate of 0.01 before being evaluated. We adopt ε = 0.02 and ε = 0.05 for MobileNetV2 and ResNet34, respectively... Models are pre-trained for 150 epochs using SGD with momentum and cosine learning rate decay with an initial value of 0.1. We use a batch size of 128 and weight decay of 5e-4. The dense network is independently pruned every 50 epochs, and the subnetwork is fine-tuned for 80 epochs using a cosine learning rate schedule with an initial value of 0.01 before being evaluated."
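The greedy forward selection procedure assessed above (Algorithm 1 of the paper) repeatedly adds, starting from an empty subnetwork, the hidden neuron whose inclusion most reduces the loss of the averaged subnetwork output. The sketch below is a hypothetical NumPy simplification for a two-layer regression network, not the paper's implementation; the function name, the mean-squared-error objective, and selection with replacement are illustrative assumptions.

```python
import numpy as np

def greedy_forward_selection(neuron_outputs, target, k):
    """Greedily pick k neurons whose averaged outputs best fit the target.

    neuron_outputs: (m, n) array, per-neuron output on n training samples.
    target: (n,) regression target.
    Returns the selected neuron indices and the final subnetwork loss.
    """
    m, _ = neuron_outputs.shape
    selected = []
    current = np.zeros_like(target, dtype=float)
    for _ in range(k):
        best_j, best_loss = None, np.inf
        for j in range(m):
            # Output of the subnetwork if neuron j were added (running average).
            trial = (current * len(selected) + neuron_outputs[j]) / (len(selected) + 1)
            loss = np.mean((trial - target) ** 2)
            if loss < best_loss:
                best_loss, best_j = loss, j
        selected.append(best_j)
        current = (current * (len(selected) - 1) + neuron_outputs[best_j]) / len(selected)
    return selected, best_loss
```

If one neuron's output already matches the target, the first greedy step selects it and the loss drops to zero; in the paper's setting this inner loop is repeated at each pruning checkpoint during pre-training.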
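The class-balanced sub-dataset construction and the random 80-20 validation split reported in the Dataset Splits row are straightforward to reproduce. A minimal NumPy sketch follows; the function names and the fixed seed are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def make_subdataset(labels, size, num_classes=10, seed=0):
    """Uniformly sample `size` examples, size // num_classes per class."""
    rng = np.random.default_rng(seed)
    per_class = size // num_classes
    idx = []
    for c in range(num_classes):
        pool = np.flatnonzero(labels == c)  # indices of class-c examples
        idx.append(rng.choice(pool, per_class, replace=False))
    return np.concatenate(idx)

def train_val_split(indices, val_frac=0.2, seed=0):
    """Random 80-20 split, as used for the CIFAR10 grid search."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(indices)
    cut = int(len(perm) * (1 - val_frac))
    return perm[:cut], perm[cut:]
```

For example, `make_subdataset(labels, 10_000)` would yield a 10K CIFAR10 sub-dataset with exactly 1K examples per class, matching the uniform-sampling construction described in the paper.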
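Both training setups in the Experiment Setup row use cosine learning rate decay (initial value 0.1 for pre-training, 0.01 for fine-tuning). A minimal sketch of the standard schedule, assuming decay to zero over the full run with no warmup (the paper does not spell out these details):

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Standard cosine annealing from base_lr down to 0."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

# Pre-training per the paper: 80K iterations starting at 0.1;
# fine-tuning: 2500 iterations starting at 0.01.
```

At step 0 this returns the full base rate, at the midpoint half of it, and at the final step approximately zero, which is the behavior the quoted setup describes.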