Learning $k$-Level Structured Sparse Neural Networks Using Group Envelope Regularization

Authors: Yehonathan Refael, Iftach Arbel, Wasim Huleihel

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we experiment and illustrate the efficiency of our proposed method in terms of the compression ratio, accuracy, and inference latency."
Researcher Affiliation | Academia | Yehonathan Refael (Department of Electrical Engineering-Systems, Tel Aviv University); Iftach Arbel (Independent Researcher); Wasim Huleihel (Department of Electrical Engineering-Systems, Tel Aviv University)
Pseudocode | Yes | Algorithm 1: General Stochastic Proximal Gradient Method; Algorithm 2: Learning structured k-level sparse neural-network by Prox SGD with WGSEF regularization; Algorithm 3: General Stochastic Proximal Gradient Method
Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the described methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | "These architectures were tested on the datasets CIFAR-10 (Krizhevsky et al., 2009) and Fashion-MNIST (Xiao et al., 2017). In this subsection, we compare our method to the state-of-the-art pruning techniques, which are often used as an alternative for model compression during (or post) training. We train ResNet50 with the ImageNet dataset. We examine the effectiveness of the WGSEF in the LeNet-5 convolutional neural network (LeCun et al., 1998) (the architecture is the PyTorch one, not Caffe, and is given in Appendix A.5), on the MNIST dataset (LeCun & Cortes, 2010). In Table 5, we present the results when training both VGG16 and DenseNet40 (Huang et al., 2018a) on CIFAR-100 (Krizhevsky et al.)"
Dataset Splits | No | The paper uses well-known datasets (CIFAR-10, Fashion-MNIST, ImageNet, MNIST, and CIFAR-100) that have standard splits, but it does not explicitly state split percentages, sample counts, or how the data was partitioned for its experiments. For instance, it refers to "validation datasets" without giving their size or origin.
Hardware Specification | Yes | "Experiments were conducted using a mini-batch size of b = 128 on an A100 GPU."
Software Dependencies | No | The paper mentions PyTorch in the context of the LeNet-5 architecture but does not specify a version number for PyTorch or for any other software library or solver used in the experiments.
Experiment Setup | Yes | "All experiments were conducted over 300 epochs. For the first 150 epochs, we employed Algorithm 2, and for the leftover epochs, we used the HSPG with the WGSEF acting as a regularizer (i.e., Algorithm 3). Experiments were conducted using a mini-batch size of b = 128 on an A100 GPU. The coefficient for the WGSE regularizer was set to λ = 10^-2. Again, to have a fair comparison, the baseline model was trained using SGD, both with an initial learning rate of α0 = 0.01, regularization magnitude λ = 0.03, a batch size of 128, and a cosine annealing learning rate scheduler. We train ResNet50 with the ImageNet dataset, using λ = 0.05, with an initial learning rate of α0 = 0.01, sparsity level k = 0.34, and use a cosine annealing learning rate scheduler. The networks were trained with a learning rate of 0.001, regularization magnitude λ = 10^-5, and a batch size of 32 for 150 epochs across 5 runs. We use a learning rate equal to 1e-4, with a batch size of 32, a momentum 0.95, and 15 epochs."
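The pseudocode row names a "General Stochastic Proximal Gradient Method" (Algorithms 1 and 3) used with the WGSEF regularizer. The paper's WGSEF proximal operator is not reproduced in this report; as an illustrative stand-in only, the overall update structure can be sketched with a standard group soft-threshold prox (a group-lasso prox, not the paper's operator — `group_soft_threshold` and `prox_sgd_step` are hypothetical names):

```python
import numpy as np

def group_soft_threshold(w, groups, lam, lr):
    """Prox step on a group regularizer: shrink each parameter
    group toward zero; groups whose norm falls below lam * lr are
    zeroed entirely. This is the group-lasso prox, used here as an
    illustrative stand-in for the paper's WGSEF proximal operator."""
    out = w.copy()
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        scale = max(0.0, 1.0 - lam * lr / norm) if norm > 0 else 0.0
        out[idx] = scale * w[idx]
    return out

def prox_sgd_step(w, grad, groups, lr=0.01, lam=0.03):
    """One stochastic proximal-gradient update: a gradient step on
    the (mini-batch) loss, followed by a prox step on the regularizer."""
    return group_soft_threshold(w - lr * grad, groups, lam, lr)
```

With a large enough regularization magnitude, groups with small norm are set exactly to zero after the prox step, which is how proximal methods of this family produce structured (group-level) sparsity during training rather than by post-hoc thresholding.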
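The experiment setup repeatedly cites a cosine annealing learning-rate scheduler with α0 = 0.01. The schedule follows the standard closed form, sketched here in plain Python (framework-independent); `lr_min = 0` is an assumption, as the paper states no floor, and the function name is illustrative:

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_init=0.01, lr_min=0.0):
    """Standard cosine annealing: decay lr_init to lr_min over
    total_epochs. lr_init = 0.01 matches the reported α0;
    lr_min = 0 is an assumption (no floor is stated in the paper)."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cos)
```

Under this schedule the learning rate starts at 0.01, reaches half its initial value at the midpoint (epoch 150 of 300), and decays to the floor at the final epoch.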