Structure-Preserving Network Compression Via Low-Rank Induced Training Through Linear Layers Composition
Authors: Ismail Alkhouri, Xitong Zhang, Rongrong Wang
TMLR 2024 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results (i) demonstrate the effectiveness of our approach using MNIST on Fully Connected Networks, CIFAR10 on Vision Transformers, and CIFAR10/100 and ImageNet on Convolutional Neural Networks, and (ii) illustrate that we achieve either competitive or state-of-the-art results when compared to leading structured pruning and low-rank training methods in terms of FLOPs and parameters drop. |
| Researcher Affiliation | Academia | Ismail R. Alkhouri (EMAIL; EMAIL), Department of Computational Mathematics, Science & Engineering, Michigan State University, and Department of Electrical Engineering & Computer Science, University of Michigan, Ann Arbor; Xitong Zhang (EMAIL), Department of Computational Mathematics, Science & Engineering, Michigan State University; Rongrong Wang (EMAIL), Department of Computational Mathematics, Science & Engineering, and Department of Mathematics, Michigan State University |
| Pseudocode | Yes | Algorithm 1 Compression with LoRITa+SVT. Input: L trainable weights Wi, i ∈ [L], factorization parameter N > 1, and singular value truncation parameter r. Output: Compressed and trained weights. |
| Open Source Code | Yes | Our code is available at https://github.com/XitongSystem/LoRITa/tree/main. |
| Open Datasets | Yes | Our experimental results (i) demonstrate the effectiveness of our approach using MNIST on Fully Connected Networks, CIFAR10 on Vision Transformers, and CIFAR10/100 and ImageNet on Convolutional Neural Networks... |
| Dataset Splits | No | The paper mentions using well-known datasets like MNIST, CIFAR10, CIFAR100, and ImageNet but does not explicitly provide specific training/validation/test splits, percentages, or methodology for reproducing the data partitioning for the main experiments. It mentions '120 randomly subsampled training data to compute E(l)' in Appendix A, but this is for an internal iterative process, not the overall dataset split for model training and evaluation. |
| Hardware Specification | No | The paper states, 'We use PyTorch to conduct our experiments,' but does not provide any specific details regarding the hardware used for these experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions 'We use PyTorch to conduct our experiments,' but it does not specify the version number of PyTorch or any other software dependencies required to reproduce the experimental setup. |
| Experiment Setup | Yes | First, we evaluate our proposed method on fully connected neural networks, varying the number of layers, utilizing the Adam optimizer with a learning rate set to 1×10⁻², and employing a constant layer dimension of 96 (other than the last). Overparameterization is applied across all layers in the model. To ensure a fair comparison, we begin by tuning the baseline model (N = 1) across a range of weight decay parameters {5×10⁻⁶, 1×10⁻⁵, 2×10⁻⁵, 5×10⁻⁵, 1×10⁻⁴, 2×10⁻⁴}. Subsequently, we extend our exploration of weight decay within the same parameter range for models with N > 1. ... The learning rate applied in this evaluation is set to 3×10⁻⁴. The weight decay was searched over {1×10⁻², 5×10⁻³, 1×10⁻³} for CIFAR10 and {1×10⁻⁵, 5×10⁻⁵, 1×10⁻⁴} for CIFAR100. ... All the considered ViT models underwent optimization via the Adam optimizer with a learning rate of 3×10⁻⁴. The hidden dimension is 256 for all ViTs. ... we initially fine-tuned the baseline model (N = 1) across the following weight decay parameters {5×10⁻⁵, 1×10⁻⁴, 2×10⁻⁴, 5×10⁻⁴, 1×10⁻³, 2×10⁻³}. |
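The core mechanism behind Algorithm 1 (LoRITa+SVT) can be illustrated with a minimal NumPy sketch: each trainable weight is overparameterized as a composition of N linear factors, and after training the composed matrix is compressed by singular value truncation (SVT) to rank r. This is an illustrative assumption of the paper's idea, not the authors' implementation; the function names `factorize_init`, `compose`, and `svt`, and the choice of square inner factors, are hypothetical.

```python
import numpy as np

def factorize_init(d_out, d_in, N, rng):
    # Overparameterize one d_out x d_in linear layer as a product of N factors,
    # W = F_N @ ... @ F_1. Square inner factors are a simplifying assumption.
    dims = [d_in] * N + [d_out]
    return [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i])
            for i in range(N)]

def compose(factors):
    # Collapse the composition of linear layers back into a single matrix.
    W = factors[0]
    for F in factors[1:]:
        W = F @ W
    return W

def svt(W, r):
    # Singular value truncation: keep only the top-r singular triplets.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
factors = factorize_init(d_out=64, d_in=96, N=3, rng=rng)  # N > 1 as in Algorithm 1
W = compose(factors)          # effective weight seen at inference time
W_r = svt(W, r=8)             # compressed weight of rank at most r
```

In the paper, the weight-decay regularization during training encourages the composed product to be approximately low-rank, so the truncation step discards little accuracy; this sketch only shows the factorize/compose/truncate plumbing.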