An Efficient Training Algorithm for Models with Block-wise Sparsity
Authors: Ding Zhu, Zhiqun Zuo, Mohammad Mahdi Khalili
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive empirical and theoretical analyses show that our algorithms can decrease the computation and memory costs significantly without a performance drop compared to baselines. Through extensive empirical study, we show that in some cases, our proposed method can reduce the number of training parameters and training FLOPs by 97% with a minimal accuracy drop. |
| Researcher Affiliation | Academia | Ding Zhu, Zhiqun Zuo, and Mohammad Mahdi Khalili; Department of Computer Science and Engineering, The Ohio State University |
| Pseudocode | No | The paper includes mathematical formulations, propositions, and proofs, but does not contain any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper does not explicitly state that source code for the described methodology is released or provide a link to a code repository. It only mentions the use of the PyTorch package. |
| Open Datasets | Yes | The MNIST dataset contains 60000 training images and 10000 testing images, which are grayscale pictures of digits 0 to 9. The size of each image is 28 x 28 pixels. ... We conduct an experiment with our approach with the ViT-tiny, ViT-base (Dosovitskiy et al., 2021; Touvron et al., 2021) and Swin-Transformer Tiny (Liu et al., 2021b) on the CIFAR-100 image classification dataset (He et al., 2016). |
| Dataset Splits | Yes | The MNIST dataset contains 60000 training images and 10000 testing images which are gray pictures of digits 0 to 9. ... The dataset has 60 thousand pictures of 100 different categories. The model is trained using this dataset for 300 epochs. |
| Hardware Specification | Yes | we used a server with 64 CPUs (AMD EPYC 7313 16-Core Processor). The server has 8 RTX A5000 GPUs, each with 24GB of memory. |
| Software Dependencies | No | We used the PyTorch package ptflops to calculate the number of FLOPs. While the paper mentions PyTorch and ptflops, it does not specify version numbers for these software components. |
| Experiment Setup | Yes | We keep the rank of our decomposition equal to 2. ... We set λ1 = λ2 = 0.01 and increase these parameters by 0.002 every 5 epochs. We continue training for 50 epochs. ... The rank of the decomposition under our algorithm is 5 for all the layers in all the experiments. ... The model is trained using this dataset for 300 epochs. We keep the rank of our algorithm equal to 4. |
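The reported 97% reduction in training parameters and FLOPs follows from the paper's low-rank decomposition: replacing a dense m x n weight with a rank-r factorization stores r(m + n) values instead of mn. A minimal sketch of that accounting, where the helper names and the 784 x 512 layer dimensions are illustrative assumptions (only the rank of 2 comes from the paper's MNIST setup):

```python
def factorized_params(m: int, n: int, r: int) -> int:
    """Parameters in a rank-r factorization W ~= U @ V,
    with U of shape (m, r) and V of shape (r, n)."""
    return r * (m + n)

def param_reduction(m: int, n: int, r: int) -> float:
    """Fraction of parameters saved relative to a dense m x n weight."""
    return 1.0 - factorized_params(m, n, r) / (m * n)

# Illustrative dense layer (784 inputs, 512 outputs) at rank 2:
# the saving comfortably exceeds the 97% figure quoted above.
print(f"{param_reduction(784, 512, 2):.1%}")
```

Matrix-vector FLOPs scale the same way (roughly 2mn dense versus 2r(m + n) factorized per sample), which is why the FLOP and parameter reductions move together.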
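The regularization schedule quoted under Experiment Setup (λ1 = λ2 = 0.01, increased by 0.002 every 5 epochs, training for 50 epochs) is a simple step function of the epoch index. A hedged sketch, where `penalty_at` is a hypothetical helper name, not the paper's code:

```python
def penalty_at(epoch: int, start: float = 0.01,
               step: float = 0.002, every: int = 5) -> float:
    """Sparsity-penalty weight at a given epoch: starts at `start`
    and grows by `step` after every `every` completed epochs."""
    return start + step * (epoch // every)

# Over the 50-epoch run described in the paper, the weight steps
# up from 0.01 (epochs 0-4) to 0.028 (epochs 45-49).
schedule = [penalty_at(e) for e in range(50)]
```

Ramping the penalty gradually is a common pattern for sparsity-inducing training: early epochs fit the data, and later epochs increasingly push block structure into the weights.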