An Efficient Training Algorithm for Models with Block-wise Sparsity

Authors: Ding Zhu, Zhiqun Zuo, Mohammad Mahdi Khalili

TMLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our extensive empirical and theoretical analyses show that our algorithms can decrease the computation and memory costs significantly without a performance drop compared to baselines. Through extensive empirical study, we show that in some cases, our proposed method can reduce the number of training parameters and training FLOPs by 97% with a minimal accuracy drop." |
| Researcher Affiliation | Academia | Ding Zhu (EMAIL), Zhiqun Zuo (EMAIL), and Mohammad Mahdi Khalili (EMAIL), all with the Department of Computer Science and Engineering, The Ohio State University. |
| Pseudocode | No | The paper includes mathematical formulations, propositions, and proofs, but does not contain any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper does not state that source code for the described methodology is released, nor does it link to a code repository; it only mentions the use of the PyTorch package. |
| Open Datasets | Yes | "The MNIST dataset contains 60000 training images and 10000 testing images which are gray pictures of digits 0 to 9. The size of each image is 28×28 pixels. ... We conduct an experiment with our approach with the ViT-tiny, ViT-base (Dosovitskiy et al., 2021; Touvron et al., 2021) and Swin-Transformer Tiny (Liu et al., 2021b) on CIFAR-100 image classification dataset (He et al., 2016)." |
| Dataset Splits | Yes | "The MNIST dataset contains 60000 training images and 10000 testing images which are gray pictures of digits 0 to 9. ... The dataset has 60 thousand pictures of 100 different categories. The model is trained using this dataset for 300 epochs." |
| Hardware Specification | Yes | "we used a server with 64 CPUs of AMD EPYC 7313 16-Core Processor. The server has 8 RTX A5000 GPUs, with 24GB memory for each one." |
| Software Dependencies | No | "We used PyTorch package ptflops to calculate the number of flops." While the paper mentions PyTorch and ptflops, it does not specify version numbers for these software components. |
| Experiment Setup | Yes | "We keep the rank of our decomposition equal to 2. ... We set λ1 = λ2 = 0.01 and increase these parameters by 0.002 every 5 epochs. We continue training for 50 epochs. ... The rank of the decomposition under our algorithm is 5 for all the layers in all the experiments. ... The model is trained using this dataset for 300 epochs. We keep the rank of our algorithm equal to 4." |
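The Experiment Setup row reports rank-2 to rank-5 decompositions and a sparsity-penalty schedule (λ1 = λ2 = 0.01, increased by 0.002 every 5 epochs). Since the paper provides no pseudocode, the following is a minimal illustrative sketch, not the authors' implementation: it computes the parameter saving of a rank-r factorization W ≈ UV of a dense layer, and the penalty schedule as quoted above. The layer dimensions and function names are hypothetical.

```python
def lowrank_params(d_in, d_out, rank):
    """Parameter counts for a dense layer W (d_in x d_out) versus its
    rank-r factorization W ~ U @ V with U (d_in x r) and V (r x d_out).
    Returns (dense_count, factored_count, fractional_saving)."""
    dense = d_in * d_out
    factored = rank * (d_in + d_out)
    return dense, factored, 1.0 - factored / dense


def lambda_schedule(epoch, lam0=0.01, step=0.002, every=5):
    """Penalty schedule quoted in the paper: start at lambda = 0.01 and
    add 0.002 every 5 epochs."""
    return lam0 + step * (epoch // every)


if __name__ == "__main__":
    # Hypothetical 4096x4096 layer at rank 2 (dimensions are illustrative,
    # not taken from the paper).
    dense, factored, saving = lowrank_params(4096, 4096, 2)
    print(dense, factored, round(saving, 4))
    print(lambda_schedule(0), lambda_schedule(10))
```

For a square 4096-wide layer at rank 2 the factorization keeps well under 3% of the dense parameters, which is consistent in spirit with the 97% reduction the abstract reports, though the paper's exact figure depends on its architectures and ranks.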