Sparse Learning for State Space Models on Mobile

Authors: Xuan Shen, Hangyu Zheng, Yifan Gong, Zhenglun Kong, Changdi Yang, Zheng Zhan, Yushu Wu, Xue Lin, Yanzhi Wang, Pu Zhao, Wei Niu

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate that our method achieves superior task performance compared to other semi-structured pruning methods and achieves up to 7× speedup compared to the llama.cpp framework on mobile devices. ... We implement the sparse model with our proposed kernels on mobile devices and achieve a practical on-device speedup of up to 7× compared to llama.cpp.
Researcher Affiliation Academia 1 Northeastern University, 2 University of Georgia, 3 Harvard University; {shen.xu}@northeastern.edu
Pseudocode Yes The algorithm for sparse learning is presented in Algorithm 1, featuring r_e as the effectiveness loss ratio and r_t as the target loss ratio. The max_index operation retrieves the index of the maximum value along a specified dimension. The output N_output generated by the algorithm represents the optimal sparse structure using the proposed kernel.
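Algorithm 1 itself is not reproduced in this review; as a minimal illustration of the max-index operation it describes, a plain-Python sketch (the function name and score values are hypothetical, not from the paper) might look like:

```python
def max_index(values):
    """Return the index of the maximum value in a 1-D sequence,
    mirroring the max-index operation described for Algorithm 1."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical effectiveness scores for four candidate sparse structures;
# the selected index would correspond to the structure kept as N_output.
scores = [0.42, 0.87, 0.55, 0.31]
best = max_index(scores)
```

In the paper the operation is applied along a specified dimension of a tensor; this 1-D version only conveys the selection step.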
Open Source Code No The paper describes the implementation of their method and compares it against other open-source frameworks like llama.cpp, but does not explicitly state that their own code for the proposed methodology is publicly available, nor does it provide a link.
Open Datasets Yes We evaluate the task performance on multiple common sense reasoning datasets including LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), Arc-easy (Clark et al., 2018), Arc-challenge (Clark et al., 2018), and WinoGrande (Sakaguchi et al., 2021).
Dataset Splits No The paper states it uses several common sense reasoning datasets but does not explicitly specify the training, test, or validation splits for these datasets. It mentions 'calibration with only 128 training samples' for weight compensation, which is not a general dataset split for model evaluation.
Hardware Specification Yes The latency evaluations are conducted on a OnePlus 11 mobile device equipped with a Snapdragon 8 Gen 2 SoC, featuring an octa-core Kryo CPU and Qualcomm Adreno 740 GPU with 16GB memory. ... Additionally, the results tested on another edge device, Xiaomi 6, which is equipped with a Snapdragon 835, featuring an octa-core CPU and an Adreno GPU with 6GB of memory, are included in Table A7.
Software Dependencies No The paper mentions comparing against 'llama.cpp (contributors, 2023a)' and that their compiler generates 'C++/Assembly code for mobile CPUs and OpenCL code for mobile GPUs'. However, no specific version numbers for llama.cpp, the compiler itself, or any other software libraries/dependencies are provided.
Experiment Setup Yes For sparse learning, we train over 1000 steps using the SGD optimizer at a starting learning rate of 0.1, decaying it by 0.1 every 200 steps. The original model weights are frozen, and sparse learning for Mamba-2.8B takes 50 minutes on an A6000 GPU. ... The batch size is set to 1 for all models unless otherwise specified. Each experiment is repeated 50 times.
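The step-decay schedule quoted above (start at 0.1, multiply by 0.1 every 200 steps over 1000 steps) can be sketched in plain Python; the function name and signature are illustrative, not from the paper:

```python
def step_decay_lr(step, base_lr=0.1, gamma=0.1, step_size=200):
    """Learning rate after `step` optimizer steps under the quoted schedule:
    start at base_lr and multiply by gamma every step_size steps."""
    return base_lr * gamma ** (step // step_size)

# Over the 1000 training steps mentioned above, the rate falls from
# 0.1 (steps 0-199) down to roughly 1e-5 (steps 800-999).
schedule = [step_decay_lr(s) for s in (0, 200, 400, 800)]
```

In a PyTorch setup this corresponds to SGD combined with a step-based learning-rate scheduler using the same `gamma` and `step_size`.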