OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition
Authors: Stephen Zhang, Vardan Papyan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate OATS on recent large language models (LLMs) Phi-3 (Abdin et al., 2024) and Llama-3 (Dubey et al., 2024), and on the vision transformers Google's ViT (Wu et al., 2020) and DINOv2 (Oquab et al., 2023), demonstrating that OATS achieves new state-of-the-art performance across a wide range of commonly employed performance metrics. Furthermore, by combining structured pruning with unstructured pruning, OATS accelerates CPU inference across all levels of compression when compared to models that utilize just unstructured pruning. |
| Researcher Affiliation | Academia | Stephen Zhang University of Toronto EMAIL Vardan Papyan University of Toronto EMAIL |
| Pseudocode | Yes | Algorithm 1 ALTERNATINGTHRESHOLDING — 1: Inputs: 2: Weight matrix W ∈ R^(d_out×d_in) 3: Iterations N 4: Rank r 5: Nonzeros k 6: Procedure: 7: S ← 0 8: for t = 1 to N do 9: L ← TRUNCATEDSVD(W − S, r) 10: S ← HARDTHRESHOLD(W − L, k) 11: end for 12: return S, L. Algorithm 2 OATS — 1: Inputs: 2: Layer inputs propagated through prior compressed layers X^ℓ ∈ R^(B×d_in) 3: Layer matrix W^ℓ ∈ R^(d_out×d_in) 4: Compression rate ρ 5: Rank ratio κ 6: Iterations N 7: Procedure: 8: r ← ⌊κ(1−ρ)d_out d_in / (d_out + d_in)⌋, k ← (1−κ)(1−ρ)d_out d_in 9: D ← diag(XᵀX)^(1/2) 10: L, S ← ALTERNATINGTHRESHOLDING(W D, N, r, k) 11: W ← (L + S)D⁻¹ 12: return X^(ℓ+1) ← X^ℓ Wᵀ |
| Open Source Code | Yes | Our code is available at: https://github.com/stephenqz/OATS. |
| Open Datasets | Yes | We evaluate OATS on two state-of-the-art families of LLMs: Phi-3 (Abdin et al., 2024) and Llama-3 (Dubey et al., 2024). We utilize LM Harness developed by Gao et al. (2024) to evaluate five-shot performance on the Massive Multitask Language Understanding benchmark by Hendrycks et al. (2021), zero-shot performance on eight tasks, and language generation on WikiText-2. Our calibration data consists of 128 sequences of length 2048 sampled from the first shard of the C4 training set (Raffel et al., 2020). We run experiments on Google's ViT-Base (Wu et al., 2020)... and DINOv2-Giant (Oquab et al., 2023). Figure 4 depicts the attention rollout for various images in the Microsoft COCO dataset (Lin et al., 2014). |
| Dataset Splits | Yes | Our calibration data consists of 128 sequences of length 2048 sampled from the first shard of the C4 training set (Raffel et al., 2020). To ensure consistency, we utilize the same calibration data for all pruning algorithms that we benchmark. A subset of 2048 images from the training set of ImageNet is used for calibration and is kept consistent across all pruning experiments. All linear layers in a transformer block are pruned uniformly to achieve the desired sparsity rate. We exclude from pruning any linear layers that are present in the model head and embeddings, which conforms with prior works by Frantar & Alistarh (2023), Sun et al. (2024b), and Zhang et al. (2024b). We evaluate top-1 accuracy on the validation set of ImageNet (Russakovsky et al., 2015). |
| Hardware Specification | Yes | We run end-to-end inference on a compressed Phi-3 Medium 15B model for a single token on an Intel Xeon Gold 6148 CPU @ 2.40GHz with 32 cores. All experiments utilized a single NVIDIA A40 with 48GB of GPU memory. For example, the time needed per transformer block of Llama-3 70B can be reduced to 71.10 seconds by compressing in parallel across four NVIDIA A40 GPUs. |
| Software Dependencies | No | We benchmark the CPU speedup of OATS over its competitors using the DeepSparse inference engine developed by Neural Magic (2021). We utilize Hugging Face's Transformers library to implement the large language models and vision transformers for our experiments (Wolf et al., 2020). The paper mentions libraries and tools like Hugging Face Transformers and the DeepSparse inference engine, but does not provide specific version numbers for these or for other software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Table 1: Hyperparameters utilized for OATS across model families — Iterations: 80 (Phi-3), 80 (Llama-3); Rank Ratio: 25% (Phi-3), 30% (Llama-3). We benchmark our algorithm across a wide range of compression rates: {0.3, 0.4, 0.5, 0.6}. For compression rates at or below 0.5, we compress all transformer blocks uniformly. At the higher compression rate of 0.6, we utilize Outlier Weighed Layerwise Sparsity Ratios (OWL) proposed by Yin et al. (2024b). All OATS experiments use a rank ratio of κ=20% and N=80 iterations. We utilize a blocksize of 128 across all experiments and a Hessian dampening of 0.01 and 0.1. All DSNoT experiments were run with 50 iterations and an update threshold of 0.1. |
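The pseudocode quoted above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the reconstruction of the rank/nonzero budget split, and the Wanda-style diagonal scaling D built from input column norms are assumptions inferred from the quoted Algorithms 1 and 2.

```python
import numpy as np

def truncated_svd(A, r):
    # Best rank-r approximation of A via the SVD.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def hard_threshold(A, k):
    # Keep the k largest-magnitude entries of A, zero the rest.
    S = np.zeros_like(A)
    idx = np.argsort(np.abs(A), axis=None)[-k:]
    S.flat[idx] = A.flat[idx]
    return S

def alternating_thresholding(W, n_iters, r, k):
    # Algorithm 1: alternately fit a rank-r term L and a k-sparse
    # term S so that W ≈ L + S.
    S = np.zeros_like(W)
    for _ in range(n_iters):
        L = truncated_svd(W - S, r)
        S = hard_threshold(W - L, k)
    return S, L

def oats_layer(X, W, rho, kappa, n_iters):
    # Algorithm 2 sketch. X: calibration inputs (B × d_in),
    # W: layer weights (d_out × d_in), rho: compression rate,
    # kappa: rank ratio. D is assumed to be the diagonal of
    # input column norms, as suggested by diag(XᵀX)^(1/2).
    d_out, d_in = W.shape
    budget = (1 - rho) * d_out * d_in           # retained-parameter budget
    r = int(kappa * budget / (d_out + d_in))    # low-rank share of the budget
    k = int((1 - kappa) * budget)               # sparse share of the budget
    d = np.sqrt(np.diag(X.T @ X))               # column norms of layer inputs
    S, L = alternating_thresholding(W * d, n_iters, r, k)
    return (L + S) / d                          # undo the diagonal scaling
```

Note that the budget arithmetic charges the low-rank factor r(d_out + d_in) parameters and the sparse factor k nonzeros, so the compressed layer retains roughly a (1 − ρ) fraction of the original parameter count, split κ : (1 − κ) between the two terms.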