Training-Free Activation Sparsity in Large Language Models
Authors: James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TEAL on the Mistral (Jiang et al., 2023), Llama-2 (Touvron et al., 2023), and Llama-3 (Dubey et al., 2024) families. We measure the performance of sparsified models on language modeling using the WikiText (Merity et al., 2016) validation set, and on an aggregate of six downstream tasks using the EleutherAI LM Harness (Gao et al., 2023)... Main Results. TEAL is performant, as shown in Tables 1 and 2, showcasing near-zero degradation at 25%, and minimal degradation at 40% sparsity. ... We benchmark TEAL's end-to-end single-batch decoding latency by integrating it with GPT-Fast (PyTorch, 2024). |
| Researcher Affiliation | Collaboration | James Liu (1,2), Pragaash Ponnusamy (2), Tianle Cai (3), Han Guo (1), Yoon Kim (1), Ben Athiwaratkun (2). 1: Massachusetts Institute of Technology; 2: Together AI; 3: Princeton University. Correspondence to EMAIL. Work done during an internship at Together AI. |
| Pseudocode | Yes | Algorithm 1 Block-wise Greedy Optimization |
| Open Source Code | Yes | https://github.com/FasterDecoding/TEAL |
| Open Datasets | Yes | We collect activations of Llama-3-8B (Dubey et al., 2024) sampled from C4 (Raffel et al., 2023)... We measure the performance of sparsified models on language modeling using the WikiText (Merity et al., 2016) validation set, and on an aggregate of six downstream tasks using the EleutherAI LM Harness (Gao et al., 2023)... |
| Dataset Splits | Yes | For language modeling, we evaluate all models on the same 128 random samples, using a 2048-token context and 512-token evaluation window. ... We evaluate on WikiText and use the greedily optimized sparsities described in Section 4.3. ... We use the standard inference benchmarking setup in GPT-Fast, which passes in roughly 5 input tokens and generates at most 200 output tokens. |
| Hardware Specification | Yes | Figure 3 shows a small speed-up on A6000, and a larger speed-up on A100 over the Deja Vu kernel. ... Cost. ...less than one GPU-hour on an A100 for Llama-3-8B. ... We utilize tensor parallelism for Llama-3-70B: TP2 for A100, and TP4 for A6000. Our GPU power limit settings are 500W and 300W for A100 and A6000 respectively. |
| Software Dependencies | No | The paper mentions a "Triton-based (Tillet et al., 2019) kernel" and "GPT-Fast (PyTorch, 2024)" but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For language modeling, we evaluate all models on the same 128 random samples, using a 2048-token context and 512-token evaluation window. ... We use the standard inference benchmarking setup in GPT-Fast, which passes in roughly 5 input tokens and generates at most 200 output tokens. ... Our GPU power limit settings are 500W and 300W for A100 and A6000 respectively. ... We fine-tune Llama-3-8B using LoRA (Hu et al., 2021) with a rank of 32 (approximately 1% of parameters are trainable) and a learning rate of 0.0002. The model is fine-tuned on 30M tokens from C4. |
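TEAL's core idea, training-free magnitude-based activation sparsity, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`calibrate_threshold`, `sparsify`) are hypothetical, and TEAL calibrates per-tensor thresholds from activation statistics collected on C4, which is approximated here with a simple quantile over a calibration sample.

```python
import numpy as np

def calibrate_threshold(acts: np.ndarray, sparsity: float) -> float:
    # Pick a magnitude cutoff so that roughly `sparsity` fraction of
    # calibration-activation entries fall below it.
    return float(np.quantile(np.abs(acts), sparsity))

def sparsify(x: np.ndarray, threshold: float) -> np.ndarray:
    # Zero out low-magnitude activations; larger-magnitude entries pass
    # through unchanged (no retraining involved, hence "training-free").
    return np.where(np.abs(x) < threshold, 0.0, x)
```

At inference time, zeroed activations let the matching weight columns be skipped, which is where the reported decoding speedups come from.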
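The pseudocode row cites "Algorithm 1: Block-wise Greedy Optimization", which allocates non-uniform sparsity levels across groups (e.g. matrices within a block) to hit a target average. A generic sketch of that greedy pattern, assuming a hypothetical per-group error proxy `error_fn` (not the paper's actual objective), might look like:

```python
def greedy_allocate(error_fn, n_groups, target_avg, step=0.05):
    # Greedily raise the sparsity level of whichever group incurs the
    # smallest marginal error increase, until the average level across
    # all groups reaches target_avg.
    levels = [0.0] * n_groups
    while sum(levels) / n_groups < target_avg - 1e-9:
        best = min(
            range(n_groups),
            key=lambda g: error_fn(g, levels[g] + step) - error_fn(g, levels[g]),
        )
        levels[best] += step
    return levels
```

Under a convex error proxy, this drives groups toward equal marginal cost, so error-sensitive groups end up less sparse than robust ones while the average still meets the target.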
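The setup row mentions fine-tuning with LoRA at rank 32, where only about 1% of parameters are trainable. As a reminder of what that entails, here is a minimal NumPy sketch of a LoRA-augmented linear layer; the scaling convention (alpha over rank) follows Hu et al. (2021), but the function and variable names are illustrative, not the paper's training code.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=32):
    # The base weight W (d_out x d_in) stays frozen; only the low-rank
    # factors A (r x d_in) and B (d_out x r) are trained.
    # Output = x W^T + (alpha / r) * x A^T B^T.
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T
```

With B initialized to zeros (the standard choice), the adapted layer starts out identical to the frozen base model, so fine-tuning only gradually perturbs the sparsified network.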