Training-Free Activation Sparsity in Large Language Models
Authors: James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TEAL on the Mistral (Jiang et al., 2023), Llama-2 (Touvron et al., 2023), and Llama-3 (Dubey et al., 2024) families. We measure the performance of sparsified models on language modeling using the WikiText (Merity et al., 2016) validation set, and on an aggregate of six downstream tasks using the EleutherAI LM Harness (Gao et al., 2023)... Main Results. TEAL is performant, as shown in Tables 1 and 2, showcasing near-zero degradation at 25%, and minimal degradation at 40% sparsity. ... We benchmark TEAL's end-to-end single-batch decoding latency by integrating it with GPT-Fast (PyTorch, 2024). |
| Researcher Affiliation | Collaboration | James Liu (1,2), Pragaash Ponnusamy (2), Tianle Cai (3), Han Guo (1), Yoon Kim (1), Ben Athiwaratkun (2). 1: Massachusetts Institute of Technology; 2: Together AI; 3: Princeton University. Correspondence to EMAIL. Work done during an internship at Together AI. |
| Pseudocode | Yes | Algorithm 1 Block-wise Greedy Optimization |
| Open Source Code | Yes | https://github.com/FasterDecoding/TEAL |
| Open Datasets | Yes | We collect activations of Llama-3-8B (Dubey et al., 2024) sampled from C4 (Raffel et al., 2023)... We measure the performance of sparsified models on language modeling using the WikiText (Merity et al., 2016) validation set, and on an aggregate of six downstream tasks using the EleutherAI LM Harness (Gao et al., 2023)... |
| Dataset Splits | Yes | For language modeling, we evaluate all models on the same 128 random samples, using a 2048-token context and 512-token evaluation window. ... We evaluate on WikiText and use the greedily optimized sparsities described in Section 4.3. ... We use the standard inference benchmarking setup in GPT-Fast, which passes in roughly 5 input tokens and generates at most 200 output tokens. |
| Hardware Specification | Yes | Figure 3 shows a small speed-up on A6000, and a larger speed-up on A100 over the Deja Vu kernel. ... Cost. ...less than one GPU-hour on an A100 for Llama-3-8B. ... We utilize tensor parallelism for Llama-3-70B: TP2 for A100, and TP4 for A6000. Our GPU power limit settings are 500W and 300W for A100 and A6000 respectively. |
| Software Dependencies | No | The paper mentions a "Triton-based (Tillet et al., 2019) kernel" and "GPT-Fast (PyTorch, 2024)" but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For language modeling, we evaluate all models on the same 128 random samples, using a 2048-token context and 512-token evaluation window. ... We use the standard inference benchmarking setup in GPT-Fast, which passes in roughly 5 input tokens and generates at most 200 output tokens. ... Our GPU power limit settings are 500W and 300W for A100 and A6000 respectively. ... We fine-tune Llama-3-8B using LoRA (Hu et al., 2021) with a rank of 32 (approximately 1% of parameters are trainable) and a learning rate of 0.0002. The model is fine-tuned on 30M tokens from C4. |
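TEAL's core idea, training-free magnitude-based activation sparsity, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`calibrate_threshold`, `sparsify`) are hypothetical, and TEAL calibrates per-tensor thresholds from activation statistics collected on C4, which is approximated here with a simple quantile over a calibration sample.

```python
import numpy as np

def calibrate_threshold(acts: np.ndarray, sparsity: float) -> float:
    # Pick a magnitude cutoff so that roughly `sparsity` fraction of
    # calibration-activation entries fall below it.
    return float(np.quantile(np.abs(acts), sparsity))

def sparsify(x: np.ndarray, threshold: float) -> np.ndarray:
    # Zero out low-magnitude activations; larger-magnitude entries pass
    # through unchanged (no retraining involved, hence "training-free").
    return np.where(np.abs(x) < threshold, 0.0, x)
```

At inference time, zeroed activations let the matching weight columns be skipped, which is where the reported decoding speedups come from.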
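The pseudocode row cites "Algorithm 1: Block-wise Greedy Optimization", which allocates non-uniform sparsity levels across groups (e.g. matrices within a block) to hit a target average. A generic sketch of that greedy pattern, assuming a hypothetical per-group error proxy `error_fn` (not the paper's actual objective), might look like:

```python
def greedy_allocate(error_fn, n_groups, target_avg, step=0.05):
    # Greedily raise the sparsity level of whichever group incurs the
    # smallest marginal error increase, until the average level across
    # all groups reaches target_avg.
    levels = [0.0] * n_groups
    while sum(levels) / n_groups < target_avg - 1e-9:
        best = min(
            range(n_groups),
            key=lambda g: error_fn(g, levels[g] + step) - error_fn(g, levels[g]),
        )
        levels[best] += step
    return levels
```

Under a convex error proxy, this drives groups toward equal marginal cost, so error-sensitive groups end up less sparse than robust ones while the average still meets the target.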
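The setup row mentions fine-tuning with LoRA at rank 32, where only about 1% of parameters are trainable. As a reminder of what that entails, here is a minimal NumPy sketch of a LoRA-augmented linear layer; the scaling convention (alpha over rank) follows Hu et al. (2021), but the function and variable names are illustrative, not the paper's training code.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=32):
    # The base weight W (d_out x d_in) stays frozen; only the low-rank
    # factors A (r x d_in) and B (d_out x r) are trained.
    # Output = x W^T + (alpha / r) * x A^T B^T.
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T
```

With B initialized to zeros (the standard choice), the adapted layer starts out identical to the frozen base model, so fine-tuning only gradually perturbs the sparsified network.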