MagicPIG: LSH Sampling for Efficient LLM Generation

Authors: Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Section 5, we show the empirical evaluation results of the performance of MAGICPIG, demonstrating the accuracy and efficiency. While maintaining high accuracy for diverse tasks, MAGICPIG can improve serving throughput by 1.5–5× (A100, L20, RTX 4090) and can achieve 54ms decoding latency on a single RTX 4090 for Llama-3.1-8B-Instruct (Dubey et al., 2024) with 96K context.
Researcher Affiliation | Collaboration | Zhuoming Chen1, Ranajoy Sadhukhan1, Zihao Ye2, Yang Zhou1, Jianyu Zhang3,4, Niklas Nolte4, Yuandong Tian4, Matthijs Douze4, Leon Bottou4, Zhihao Jia1, Beidi Chen1. 1 Carnegie Mellon University; 2 University of Washington; 3 New York University; 4 Meta AI
Pseudocode | Yes | Algorithm 1: MAGICPIG Decoding
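The paper's Algorithm 1 is not reproduced in this report. As a hedged illustration of the idea that row names, the sketch below (hypothetical function and variable names, not the authors' code) shows how attention restricted to an LSH-sampled subset of keys can be debiased by each sampled key's inclusion probability, in the style of self-normalized importance sampling:

```python
import numpy as np

def sampled_attention(q, keys, values, idx, p):
    """Estimate softmax attention from a sampled subset of keys.

    idx: indices of keys returned by some LSH lookup (assumed given here);
    p:   each sampled key's sampling probability, used as an
         importance-sampling correction. Hypothetical sketch, not the
         paper's Algorithm 1.
    """
    scores = keys[idx] @ q                 # dot products with sampled keys only
    w = np.exp(scores - scores.max()) / p  # softmax numerator / sampling prob
    w /= w.sum()                           # self-normalized estimator
    return w @ values[idx]

# Sanity check: if every key is "sampled" with probability 1, the
# estimate reduces to exact full softmax attention.
rng = np.random.default_rng(0)
n, d = 32, 16
q = rng.standard_normal(d)
keys = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))

full_scores = keys @ q
full_w = np.exp(full_scores - full_scores.max())
full = (full_w / full_w.sum()) @ values

est = sampled_attention(q, keys, values, np.arange(n), np.ones(n))
```

The sanity check is the useful property: the estimator degrades gracefully toward exact attention as the sampled set grows.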
Open Source Code | No | The paper does not contain an explicit statement about releasing the code for the methodology, nor does it provide a specific link to a code repository.
Open Datasets | Yes | Our experiments are based on Llama (AI@Meta, 2024; Dubey et al., 2024; Touvron et al., 2023) models. Three types of tasks are included, which are 3 mid-context comprehensive tasks from lm-eval-harness (Gao et al., 2021) (GSM8K-CoT (Cobbe et al., 2021), MMLU-Flan-CoT-Fewshot (Hendrycks et al., 2020) and COQA (Reddy et al., 2019)), 6 long-context tasks from (Bai et al., 2023) (QASPER (Dasigi et al., 2021), LCC, Repobench-P (Liu et al., 2023), TriviaQA (Joshi et al., 2017), PRE and TREC (Li & Roth, 2002; Hovy et al., 2001)) and 13 synthetic tasks from RULER (Hsieh et al., 2024) (with 50 examples per task).
Dataset Splits | Yes | The initial 4 tokens and local 64 (for LongBench (Bai et al., 2023) and RULER (Hsieh et al., 2024)) or 24 (for lm-eval-harness (Gao et al., 2021)) tokens as well as layer-{0,16} are statically preserved. ... 13 synthetic tasks from RULER (Hsieh et al., 2024) (with 50 examples per task).
Hardware Specification | Yes | We evaluate our system performance on 3 serving settings. (1) 80GB GPU (A100) and 34B model (CodeLlama-34B) (Rozière et al., 2024) with 16K contexts; (2) 48GB GPU (L20) and 13B model (CodeLlama-13B) (Rozière et al., 2024) with 16K contexts; (3) 24GB GPU (e.g. RTX 4090) and 8B model (Llama-3.1-8B) (Dubey et al., 2024) with 96K contexts. ... Our CPU is Intel Platinum 8480+ for A100 and Intel 8563C for L20.
Software Dependencies | No | Our system's GPU part is implemented in native PyTorch (Paszke et al., 2019) and the CPU part in FBGEMM (Khudia et al., 2021) in bfloat16 precision. While software names are mentioned, specific version numbers for PyTorch and FBGEMM are not provided.
Experiment Setup | Yes | Baselines. Besides full attention, Quest (Tang et al., 2024) and its variants are used as baselines. In its default setting, Quest uses a page size of 16, i.e. 1/16 of the full attention cost. To compare the methods fairly in the low computation budget regime, we also evaluate Quest with page sizes 32 and 64 and make sure at least one page is selected in every test example. The initial 4 tokens and local 64 (for LongBench (Bai et al., 2023) and RULER (Hsieh et al., 2024)) or 24 (for lm-eval-harness (Gao et al., 2021)) tokens as well as layer-{0,16} are statically preserved. ... The config (K, L) is a hyperparameter of LSH for MAGICPIG, or page size and ratio of selected pages for Quest (Tang et al., 2024).
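For context on the (K, L) configuration the row above mentions: in LSH, K conventionally denotes the number of hash bits per table and L the number of independent tables. Below is a minimal SimHash-style sketch of that setup; all names and the "collides in at least one table" retrieval rule are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_planes(d, K, L):
    """L independent tables, each defined by K random hyperplanes (SimHash)."""
    return rng.standard_normal((L, K, d))

def hash_codes(vecs, planes):
    """Pack the K sign bits per table into one integer code per vector."""
    bits = np.einsum('lkd,nd->lnk', planes, vecs) > 0       # (L, n, K)
    weights = 1 << np.arange(planes.shape[1])               # bit weights
    return bits.astype(np.int64) @ weights                  # (L, n)

def retrieve(query, keys, planes):
    """Indices of keys whose code matches the query's in >= 1 table.

    Assumed retrieval rule for illustration only; larger K makes each
    table more selective, larger L adds more chances to collide.
    """
    key_codes = hash_codes(keys, planes)                    # (L, n)
    q_codes = hash_codes(query[None, :], planes)[:, 0]      # (L,)
    collide = (key_codes == q_codes[:, None]).any(axis=0)   # (n,)
    return np.flatnonzero(collide)

d, n, K, L = 64, 1000, 8, 4
keys = rng.standard_normal((n, d))
planes = make_planes(d, K, L)
query = keys[0]            # a vector identical to key 0 always collides with it
hits = retrieve(query, keys, planes)
```

The trade-off the paper tunes is visible here: (K, L) jointly control how many keys collide with the query and hence the per-token attention budget.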