Training-Free Exponential Context Extension via Cascading KV Cache

Authors: Jeff Willette, Heejun Lee, Youngwan Lee, Myeongjae Jeon, Sung Ju Hwang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our approach outperforms linear caching baselines across key benchmarks, including streaming perplexity, question answering, book summarization, and passkey retrieval, where it retains better retrieval accuracy at 1M tokens, four doublings beyond its 65K cache size. Additionally, our method reduces prefill-stage latency by a factor of 6.8 compared to Flash Attention on 1M tokens.
Researcher Affiliation Collaboration Jeffrey Willette (KAIST AI), Heejun Lee (KAIST AI), Youngwan Lee (KAIST AI; ETRI, South Korea), Myeongjae Jeon (POSTECH, South Korea), Sung Ju Hwang (KAIST AI; Deepauto.ai, South Korea)
Pseudocode Yes Algorithm 1 Strided Prefill... Algorithm 2 Cascading Sink Cache Algorithm (repeat for keys and values)... Algorithm 3 Token Selection EMA Accumulation (in the context of Flash Attention 2 kernel)
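The paper's Algorithms 1-3 are not reproduced in this report, but the core idea of the cascading sink cache can be illustrated with a minimal, hypothetical Python sketch (class and parameter names are ours). A fixed set of attention-sink tokens is never evicted; newer tokens flow through a chain of fixed-size sub-caches, and each sub-cache beyond the first accepts only half of the tokens offered to it, so older history is kept at exponentially coarser granularity. For simplicity this sketch cascades the oldest token from each level, whereas the paper selects tokens to keep via an EMA of attention scores:

```python
class CascadingKVCache:
    """Sketch of a cascading sink cache (token indices stand in for KV pairs).

    Level 0 accepts every token; each deeper level accepts only every
    other token offered to it, so a level-i entry represents a stride of
    roughly 2**i positions. NOTE: a simplification of the paper's method,
    which chooses survivors by an EMA of attention scores, not by age.
    """

    def __init__(self, num_sinks=64, capacity=1024, num_cascades=4):
        self.num_sinks = num_sinks
        self.sinks = []                          # initial tokens, never evicted
        self.capacity = capacity                 # slots per sub-cache
        self.caches = [[] for _ in range(num_cascades)]
        self.offered = [0] * num_cascades        # per-level acceptance counters

    def add(self, token):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token)             # fill the attention sinks first
        else:
            self._offer(0, token)

    def _offer(self, level, token):
        if level >= len(self.caches):
            return                               # token falls off the cascade
        self.offered[level] += 1
        if level > 0 and self.offered[level] % 2 == 0:
            return                               # deeper levels accept only half
        cache = self.caches[level]
        if len(cache) == self.capacity:
            self._offer(level + 1, cache.pop(0)) # cascade the evicted token down
        cache.append(token)
```

With tiny sizes (2 sinks, 2 slots per level, 2 levels) and tokens 0-9, the cache ends up holding sinks [0, 1], mid-history [4, 6] at stride 2, and the freshest tokens [8, 9], showing the exponential thinning of older context.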
Open Source Code Yes In order to aid in reproducibility of our experiments, we provide our code which is available online2. 2https://github.com/jeffwillette/cascading_kv_cache
Open Datasets Yes We conduct experiments on streaming books (PG19) (Rae et al., 2019), Long Context Understanding (LongBench) (Bai et al., 2023b), book summarization (BookSum) (Kryściński et al., 2021), and 1M token passkey retrieval.
Dataset Splits Yes We measure perplexity on the PG19 test set (full-length books). Each book is streamed independently from start to finish without concatenation. ... For passkey retrieval, we evaluate Streaming LLM and our Cascading KV Cache by generating a random 5-digit passkey hidden at a random point in the total sequence length. ... We evaluate our method against other linear-scaling models on the same subset of tasks as Xiao et al. (2023) in the LongBench long context understanding benchmark (Bai et al., 2023b).
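The passkey setup described above (a random 5-digit key buried at a random depth in filler text) is straightforward to sketch. The function name, filler sentence, and prompt wording below are our assumptions, not the paper's exact format:

```python
import random

def make_passkey_sample(n_filler=1000, seed=0):
    """Hypothetical passkey-retrieval sample: hide a random 5-digit key
    at a random position among repeated filler sentences, then ask for it.
    (Filler text and prompt phrasing are illustrative, not the paper's.)
    """
    rng = random.Random(seed)
    passkey = f"{rng.randint(0, 99999):05d}"
    filler = ["The grass is green. The sky is blue."] * n_filler
    pos = rng.randint(0, n_filler)               # random hiding depth
    filler.insert(pos, f"The pass key is {passkey}. Remember it.")
    prompt = " ".join(filler) + " What is the pass key?"
    return prompt, passkey
```

Accuracy is then simply whether the model's answer contains the hidden key; sweeping the hiding depth and total length probes retention across the whole cache span.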
Hardware Specification Yes For experiments utilizing 8B model sizes, we use one NVIDIA A6000 (49GB) GPU, and for experiments utilizing 70B model sizes, we utilize 4 NVIDIA A100 GPUs.
Software Dependencies No Our implementation utilizes circular buffers and the Triton compiler (Tillet et al., 2019) to create an efficient CUDA kernel for the caching operation. ... our method utilizes a modified Triton (Tillet et al., 2019) Flash Attention 2 kernel.
Experiment Setup Yes In all our experiments, we keep the first 64 initial tokens as attention sinks. ... We set the EMA parameter γ = 0.9999. We use the cascade token acceptance policy depicted in Figure 4 and Equation (4), where each successive sub-cache accepts half of the tokens from the previous sub-cache. Unless otherwise indicated, our models use four cascading sub-caches. ... We use three cache sizes of 16K, 32K, and 65K with a strided prefill of 1K.
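As a back-of-the-envelope check on the setup above: if each successive sub-cache accepts half of the tokens from the previous one, sub-cache i effectively stores every 2^i-th token, so a fixed cache budget covers an exponentially longer history as sub-caches are added. The helper below (the name `effective_span` and the even split of slots across sub-caches are our assumptions; the paper's exact sizing follows its Equation 4) makes the arithmetic concrete:

```python
def effective_span(total_cache, num_cascades):
    """History (in tokens) representable by `total_cache` slots split
    evenly across `num_cascades` sub-caches, where sub-cache i keeps
    every 2**i-th token. Assumption: even split; the paper may size
    sub-caches differently.
    """
    per_cache = total_cache // num_cascades
    return sum(per_cache * 2**i for i in range(num_cascades))

# One linear cache of 65536 slots spans exactly 65536 tokens, while the
# same budget split over four cascades spans 16384 * (1+2+4+8) = 245760.
```

Under these illustrative assumptions, the four-cascade 65K configuration spans roughly 3.75x the history of a same-size linear cache; the span grows with each additional cascade at no extra memory cost.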