Training-Free Exponential Context Extension via Cascading KV Cache

Authors: Jeff Willette, Heejun Lee, Youngwan Lee, Myeongjae Jeon, Sung Ju Hwang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our approach outperforms linear caching baselines across key benchmarks, including streaming perplexity, question answering, book summarization, and passkey retrieval, where it retains better retrieval accuracy at 1M tokens, four doublings beyond its 65K cache size. Additionally, our method reduces prefill-stage latency by a factor of 6.8 compared to Flash Attention on 1M tokens.
Researcher Affiliation Collaboration Jeffrey Willette (KAIST AI), Heejun Lee (KAIST AI), Youngwan Lee (KAIST AI; ETRI, South Korea), Myeongjae Jeon (POSTECH, South Korea), Sung Ju Hwang (KAIST AI; Deepauto.ai, South Korea)
Pseudocode Yes Algorithm 1 Strided Prefill... Algorithm 2 Cascading Sink Cache Algorithm (repeat for keys and values)... Algorithm 3 Token Selection EMA Accumulation (in the context of Flash Attention 2 kernel)
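The paper's Algorithms 1-3 are not reproduced in this report, but the core idea of the cascading sink cache can be illustrated with a minimal, hypothetical Python sketch (class and parameter names are ours). A fixed set of attention-sink tokens is never evicted; newer tokens flow through a chain of fixed-size sub-caches, and each sub-cache beyond the first accepts only half of the tokens offered to it, so older history is kept at exponentially coarser granularity. For simplicity this sketch cascades the oldest token from each level, whereas the paper selects tokens to keep via an EMA of attention scores:

```python
class CascadingKVCache:
    """Sketch of a cascading sink cache (token indices stand in for KV pairs).

    Level 0 accepts every token; each deeper level accepts only every
    other token offered to it, so a level-i entry represents a stride of
    roughly 2**i positions. NOTE: a simplification of the paper's method,
    which chooses survivors by an EMA of attention scores, not by age.
    """

    def __init__(self, num_sinks=64, capacity=1024, num_cascades=4):
        self.num_sinks = num_sinks
        self.sinks = []                          # initial tokens, never evicted
        self.capacity = capacity                 # slots per sub-cache
        self.caches = [[] for _ in range(num_cascades)]
        self.offered = [0] * num_cascades        # per-level acceptance counters

    def add(self, token):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token)             # fill the attention sinks first
        else:
            self._offer(0, token)

    def _offer(self, level, token):
        if level >= len(self.caches):
            return                               # token falls off the cascade
        self.offered[level] += 1
        if level > 0 and self.offered[level] % 2 == 0:
            return                               # deeper levels accept only half
        cache = self.caches[level]
        if len(cache) == self.capacity:
            self._offer(level + 1, cache.pop(0)) # cascade the evicted token down
        cache.append(token)
```

With tiny sizes (2 sinks, 2 slots per level, 2 levels) and tokens 0-9, the cache ends up holding sinks [0, 1], mid-history [4, 6] at stride 2, and the freshest tokens [8, 9], showing the exponential thinning of older context.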
Open Source Code Yes In order to aid in reproducibility of our experiments, we provide our code which is available online2. 2https://github.com/jeffwillette/cascading_kv_cache
Open Datasets Yes We conduct experiments on streaming books (PG19) (Rae et al., 2019), Long Context Understanding (LongBench) (Bai et al., 2023b), book summarization (BookSum) (Kryściński et al., 2021), and 1M token passkey retrieval.
Dataset Splits Yes We measure perplexity on the PG19 test set (full-length books). Each book is streamed independently from start to finish without concatenation. ... For passkey retrieval, we evaluate Streaming LLM and our Cascading KV Cache by generating a random 5-digit passkey hidden at a random point in the total sequence length. ... We evaluate our method against other linear-scaling models on the same subset of tasks as Xiao et al. (2023) in the LongBench long context understanding benchmark (Bai et al., 2023b).
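The passkey setup described above (a random 5-digit key buried at a random depth in filler text) is straightforward to sketch. The function name, filler sentence, and prompt wording below are our assumptions, not the paper's exact format:

```python
import random

def make_passkey_sample(n_filler=1000, seed=0):
    """Hypothetical passkey-retrieval sample: hide a random 5-digit key
    at a random position among repeated filler sentences, then ask for it.
    (Filler text and prompt phrasing are illustrative, not the paper's.)
    """
    rng = random.Random(seed)
    passkey = f"{rng.randint(0, 99999):05d}"
    filler = ["The grass is green. The sky is blue."] * n_filler
    pos = rng.randint(0, n_filler)               # random hiding depth
    filler.insert(pos, f"The pass key is {passkey}. Remember it.")
    prompt = " ".join(filler) + " What is the pass key?"
    return prompt, passkey
```

Accuracy is then simply whether the model's answer contains the hidden key; sweeping the hiding depth and total length probes retention across the whole cache span.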
Hardware Specification Yes For experiments utilizing 8B model sizes, we use one NVIDIA A6000 (49GB) GPU, and for experiments utilizing 70B model sizes, we utilize 4 NVIDIA A100 GPUs.
Software Dependencies No Our implementation utilizes circular buffers and the Triton compiler (Tillet et al., 2019) to create an efficient CUDA kernel for the caching operation. ... our method utilizes a modified Triton (Tillet et al., 2019) Flash Attention 2 kernel.
Experiment Setup Yes In all our experiments, we keep the first 64 initial tokens as attention sinks. ... We set the EMA parameter γ = 0.9999. We use the cascade token acceptance policy depicted in Figure 4 and Equation (4), where each successive sub-cache accepts half of the tokens from the previous sub-cache. Unless otherwise indicated, our models use four cascading sub-caches. ... We use three cache sizes of 16K, 32K, and 65K with a strided prefill of 1K.
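As a back-of-the-envelope check on the setup above: if each successive sub-cache accepts half of the tokens from the previous one, sub-cache i effectively stores every 2^i-th token, so a fixed cache budget covers an exponentially longer history as sub-caches are added. The helper below (the name `effective_span` and the even split of slots across sub-caches are our assumptions; the paper's exact sizing follows its Equation 4) makes the arithmetic concrete:

```python
def effective_span(total_cache, num_cascades):
    """History (in tokens) representable by `total_cache` slots split
    evenly across `num_cascades` sub-caches, where sub-cache i keeps
    every 2**i-th token. Assumption: even split; the paper may size
    sub-caches differently.
    """
    per_cache = total_cache // num_cascades
    return sum(per_cache * 2**i for i in range(num_cascades))

# One linear cache of 65536 slots spans exactly 65536 tokens, while the
# same budget split over four cascades spans 16384 * (1+2+4+8) = 245760.
```

Under these illustrative assumptions, the four-cascade 65K configuration spans roughly 3.75x the history of a same-size linear cache; the span grows with each additional cascade at no extra memory cost.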