SCBench: A KV Cache-Centric Analysis of Long-Context Methods
Authors: Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The evaluation is conducted on six Transformer-based long-context LLMs: Llama-3.1-8B/70B, Qwen2.5-72B/32B, Llama-3-8B-262K, and GLM-4-9B. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n²) pre-filling computation perform robustly. ... Our experimental results reveal the following insights: 1) Sub-O(n) memory is almost infeasible in multi-turn decoding, as shown in Fig. 3. Sparse decoding methods (sub-O(n) memory) perform well on the first query but lose accuracy in subsequent requests. ... Main Results Tables 4, 10, and Fig. 9 present the performance of various long-context methods across tasks and shared context modes in different base LLMs. |
| Researcher Affiliation | Collaboration | Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu. Microsoft Corporation; University of Surrey. |
| Pseudocode | No | The paper describes various methods (e.g., A-shape, Tri-shape, MInference, Streaming LLM) in prose and presents their configurations in Table 9, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper provides a general project URL (https://aka.ms/SCBench) and mentions several third-party open-source tools (e.g., vLLM, Hugging Face, Flash Attention-2) and their versions, as well as modifications to some kernels. However, it does not contain an unambiguous statement that the authors are releasing the source code specifically for their proposed methodology (SCBench or Tri-shape) nor a direct link to a code repository for their work. |
| Open Datasets | Yes | We use datasets from Big-Bench Hard (Srivastava et al., 2023) to evaluate many-shot ICL capabilities. This includes three sub-tasks: date understanding, salient error translation detection, and tracking seven shuffled objects. ... We extended the math find task from Infinite Bench (Zhang et al., 2024a), expanding it from finding only the maximum value to multiple statistical values. ... Unlike the original Repo QA benchmark (Liu et al., 2024c), our inputs extend to 64K tokens... |
| Dataset Splits | Yes | SCBench features 12 tasks assessing four long-context abilities: string retrieval, semantic retrieval, global information processing, and multi-tasking, across two shared context modes multi-turn and multi-request. ... In total, SCBench includes 931 multi-turn sessions with 4,853 queries, averaging 5 turns per session. Task statistics are provided in Table 2, with examples and configurations in Table 3. |
| Hardware Specification | Yes | For stability, all experiments used greedy decoding in BFloat16 on four NVIDIA A100 GPUs. ... Specifically, we use tensor parallel when testing models larger than 7B parameters, with 8*A100 40GB machines or 4*H100 80GB machines. |
| Software Dependencies | Yes | We evaluated models via Hugging Face or vLLM with Flash Attention-2 (Dao, 2024) and leveraged MInference (Jiang et al., 2024) to reduce GPU memory overhead. More details on these models and infrastructure are in D.1. ... vLLM-0.52 is used as the inference framework in our testing, and the flash_attn-2.5 kernels were overwritten with our own kernels. For KV cache compression, our implementation is based on the Hugging Face implementation of Sink Cache for StreamingLLM and the corresponding official implementations. For SSMs and Mamba-attention hybrid models, we use the Triton version of the Mamba kernels together with causal-conv1d-1.4. |
| Experiment Setup | Yes | Models & Implementation Details We selected six open-source long-context LLMs: Llama3.1-8B/70B (Dubey et al., 2024), Qwen2.5-72B/32B (Team, 2024), Llama-3-8B-262K (Gradient, 2024), and GLM-4-9B-1M (GLM et al., 2024), along with two gated linear models: Codestral Mamba 7B (Team, 2024) and Jamba-1.5-Mini (Lieber et al., 2024). ... All methods were tested on Transformer-based long-context LLMs, except Codestral-Mamba and Jamba. We also report KV cache size, pre-filling and decoding complexity, and whether efficient operations are applied (details in 2). More details are shown in D.2. ... Table 9: Configurations of long-context methods in SCBench. ... Sparse Attention Tri-Shape num local: 4096, num initial: 128, num dense rows: 128 ... LLMLingua-2 compression rate: 0.333 |
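The Tri-shape configuration quoted above (num local: 4096, num initial: 128, num dense rows: 128) can be pictured as a boolean attention mask. The sketch below is an illustrative reconstruction, not the authors' released code: the function name `tri_shape_mask` and the exact mask-building order are assumptions, but the three components follow the configuration's naming (initial sink columns, a local sliding window, and a dense block of final query rows).

```python
import numpy as np

def tri_shape_mask(seq_len, num_initial=128, num_local=4096, num_dense_rows=128):
    """Boolean mask for a Tri-shape sparse attention pattern (hypothetical sketch).

    Each query attends to the first `num_initial` tokens (attention sinks) and
    a sliding window of the last `num_local` tokens; the final `num_dense_rows`
    queries attend densely to the whole prefix. Causality is enforced at the end.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    mask[:, :num_initial] = True                       # sink columns
    for q in range(seq_len):                           # local sliding window
        mask[q, max(0, q - num_local + 1):q + 1] = True
    mask[seq_len - num_dense_rows:, :] = True          # dense bottom rows
    return np.tril(mask)                               # causal constraint

# Tiny toy instance (real config: 128 / 4096 / 128 on contexts up to ~1M tokens)
m = tri_shape_mask(seq_len=16, num_initial=2, num_local=4, num_dense_rows=3)
```

The dense bottom rows are what distinguish Tri-shape from the A-shape pattern also mentioned in the paper: the last few queries see the full context, which helps the first response in a multi-turn session.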
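The finding that sub-O(n) memory methods "perform well on the first query but lose accuracy in subsequent requests" follows directly from how sink-style KV caches evict state. This toy class (an illustrative sketch, not the paper's or Hugging Face's implementation; the name `SinkKVCache` is invented) shows the mechanism: memory stays constant, so mid-context tokens needed by a later turn are simply gone.

```python
from collections import deque

class SinkKVCache:
    """Toy StreamingLLM-style KV cache with O(num_sink + window) memory.

    Keeps the first `num_sink` entries forever (attention sinks) plus a
    rolling window of the most recent `window` entries. Everything in
    between is evicted, which is why a second-turn query about earlier
    context can no longer be answered from the cache.
    """
    def __init__(self, num_sink=4, window=8):
        self.num_sink = num_sink
        self.sinks = []                      # never evicted
        self.recent = deque(maxlen=window)   # auto-evicts oldest entry

    def append(self, kv):
        if len(self.sinks) < self.num_sink:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def visible(self):
        return self.sinks + list(self.recent)

cache = SinkKVCache(num_sink=2, window=3)
for token in range(10):
    cache.append(token)
# Only the two sinks and the last three tokens remain visible.
```

An O(n)-memory method would instead retain all ten entries, which is the trade-off SCBench's multi-turn mode is designed to expose.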