SCBench: A KV Cache-Centric Analysis of Long-Context Methods
Authors: Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The evaluation is conducted on six Transformer-based long-context LLMs: Llama-3.1-8B/70B, Qwen2.5-72B/32B, Llama-3-8B-262K, and GLM-4-9B. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n²) pre-filling computation perform robustly. ... Our experimental results reveal the following insights: 1) Sub-O(n) memory is almost infeasible in multi-turn decoding, as shown in Fig. 3. Sparse decoding methods (sub-O(n) memory) perform well on the first query but lose accuracy in subsequent requests. ... Main Results Tables 4, 10, and Fig. 9 present the performance of various long-context methods across tasks and shared context modes in different base LLMs. |
| Researcher Affiliation | Collaboration | Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu. Microsoft Corporation; University of Surrey. |
| Pseudocode | No | The paper describes various methods (e.g., A-shape, Tri-shape, MInference, Streaming LLM) in prose and presents their configurations in Table 9, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper provides a general project URL (https://aka.ms/SCBench) and mentions several third-party open-source tools (e.g., vLLM, Hugging Face, Flash Attention-2) and their versions, as well as modifications to some kernels. However, it does not contain an unambiguous statement that the authors are releasing the source code specifically for their proposed methodology (SCBench or Tri-shape) nor a direct link to a code repository for their work. |
| Open Datasets | Yes | We use datasets from Big-Bench Hard (Srivastava et al., 2023) to evaluate many-shot ICL capabilities. This includes three sub-tasks: date understanding, salient error translation detection, and tracking seven shuffled objects. ... We extended the math find task from Infinite Bench (Zhang et al., 2024a), expanding it from finding only the maximum value to multiple statistical values. ... Unlike the original Repo QA benchmark (Liu et al., 2024c), our inputs extend to 64K tokens... |
| Dataset Splits | Yes | SCBench features 12 tasks assessing four long-context abilities: string retrieval, semantic retrieval, global information processing, and multi-tasking, across two shared context modes multi-turn and multi-request. ... In total, SCBench includes 931 multi-turn sessions with 4,853 queries, averaging 5 turns per session. Task statistics are provided in Table 2, with examples and configurations in Table 3. |
| Hardware Specification | Yes | For stability, all experiments used greedy decoding in BFloat16 on four NVIDIA A100 GPUs. ... Specifically, we use tensor parallel when testing models larger than 7B parameters, with 8*A100 40GB machines or 4*H100 80GB machines. |
| Software Dependencies | Yes | We evaluated models via Hugging Face or vLLM with Flash Attention-2 (Dao, 2024) and leveraged MInference (Jiang et al., 2024) to reduce GPU memory overhead. More details on these models and infrastructure are in D.1. ... vLLM-0.52 is used as the inference framework in our testing, and the flash_attn-2.5 kernels were overwritten with our own kernels. For KV cache compression, our implementation is based on the Hugging Face implementation of Sink Cache for StreamingLLM and the corresponding official implementations. For SSMs and Mamba-attention hybrid models, we use the Triton version of the Mamba kernels together with causal-conv1d-1.4. |
| Experiment Setup | Yes | Models & Implementation Details We selected six open-source long-context LLMs: Llama3.1-8B/70B (Dubey et al., 2024), Qwen2.5-72B/32B (Team, 2024), Llama-3-8B-262K (Gradient, 2024), and GLM-4-9B-1M (GLM et al., 2024), along with two gated linear models: Codestral Mamba 7B (Team, 2024) and Jamba-1.5-Mini (Lieber et al., 2024). ... All methods were tested on Transformer-based long-context LLMs, except Codestral-Mamba and Jamba. We also report KV cache size, pre-filling and decoding complexity, and whether efficient operations are applied (details in 2). More details are shown in D.2. ... Table 9: Configurations of long-context methods in SCBench. ... Sparse Attention Tri-Shape num local: 4096, num initial: 128, num dense rows: 128 ... LLMLingua-2 compression rate: 0.333 |
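The Tri-shape configuration quoted above (num local: 4096, num initial: 128, num dense rows: 128) can be pictured as a boolean attention mask. The sketch below is an illustrative reconstruction, not the authors' released code: the function name `tri_shape_mask` and the exact mask-building order are assumptions, but the three components follow the configuration's naming (initial sink columns, a local sliding window, and a dense block of final query rows).

```python
import numpy as np

def tri_shape_mask(seq_len, num_initial=128, num_local=4096, num_dense_rows=128):
    """Boolean mask for a Tri-shape sparse attention pattern (hypothetical sketch).

    Each query attends to the first `num_initial` tokens (attention sinks) and
    a sliding window of the last `num_local` tokens; the final `num_dense_rows`
    queries attend densely to the whole prefix. Causality is enforced at the end.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    mask[:, :num_initial] = True                       # sink columns
    for q in range(seq_len):                           # local sliding window
        mask[q, max(0, q - num_local + 1):q + 1] = True
    mask[seq_len - num_dense_rows:, :] = True          # dense bottom rows
    return np.tril(mask)                               # causal constraint

# Tiny toy instance (real config: 128 / 4096 / 128 on contexts up to ~1M tokens)
m = tri_shape_mask(seq_len=16, num_initial=2, num_local=4, num_dense_rows=3)
```

The dense bottom rows are what distinguish Tri-shape from the A-shape pattern also mentioned in the paper: the last few queries see the full context, which helps the first response in a multi-turn session.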
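The finding that sub-O(n) memory methods "perform well on the first query but lose accuracy in subsequent requests" follows directly from how sink-style KV caches evict state. This toy class (an illustrative sketch, not the paper's or Hugging Face's implementation; the name `SinkKVCache` is invented) shows the mechanism: memory stays constant, so mid-context tokens needed by a later turn are simply gone.

```python
from collections import deque

class SinkKVCache:
    """Toy StreamingLLM-style KV cache with O(num_sink + window) memory.

    Keeps the first `num_sink` entries forever (attention sinks) plus a
    rolling window of the most recent `window` entries. Everything in
    between is evicted, which is why a second-turn query about earlier
    context can no longer be answered from the cache.
    """
    def __init__(self, num_sink=4, window=8):
        self.num_sink = num_sink
        self.sinks = []                      # never evicted
        self.recent = deque(maxlen=window)   # auto-evicts oldest entry

    def append(self, kv):
        if len(self.sinks) < self.num_sink:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def visible(self):
        return self.sinks + list(self.recent)

cache = SinkKVCache(num_sink=2, window=3)
for token in range(10):
    cache.append(token)
# Only the two sinks and the last three tokens remain visible.
```

An O(n)-memory method would instead retain all ten entries, which is the trade-off SCBench's multi-turn mode is designed to expose.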