EPIC: Efficient Position-Independent Caching for Serving Large Language Models
Authors: Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Zhang Qin, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that EPIC achieves up to 8× improvement in Time-To-First-Token (TTFT) and 7× throughput gains over existing systems, with negligible or no accuracy loss. We implement the EPIC serving system with the LegoLink algorithm based on one of the most widely used inference frameworks, vLLM (Kwon et al., 2023). We evaluate EPIC against the state-of-the-art CacheBlend (Yao et al., 2025) system across six tasks with distinct characteristics and three model architectures with diverse training recipes. |
| Researcher Affiliation | Collaboration | 1 SCS, Peking University; Key Lab of HCST (PKU), MOE, China. 2 School of Computer Science, Nanjing University, Nanjing, China. 3 Huawei Cloud, Shanghai, China. 4 Key Lab of HCST (PKU), MOE; SCS, Peking University, China. Correspondence to: Tao Xie <EMAIL>, Yizhou Shan <EMAIL>. |
| Pseudocode | No | Assuming that we have selected k tokens (k tokens from each chunk plus the user query) from a total of N prompt tokens, we recompute them as follows. First, we obtain the embedding matrix E (shape (k, d)) of the k tokens, where d is the hidden size. Second, at layer i, we compute the new Q, K, and V matrices (each of shape (k, d)) for these k tokens: Q = E·W_Q, K = E·W_K, V = E·W_V, where W_Q, W_K, and W_V are model parameters of shape (d, d). Third, we expand the K and V matrices by inserting the cached KV vectors of the N - k unselected tokens at the correct positions, forming K_exp and V_exp (both of shape (N, d)). Fourth, we compute the attention matrix A (shape (k, N)) by multiplying Q (shape (k, d)) with K_exp^T (shape (d, N)), allowing the k tokens to attend to all N tokens: A = softmax(Q·K_exp^T + MASK) (Eq. 1), where MASK ensures that the k tokens attend only to tokens before them. Finally, we multiply A (shape (k, N)) with V_exp (shape (N, d)) and project to obtain the output (the input to the next layer, shape (k, d)): O = (A·V_exp)·W_O, where W_O has shape (d, d). The paper describes the algorithm steps in paragraph form and provides a mathematical equation, but does not present a structured pseudocode or algorithm block. |
| Open Source Code | Yes | The code is available at: https://github.com/DerekHJH/epic. |
| Open Datasets | Yes | Following CacheBlend, we evaluate on four LongBench datasets (Bai et al., 2024): 2WikiMQA (multi-document question answering), MuSiQue (multi-document question answering), SAMSum (few-shot instruction following), and MultiNews (multi-document summarization). We also include HotpotQA (multi-document question answering) from LongBench... To evaluate long-context retrieval, we include the Needle-in-a-Haystack dataset (LLM)... Needle In A Haystack. https://github.com/gkamradt/LLMTest_NeedleInAHaystack. |
| Dataset Splits | No | All datasets contain 200 test cases, with the distribution of prompt (prefill) lengths and answer (decode) lengths shown in Figure 5. Synchronous workload: to evaluate the accuracy-latency trade-off without interference from concurrent requests, we process test cases sequentially, ensuring that each completes before the next begins. First, for each test case, we compile all immutable chunks to obtain their corresponding cache IDs. For the LongBench datasets (excluding SAMSum), we treat each document as a chunk. For SAMSum and Needle-in-a-Haystack, we split all immutable tokens into 512-token chunks. Second, we send a request containing the cache IDs of cached chunks along with the query to obtain the response. The paper mentions using "200 test cases" and describes how immutable tokens are chunked for processing, but does not specify explicit training, validation, or test splits for model training or evaluation. |
| Hardware Specification | Yes | Environment: we run experiments on a single NVIDIA A100 server with one A100-80GB GPU available. The server has a 128-core Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz with 2-way hyperthreading and 1TB DRAM. |
| Software Dependencies | Yes | Implementation: we implement EPIC based on vLLM 0.4.1 (Kwon et al., 2023), with 2K lines of Python code. We use Ubuntu 20.04 with Linux kernel 5.16.7 and CUDA 12.6. |
| Experiment Setup | Yes | Baselines: we compare LegoLink with the other three recomputation algorithms in Figure 4: FR, Naive, and CacheBlend (Yao et al., 2025). Additionally, we evaluate different variants of CacheBlend, denoted CacheBlend-r, where r is the ratio of tokens recomputed. Similarly, we evaluate different variants of LegoLink, denoted LegoLink-k, where k refers to each chunk's first k tokens. We send requests of varying context lengths with a fixed chunk size (512 tokens) synchronously to EPIC, yielding two observations from the results (Figure 9). |
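
The LegoLink-k selection described in the Experiment Setup and Dataset Splits rows (recompute each chunk's first k tokens plus the user query, with immutable tokens split into 512-token chunks) can be sketched as follows. This is an illustrative reconstruction from the paper's prose; the function name and signature are hypothetical, not EPIC's actual API.

```python
def legolink_select(chunk_lengths, query_len, k):
    """Return positions (in the full N-token prompt) of tokens to recompute
    under LegoLink-k: the first k tokens of every immutable chunk, plus all
    tokens of the user query, which follows the chunks and is always fresh.

    chunk_lengths: token count of each immutable chunk, in prompt order
    query_len:     number of user-query tokens appended after the chunks
    k:             tokens recomputed per chunk (clamped to the chunk length)
    """
    positions, offset = [], 0
    for chunk_len in chunk_lengths:
        # First k tokens of this chunk (or the whole chunk if shorter).
        positions.extend(range(offset, offset + min(k, chunk_len)))
        offset += chunk_len
    # The mutable user query is recomputed in full.
    positions.extend(range(offset, offset + query_len))
    return positions
```

For example, with two 512-token chunks and a 10-token query, LegoLink-16 selects 16 + 16 + 10 = 42 positions out of N = 1034, which is what makes the recomputation cost small relative to a full prefill.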
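
The recomputation steps quoted in the Pseudocode row can be condensed into a minimal single-head NumPy sketch: fresh Q/K/V for the k selected tokens, cached K/V spliced in at the correct positions to form K_exp and V_exp, then masked (k, N) attention. The function name, the additive-mask convention, and the omission of the 1/sqrt(d) scale are simplifying assumptions for illustration, not EPIC's vLLM-kernel implementation.

```python
import numpy as np

def recompute_selected_tokens(E, cached_K, cached_V, selected,
                              W_Q, W_K, W_V, W_O):
    """E: (k, d) embeddings of the k selected tokens; cached_K/cached_V:
    (N, d) cached key/value matrices for the full prompt; selected: sorted
    positions (length k) of the selected tokens in [0, N)."""
    k, d = E.shape
    N = cached_K.shape[0]

    # Step 2: fresh Q, K, V for the k selected tokens, each (k, d).
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V

    # Step 3: splice the fresh K/V rows into the cached matrices at the
    # selected positions, forming K_exp and V_exp of shape (N, d).
    K_exp, V_exp = cached_K.copy(), cached_V.copy()
    K_exp[selected], V_exp[selected] = K, V

    # Step 4: causal attention; row i (the token at position selected[i])
    # may attend only to positions <= selected[i].
    scores = Q @ K_exp.T                                   # (k, N)
    mask = np.arange(N)[None, :] > np.asarray(selected)[:, None]
    scores[mask] = -np.inf
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                      # softmax rows

    # Step 5: output for the k tokens, shape (k, d).
    return (A @ V_exp) @ W_O
```

Note the asymmetry that makes this cheap: only k rows of attention are computed, while the N - k unselected tokens contribute solely through their cached K/V entries.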