ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction
Authors: Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, Yun Liang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results show that ArkVale performs well on various long-context tasks with negligible accuracy loss under a 2k–4k cache budget, and can improve decoding latency by up to 2.2× (1.7× on average) and batching throughput by up to 4.6× (3.5× on average). |
| Researcher Affiliation | Collaboration | Renze Chen (Peking University), Zhuofeng Wang (Peking University), Beiquan Cao (Peking University), Tong Wu (Peking University), Size Zheng (Peking University), Xiuhong Li (Peking University), Xuechao Wei (Peking University), Shengen Yan (Infinigence-AI), Meng Li (Peking University), Yun Liang (Peking University) |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is now available at https://github.com/pku-liang/ArkVale. |
| Open Datasets | Yes | We apply our method to LongChat-7b-v1.5-32k [1] and use 6 datasets from LongBench [9] for benchmarking: HotpotQA [59], NarrativeQA [35], Qasper [20], GovReport [28], TriviaQA [30], and PassageRetrieval [9], along with the passkey-retrieval tasks. |
| Dataset Splits | No | The paper mentions using datasets for benchmarking and simulation, but it does not explicitly provide details about specific training/validation/test splits, their percentages, or how they were derived for the experiments. |
| Hardware Specification | Yes | Our experiment platform comprises an Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz and an NVIDIA A100 80GB PCIe GPU. |
| Software Dependencies | Yes | The software stack includes CUDA version 12.3, PyTorch [41, 8] version 2.3.0, and Hugging Face Transformers [57] version 4.40.0. We implement ArkVale on top of Hugging Face Transformers, with CUTLASS [54], FlashInfer [60], and RAFT [44] for certain kernels. |
| Experiment Setup | Yes | We configure four cache budget settings: 4096, 2048, 1024, and 512. ...with settings of batch-size=4, page-size=32, and KV-cache budgets of 512, 1024, 2048, and 4096. For page-size p and cache capacity c (in tokens), we set k = min(C, c/2)/p, where C is a hyper-parameter (default C = 40 × 32 = 1280). |
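The page-budget formula quoted in the Experiment Setup row can be sketched as a small helper. This is a minimal illustration, assuming the formula applies as stated (k = min(C, c/2)/p with integer division into whole pages); the function name `topk_pages` is hypothetical and not part of the ArkVale codebase.

```python
def topk_pages(cache_capacity_tokens: int,
               page_size: int = 32,
               C: int = 40 * 32) -> int:
    """Number of pages k selected per step, per the quoted formula.

    cache_capacity_tokens: KV-cache budget c, in tokens.
    page_size: tokens per page p (paper default: 32).
    C: hyper-parameter cap in tokens (paper default: 40 * 32 = 1280).
    """
    # k = min(C, c/2) / p, rounded down to whole pages
    return min(C, cache_capacity_tokens // 2) // page_size


# The paper's four cache-budget settings with page-size 32:
for c in (512, 1024, 2048, 4096):
    print(c, topk_pages(c))
```

Note that the cap C only binds for the largest budget: at c = 4096, c/2 = 2048 exceeds C = 1280, so k is limited to 1280 / 32 = 40 pages, while the smaller budgets use c/2 directly.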