ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction

Authors: Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, Yun Liang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results show that ARKVALE performs well on various long-context tasks with negligible accuracy loss under a 2k~4k cache budget and can improve decoding latency by up to 2.2× (1.7× on average) and batching throughput by up to 4.6× (3.5× on average).
Researcher Affiliation | Collaboration | Renze Chen (Peking University), Zhuofeng Wang (Peking University), Beiquan Cao (Peking University), Tong Wu (Peking University), Size Zheng (Peking University), Xiuhong Li (Peking University), Xuechao Wei (Peking University), Shengen Yan (Infinigence-AI), Meng Li (Peking University), Yun Liang (Peking University)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is now available at https://github.com/pku-liang/ArkVale.
Open Datasets | Yes | We apply our method to LongChat-7b-v1.5-32k [1] and use 6 datasets from LongBench [9] for benchmarking: HotpotQA [59], NarrativeQA [35], Qasper [20], GovReport [28], TriviaQA [30], and Passage Retrieval [9], along with the passkey-retrieval tasks.
Dataset Splits | No | The paper mentions using datasets for benchmarking and simulation, but it does not explicitly provide details about specific training/validation/test splits, their percentages, or how they were derived for the experiments.
Hardware Specification | Yes | Our experiment platform comprises an Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz and an NVIDIA A100 80GB PCIe GPU.
Software Dependencies | Yes | The software stack includes CUDA version 12.3, PyTorch [41, 8] version 2.3.0, and Hugging Face Transformers [57] version 4.40.0. We implement ARKVALE on top of Hugging Face Transformers, with CUTLASS [54], FlashInfer [60], and RAFT [44] for certain kernels.
Experiment Setup | Yes | We configure four cache budget settings: 4096, 2048, 1024, and 512. ...with settings of batch-size=4, page-size=32, and KV cache budgets of 512, 1024, 2048, and 4096. For page-size p and cache capacity c (tokens), we set k = min(C, c/2)/p, where C is a hyper-parameter (default C = 40 × 32 = 1280).
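The quoted budget formula can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name and the integer-division reading of the formula are our assumptions; the source only states k = min(C, c/2)/p with default C = 40 × 32 = 1280.

```python
# Hypothetical helper illustrating the quoted top-k page-selection budget.
# The function name and integer-division interpretation are assumptions;
# the paper only gives k = min(C, c/2) / p with default C = 40 * 32 = 1280.
def num_selected_pages(c: int, p: int = 32, C: int = 40 * 32) -> int:
    """Pages to select (k) for a cache capacity of c tokens and page size p."""
    return min(C, c // 2) // p

# With the paper's default page-size 32: a 2048-token budget gives
# k = min(1280, 1024) / 32 = 32 pages, while a 4096-token budget is
# capped by C: k = min(1280, 2048) / 32 = 40 pages.
```

Note the two regimes: for small budgets the c/2 term dominates (half the budget, in tokens, converted to pages), while for large budgets the hyper-parameter C caps how many pages are ever selected.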