ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction

Authors: Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, Yun Liang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results show that ARKVALE performs well on various long-context tasks with negligible accuracy loss under a 2k~4k cache budget and can improve decoding latency by up to 2.2× (1.7× on average) and batching throughput by up to 4.6× (3.5× on average).
Researcher Affiliation | Collaboration | Renze Chen (Peking University), Zhuofeng Wang (Peking University), Beiquan Cao (Peking University), Tong Wu (Peking University), Size Zheng (Peking University), Xiuhong Li (Peking University), Xuechao Wei (Peking University), Shengen Yan (Infinigence-AI), Meng Li (Peking University), Yun Liang (Peking University)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is now available at https://github.com/pku-liang/ArkVale.
Open Datasets | Yes | We apply our method to LongChat-7b-v1.5-32k [1] and use 6 datasets from LongBench [9] for benchmarking: HotpotQA [59], NarrativeQA [35], Qasper [20], GovReport [28], TriviaQA [30], and Passage Retrieval [9], along with the passkey-retrieval tasks.
Dataset Splits | No | The paper mentions using datasets for benchmarking and simulation, but it does not explicitly provide details about specific training/validation/test splits, their percentages, or how they were derived for the experiments.
Hardware Specification | Yes | Our experiment platform comprises an Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz and an NVIDIA A100 80GB PCIe GPU.
Software Dependencies | Yes | The software stack includes CUDA version 12.3, PyTorch [41, 8] version 2.3.0, and Hugging Face Transformers [57] version 4.40.0. We implement ARKVALE on top of Hugging Face Transformers, with CUTLASS [54], FlashInfer [60], and RAFT [44] for certain kernels.
Experiment Setup | Yes | We configure four cache budget settings: 4096, 2048, 1024, and 512. ...with settings of batch-size=4, page-size=32, and KV cache budgets of 512, 1024, 2048, and 4096. For page-size p and cache capacity c (tokens), we set k = min(C, c/2)/p, where C is a hyper-parameter (default C = 40 × 32 = 1280).
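The quoted budget formula can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name and the integer-division reading of the formula are our assumptions; the source only states k = min(C, c/2)/p with default C = 40 × 32 = 1280.

```python
# Hypothetical helper illustrating the quoted top-k page-selection budget.
# The function name and integer-division interpretation are assumptions;
# the paper only gives k = min(C, c/2) / p with default C = 40 * 32 = 1280.
def num_selected_pages(c: int, p: int = 32, C: int = 40 * 32) -> int:
    """Pages to select (k) for a cache capacity of c tokens and page size p."""
    return min(C, c // 2) // p

# With the paper's default page-size 32: a 2048-token budget gives
# k = min(1280, 1024) / 32 = 32 pages, while a 4096-token budget is
# capped by C: k = min(1280, 2048) / 32 = 40 pages.
```

Note the two regimes: for small budgets the c/2 term dominates (half the budget, in tokens, converted to pages), while for large budgets the hyper-parameter C caps how many pages are ever selected.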