Radar: Fast Long-Context Decoding for Any Transformer

Authors: Yongchang Hao, Mengyao Zhai, Hossein Hajimirsadeghi, Sepidehsadat Hosseini, Frederick Tung

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive comparisons with the previous methods on a wide range of tasks. The results demonstrate that Radar achieves state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing of Transformers. The code is publicly available at https://github.com/BorealisAI/radar-decoding." Section 3 (Experiments): "In this section, we conduct extensive experiments to verify the effectiveness of Radar in comparison with other methods on diverse tasks and models."
Researcher Affiliation | Collaboration | Yongchang Hao (University of Alberta & RBC Borealis, EMAIL); Mengyao Zhai (RBC Borealis, EMAIL)
Pseudocode | Yes | Appendix A (Complete Algorithm): "We provide the complete algorithm as follows." Algorithm 1: "The overall algorithm of Radar"
Open Source Code | Yes | "The code is publicly available at https://github.com/BorealisAI/radar-decoding."
Open Datasets | Yes | "Datasets. Following previous work (Xiao et al., 2024; Zhang et al., 2023), we use the perplexity on the first sample in the PG-19 dataset as the main test bed. Specifically, we evaluate the overall perplexity by feeding the ground-truth tokens one by one. Since PG-19 only contains natural language, we additionally use a code sample from The Stack (Lozhkov et al., 2024) for a broader comparison. Following previous work (Li et al., 2024), we use the LongBench dataset (Bai et al., 2024) as the main benchmark. For the datasets, we follow the previous work and similarly obtain them from Hugging Face: THUDM/LongBench, emozilla/pg19-test, bigcode/the-stack-smol."
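The evaluation protocol quoted here (feeding the ground-truth tokens one by one and measuring overall perplexity) reduces to averaging per-token log-probabilities under teacher forcing. A minimal sketch of that arithmetic, not code from the paper:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities, as collected
    when ground-truth tokens are fed to the model one by one."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that always assigns probability 1/4 to the correct next token
# has perplexity 4 (up to float rounding).
uniform = [math.log(0.25)] * 16
print(perplexity(uniform))
```

In practice `token_logprobs` would come from the model's logits at each decoding step; the function itself is model-agnostic.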
Dataset Splits | Yes | "Datasets. Following previous work (Xiao et al., 2024; Zhang et al., 2023), we use the perplexity on the first sample in the PG-19 dataset as the main test bed. To simulate real-world use cases, we prefill the first 16,384 tokens into the model as a prompt. Following the standard LongBench pipeline (Bai et al., 2024), we truncate the context from the middle if the prompt length exceeds the pre-training length. For the datasets, we follow the previous work and similarly obtain them from Hugging Face: THUDM/LongBench, emozilla/pg19-test, bigcode/the-stack-smol."
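The middle-truncation step quoted above (standard in the LongBench pipeline) keeps the head and tail of the prompt and drops tokens from the middle when the prompt exceeds the model's pre-training length. `truncate_middle` below is a hypothetical helper sketching that behavior, not code from the paper:

```python
def truncate_middle(tokens, max_len):
    """If the prompt exceeds max_len tokens, keep the first and last
    halves and drop the middle (LongBench-style truncation sketch)."""
    if len(tokens) <= max_len:
        return tokens
    head = max_len // 2
    tail = max_len - head
    return tokens[:head] + tokens[len(tokens) - tail:]

# 10 tokens truncated to 4: first 2 and last 2 survive.
print(truncate_middle(list(range(10)), 4))  # -> [0, 1, 8, 9]
```

Truncating from the middle rather than the end preserves both the task instruction (usually at the start) and the question (usually at the end) of a LongBench prompt.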
Hardware Specification | Yes | "Each experiment is conducted on a single A100 GPU with 40GB of memory."
Software Dependencies | No | The paper mentions specific models (Llama-Meta-3.1-8B, Mistral-7B-v0.3, Llama-7b, Llama-2-7b-chat-hf, and Mistral-7B-Instruct-v0.2), but it does not specify version numbers of the underlying software libraries (e.g., Python, PyTorch, TensorFlow, or Hugging Face Transformers) used for implementation or experimentation.
Experiment Setup | Yes | "For our Radar, we choose the top 64 segments for each query, with the random projection dimension set to 2048. The effect of these two parameters is studied in Section 3.3. Following the default setting in previous work (Xiao et al., 2024; Zhang et al., 2023), the sliding window length is set to 1024 for all runs. To simulate real-world use cases, we prefill the first 16,384 tokens into the model as a prompt."
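The setup quoted here has two Radar-specific knobs: the number of top-scoring segments retrieved per query (64 in the paper) and the random projection dimension (2048). A rough sketch of how those knobs interact, assuming a Performer-style `exp` feature map and mean-pooled segment summaries; this is an illustration under stated assumptions, not the paper's exact algorithm, and the small `n_segments`/`top_k` values are chosen only to keep the example fast:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, proj_dim = 64, 2048        # proj_dim matches the paper's setting
n_segments, seg_len, top_k = 32, 16, 4  # illustrative; the paper uses top-64 segments

# Cached keys, mapped through a shared random projection.
keys = rng.standard_normal((n_segments * seg_len, d_model))
proj = rng.standard_normal((d_model, proj_dim)) / np.sqrt(proj_dim)
feats = np.exp(keys @ proj)         # assumed random-feature map

# One summary vector per fixed-length segment (mean-pooled features).
seg_summary = feats.reshape(n_segments, seg_len, proj_dim).mean(axis=1)

# Score every segment against the query in the same feature space,
# then keep only the top-k segments for attention.
query = rng.standard_normal(d_model)
scores = seg_summary @ np.exp(query @ proj)
top_segments = np.argsort(scores)[-top_k:]
print(top_segments.shape)  # -> (4,)
```

The point of the sketch: scoring is one matrix-vector product over `n_segments` summaries instead of attention over every cached token, which is where the reduced time complexity comes from; exact attention is then computed only inside the selected segments and the sliding window.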