Radar: Fast Long-Context Decoding for Any Transformer

Authors: Yongchang Hao, Mengyao Zhai, Hossein Hajimirsadeghi, Sepidehsadat Hosseini, Frederick Tung

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct extensive comparisons with the previous methods on a wide range of tasks. The results demonstrate that Radar achieves state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing of Transformers. The code is publicly available at https://github.com/BorealisAI/radar-decoding." Section 3 (Experiments): "In this section, we conduct extensive experiments to verify the effectiveness of Radar in comparison with other methods on diverse tasks and models."
Researcher Affiliation | Collaboration | Yongchang Hao (University of Alberta & RBC Borealis, EMAIL); Mengyao Zhai (RBC Borealis, EMAIL)
Pseudocode | Yes | Appendix A (Complete Algorithm): "We provide the complete algorithm as follows." Algorithm 1: "The overall algorithm of Radar"
Open Source Code | Yes | "The code is publicly available at https://github.com/BorealisAI/radar-decoding."
Open Datasets | Yes | "Datasets. Following previous work (Xiao et al., 2024; Zhang et al., 2023), we use the perplexity on the first sample in the PG-19 dataset as the main test bed. Specifically, we evaluate the overall perplexity by feeding the ground-truth tokens one by one. Since PG-19 only contains natural language, we additionally use a code sample from The Stack (Lozhkov et al., 2024) for a broader comparison. Following previous work (Li et al., 2024), we use the LongBench dataset (Bai et al., 2024) as the main benchmark. For the datasets, we follow the previous work and similarly obtain them from Hugging Face: THUDM/LongBench, emozilla/pg19-test, bigcode/the-stack-smol."
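The evaluation protocol quoted here (feeding the ground-truth tokens one by one and measuring overall perplexity) reduces to averaging per-token log-probabilities under teacher forcing. A minimal sketch of that arithmetic, not code from the paper:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities, as collected
    when ground-truth tokens are fed to the model one by one."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that always assigns probability 1/4 to the correct next token
# has perplexity 4 (up to float rounding).
uniform = [math.log(0.25)] * 16
print(perplexity(uniform))
```

In practice `token_logprobs` would come from the model's logits at each decoding step; the function itself is model-agnostic.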
Dataset Splits | Yes | "Datasets. Following previous work (Xiao et al., 2024; Zhang et al., 2023), we use the perplexity on the first sample in the PG-19 dataset as the main test bed. To simulate real-world use cases, we prefill the first 16,384 tokens into the model as a prompt. Following the standard LongBench pipeline (Bai et al., 2024), we truncate the context from the middle if the prompt length exceeds the pre-training length. For the datasets, we follow the previous work and similarly obtain them from Hugging Face: THUDM/LongBench, emozilla/pg19-test, bigcode/the-stack-smol."
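The middle-truncation step quoted above (standard in the LongBench pipeline) keeps the head and tail of the prompt and drops tokens from the middle when the prompt exceeds the model's pre-training length. `truncate_middle` below is a hypothetical helper sketching that behavior, not code from the paper:

```python
def truncate_middle(tokens, max_len):
    """If the prompt exceeds max_len tokens, keep the first and last
    halves and drop the middle (LongBench-style truncation sketch)."""
    if len(tokens) <= max_len:
        return tokens
    head = max_len // 2
    tail = max_len - head
    return tokens[:head] + tokens[len(tokens) - tail:]

# 10 tokens truncated to 4: first 2 and last 2 survive.
print(truncate_middle(list(range(10)), 4))  # -> [0, 1, 8, 9]
```

Truncating from the middle rather than the end preserves both the task instruction (usually at the start) and the question (usually at the end) of a LongBench prompt.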
Hardware Specification | Yes | "Each experiment is conducted on a single A100 GPU with 40GB of memory."
Software Dependencies | No | The paper mentions specific models (Llama-Meta-3.1-8B, Mistral-7B-v0.3, Llama-7b, Llama-2-7b-chat-hf, and Mistral-7B-Instruct-v0.2), but it does not specify version numbers of the underlying software libraries (e.g., Python, PyTorch, TensorFlow, or Hugging Face Transformers) used for implementation or experimentation.
Experiment Setup | Yes | "For our Radar, we choose the top 64 segments for each query, with the random projection dimension set to 2048. The effect of these two parameters is studied in Section 3.3. Following the default setting in previous work (Xiao et al., 2024; Zhang et al., 2023), the sliding window length is set to 1024 for all runs. To simulate real-world use cases, we prefill the first 16,384 tokens into the model as a prompt."
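The setup quoted here has two Radar-specific knobs: the number of top-scoring segments retrieved per query (64 in the paper) and the random projection dimension (2048). A rough sketch of how those knobs interact, assuming a Performer-style `exp` feature map and mean-pooled segment summaries; this is an illustration under stated assumptions, not the paper's exact algorithm, and the small `n_segments`/`top_k` values are chosen only to keep the example fast:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, proj_dim = 64, 2048        # proj_dim matches the paper's setting
n_segments, seg_len, top_k = 32, 16, 4  # illustrative; the paper uses top-64 segments

# Cached keys, mapped through a shared random projection.
keys = rng.standard_normal((n_segments * seg_len, d_model))
proj = rng.standard_normal((d_model, proj_dim)) / np.sqrt(proj_dim)
feats = np.exp(keys @ proj)         # assumed random-feature map

# One summary vector per fixed-length segment (mean-pooled features).
seg_summary = feats.reshape(n_segments, seg_len, proj_dim).mean(axis=1)

# Score every segment against the query in the same feature space,
# then keep only the top-k segments for attention.
query = rng.standard_normal(d_model)
scores = seg_summary @ np.exp(query @ proj)
top_segments = np.argsort(scores)[-top_k:]
print(top_segments.shape)  # -> (4,)
```

The point of the sketch: scoring is one matrix-vector product over `n_segments` summaries instead of attention over every cached token, which is where the reduced time complexity comes from; exact attention is then computed only inside the selected segments and the sliding window.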