Long Context Compression with Activation Beacon

Authors: Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations are conducted on various long-context tasks whose lengths (e.g., 128K) may far exceed the maximum training length (20K), such as document understanding, few-shot learning, and Needle-in-a-Haystack. Whilst existing methods struggle to handle these challenging tasks, Activation Beacon maintains a comparable performance to the uncompressed baseline across various scenarios, achieving a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache. In the experiments, Activation Beacon is applied to Llama-2 (Touvron et al., 2023) and Qwen2 (Yang et al., 2024), and the resulting models are evaluated on the long-context tasks above.
Researcher Affiliation | Collaboration | 1: Beijing Academy of Artificial Intelligence; 2: Gaoling School of Artificial Intelligence, Renmin University of China. The authors are affiliated with a research institution (Beijing Academy of Artificial Intelligence) and a university (Renmin University of China), indicating a collaboration.
Pseudocode | No | The paper describes the compression mechanism and learning method in textual and mathematical form, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the methodology described.
Open Datasets | Yes | In pre-training, we use 1B tokens sampled from Red Pajama (Computer, 2023). In fine-tuning, we leverage Long Alpaca (Chen et al., 2023b), BookSum (Kryściński et al., 2022), and synthetic data from GPT-3.5 (details in Appendix A). All the training samples are shorter than 20K. The books are randomly sampled from the Books3 corpus, and the papers are sampled from Arxiv, both coming from the Pile (Gao et al., 2020). We further evaluate Activation Beacon on Needle-in-a-Haystack (NIAH) following the official settings (gkamradt, 2023) and on the Multi-Needle-in-a-Haystack task following NeedleBench (Li et al., 2024a).
Dataset Splits | Yes | The training consists of two phases. In pre-training, we use 1B tokens sampled from Red Pajama (Computer, 2023). ... In fine-tuning, we leverage Long Alpaca (Chen et al., 2023b), BookSum (Kryściński et al., 2022), and synthetic data from GPT-3.5 (details in Appendix A). All the training samples are shorter than 20K. This dataset contains 16K long-context question answering instances (13K for books and 3K for papers). ... To mitigate forgetting, we also include 5000 samples from the pre-training data.
Hardware Specification | Yes | For all our experiments, we use Huggingface framework (Wolf et al., 2020) and one 8x A800 (80G) machine.
Software Dependencies | Yes | Flash Attention-2 (Dao, 2023) is used to speed up attention computation. For all our experiments, we use Huggingface framework (Wolf et al., 2020) and one 8x A800 (80G) machine.
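As a rough sketch of the dependency stack named in this row (the Huggingface ecosystem plus Flash Attention-2), an environment could be set up along these lines; the paper pins no package versions, so none are assumed here:

```shell
# Illustrative setup for the quoted dependencies; versions unpinned
# because the paper does not specify them.
pip install torch transformers
# flash-attn builds CUDA kernels and requires a CUDA toolchain.
pip install flash-attn --no-build-isolation
```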
Experiment Setup | Yes | The training consists of two phases. ... The batch size is 8. The learning rate is 5e-5 for pre-training and 1e-5 for fine-tuning, with linear decay and no warmup. As introduced, the LLM's original parameters are frozen throughout the training process. During training, we randomly sample the compression ratio for each chunk, enhancing the model's flexibility to tackle different compression ratios in downstream tasks. ... The chunk size w is 1024 for Llama-2 and 2048 for Qwen-2.
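The chunk-wise training recipe quoted above (fixed chunk size w, with a compression ratio sampled at random per chunk) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the candidate ratio set and the one-beacon-per-`ratio`-tokens bookkeeping are assumptions made for the example.

```python
import random

# Chunk size from the quoted setup: 1024 for Llama-2 (2048 for Qwen-2).
CHUNK_SIZE = 1024
# Candidate compression ratios are an illustrative assumption;
# the paper only states that ratios are sampled randomly per chunk.
CANDIDATE_RATIOS = [2, 4, 8, 16, 32]


def plan_chunks(num_tokens: int, rng: random.Random):
    """Split a token stream into fixed-size chunks and sample a
    compression ratio for each, returning a list of
    (chunk_length, ratio, num_beacon_tokens) tuples."""
    plan = []
    for start in range(0, num_tokens, CHUNK_SIZE):
        length = min(CHUNK_SIZE, num_tokens - start)
        ratio = rng.choice(CANDIDATE_RATIOS)
        # Assumed bookkeeping: one beacon token summarizes `ratio` raw tokens.
        beacons = max(1, length // ratio)
        plan.append((length, ratio, beacons))
    return plan


plan = plan_chunks(4096 + 100, random.Random(0))
print(len(plan))  # → 5 (four full 1024-token chunks plus a 100-token tail)
```

Sampling the ratio per chunk, rather than fixing it, is what the quoted passage credits for the model's flexibility across compression ratios at inference time.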