Long Context Compression with Activation Beacon

Authors: Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations are conducted on various long-context tasks whose lengths (e.g., 128K) may far exceed the maximum training length (20K), such as document understanding, few-shot learning, and Needle-in-a-Haystack. Whilst existing methods struggle to handle these challenging tasks, Activation Beacon maintains a comparable performance to the uncompressed baseline across various scenarios, achieving a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache. In the experiments, Activation Beacon is applied to Llama-2 (Touvron et al., 2023) and Qwen2 (Yang et al., 2024), and the resulting models are evaluated on the long-context tasks above.
Researcher Affiliation | Collaboration | 1: Beijing Academy of Artificial Intelligence; 2: Gaoling School of Artificial Intelligence, Renmin University of China. The authors are affiliated with a research institution (Beijing Academy of Artificial Intelligence) and a university (Renmin University of China), indicating a collaboration.
Pseudocode | No | The paper describes the compression mechanism and learning method in textual and mathematical form, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the methodology described.
Open Datasets | Yes | In pre-training, we use 1B tokens sampled from Red Pajama (Computer, 2023). In fine-tuning, we leverage Long Alpaca (Chen et al., 2023b), BookSum (Kryściński et al., 2022), and synthetic data from GPT-3.5 (details in Appendix A). All the training samples are shorter than 20K. The books are randomly sampled from the Books3 corpus, and the papers are sampled from Arxiv, both coming from the Pile (Gao et al., 2020). We further evaluate Activation Beacon on Needle-in-a-Haystack (NIAH) following the official settings (gkamradt, 2023) and on the Multi-Needle-in-a-Haystack task following NeedleBench (Li et al., 2024a).
Dataset Splits | Yes | The training consists of two phases. In pre-training, we use 1B tokens sampled from Red Pajama (Computer, 2023). ... In fine-tuning, we leverage Long Alpaca (Chen et al., 2023b), BookSum (Kryściński et al., 2022), and synthetic data from GPT-3.5 (details in Appendix A). All the training samples are shorter than 20K. This dataset contains 16K long-context question answering instances (13K for books and 3K for papers). ... To mitigate forgetting, we also include 5000 samples from the pre-training data.
Hardware Specification | Yes | For all our experiments, we use Huggingface framework (Wolf et al., 2020) and one 8x A800 (80G) machine.
Software Dependencies | Yes | Flash Attention-2 (Dao, 2023) is used to speed up attention computation. For all our experiments, we use Huggingface framework (Wolf et al., 2020) and one 8x A800 (80G) machine.
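As a rough sketch of the dependency stack named in this row (the Huggingface ecosystem plus Flash Attention-2), an environment could be set up along these lines; the paper pins no package versions, so none are assumed here:

```shell
# Illustrative setup for the quoted dependencies; versions unpinned
# because the paper does not specify them.
pip install torch transformers
# flash-attn builds CUDA kernels and requires a CUDA toolchain.
pip install flash-attn --no-build-isolation
```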
Experiment Setup | Yes | The training consists of two phases. ... The batch size is 8. The learning rate is 5e-5 for pre-training and 1e-5 for fine-tuning, with linear decay and no warmup. As introduced, the LLM's original parameters are frozen throughout the training process. During training, we randomly sample the compression ratio for each chunk, enhancing the model's flexibility to tackle different compression ratios in downstream tasks. ... The chunk size w is 1024 for Llama-2 and 2048 for Qwen-2.
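The chunk-wise training recipe quoted above (fixed chunk size w, with a compression ratio sampled at random per chunk) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the candidate ratio set and the one-beacon-per-`ratio`-tokens bookkeeping are assumptions made for the example.

```python
import random

# Chunk size from the quoted setup: 1024 for Llama-2 (2048 for Qwen-2).
CHUNK_SIZE = 1024
# Candidate compression ratios are an illustrative assumption;
# the paper only states that ratios are sampled randomly per chunk.
CANDIDATE_RATIOS = [2, 4, 8, 16, 32]


def plan_chunks(num_tokens: int, rng: random.Random):
    """Split a token stream into fixed-size chunks and sample a
    compression ratio for each, returning a list of
    (chunk_length, ratio, num_beacon_tokens) tuples."""
    plan = []
    for start in range(0, num_tokens, CHUNK_SIZE):
        length = min(CHUNK_SIZE, num_tokens - start)
        ratio = rng.choice(CANDIDATE_RATIOS)
        # Assumed bookkeeping: one beacon token summarizes `ratio` raw tokens.
        beacons = max(1, length // ratio)
        plan.append((length, ratio, beacons))
    return plan


plan = plan_chunks(4096 + 100, random.Random(0))
print(len(plan))  # → 5 (four full 1024-token chunks plus a 100-token tail)
```

Sampling the ratio per chunk, rather than fixing it, is what the quoted passage credits for the model's flexibility across compression ratios at inference time.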