Long Context Compression with Activation Beacon
Authors: Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations are conducted on various long-context tasks whose lengths (e.g., 128K) may far exceed the maximum training length (20K), such as document understanding, few-shot learning, and Needle-in-a-Haystack. Whilst existing methods struggle to handle these challenging tasks, Activation Beacon maintains a comparable performance to the uncompressed baseline across various scenarios, achieving a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache. In our experiments, Activation Beacon is applied to Llama-2 (Touvron et al., 2023) and Qwen2 (Yang et al., 2024). |
| Researcher Affiliation | Collaboration | 1: Beijing Academy of Artificial Intelligence, 2: Gaoling School of Artificial Intelligence, Renmin University of China. The authors are affiliated with a research institution (Beijing Academy of Artificial Intelligence) and a university (Renmin University of China), indicating a collaboration. |
| Pseudocode | No | The paper describes the compression mechanism and learning method in textual and mathematical form, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | In pre-training, we use 1B tokens sampled from RedPajama (Computer, 2023). In fine-tuning, we leverage LongAlpaca (Chen et al., 2023b), BookSum (Kryściński et al., 2022), and synthetic data from GPT-3.5 (details in Appendix A). All the training samples are shorter than 20K. The books are randomly sampled from the Books3 corpus, and the papers are sampled from Arxiv, both coming from the Pile (Gao et al., 2020). We further evaluate Activation Beacon on Needle-in-a-Haystack (NIAH) following the official settings (gkamradt, 2023) and on the Multi-Needle-in-a-Haystack task following NeedleBench (Li et al., 2024a). |
| Dataset Splits | Yes | The training consists of two phases. In pre-training, we use 1B tokens sampled from RedPajama (Computer, 2023). ... In fine-tuning, we leverage LongAlpaca (Chen et al., 2023b), BookSum (Kryściński et al., 2022), and synthetic data from GPT-3.5 (details in Appendix A). All the training samples are shorter than 20K. This dataset contains 16K long-context question answering instances (13K for books and 3K for papers). ... To mitigate forgetting, we also include 5000 samples from the pre-training data. |
| Hardware Specification | Yes | For all our experiments, we use Huggingface framework (Wolf et al., 2020) and one 8x A800 (80G) machine. |
| Software Dependencies | Yes | Flash Attention-2 (Dao, 2023) is used to speed up attention computation. For all our experiments, we use Huggingface framework (Wolf et al., 2020) and one 8x A800 (80G) machine. |
| Experiment Setup | Yes | The training consists of two phases. ... The batch size is 8. The learning rate is 5e-5 for pre-training and 1e-5 for fine-tuning, with linear decay and no warmup. As introduced, the LLM's original parameters are frozen throughout the training process. During training, we randomly sample the compression ratio for each chunk, enhancing the model's flexibility to tackle different compression ratios in downstream tasks. ... The chunk size w is 1024 for Llama-2 and 2048 for Qwen2. |
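The Experiment Setup row mentions splitting the input into fixed-size chunks (w = 1024 for Llama-2, 2048 for Qwen2) and randomly sampling a compression ratio per chunk during training. A minimal sketch of that bookkeeping is below; the candidate ratio set `RATIO_CANDIDATES` is an illustrative assumption, not the paper's exact values, and `plan_compression` is a hypothetical helper name.

```python
import random

# Chunk size for Llama-2 per the paper; Qwen2 would use 2048.
CHUNK_SIZE_LLAMA2 = 1024
# Assumed candidate compression ratios (illustrative, not from the paper).
RATIO_CANDIDATES = [2, 4, 8, 16, 32]

def plan_compression(seq_len, chunk_size=CHUNK_SIZE_LLAMA2, rng=random):
    """Split a sequence into chunks and sample a compression ratio per chunk.

    Returns a list of (chunk_len, ratio, n_condensed) tuples, where
    n_condensed = ceil(chunk_len / ratio) is the number of condensed
    activations kept for that chunk in place of its full KV cache.
    """
    plan = []
    for start in range(0, seq_len, chunk_size):
        chunk_len = min(chunk_size, seq_len - start)
        ratio = rng.choice(RATIO_CANDIDATES)
        n_condensed = -(-chunk_len // ratio)  # ceiling division
        plan.append((chunk_len, ratio, n_condensed))
    return plan

# Example: a 20K-token training sample (the paper's maximum training length).
plan = plan_compression(20_000, rng=random.Random(0))
total_kept = sum(n for _, _, n in plan)
print(f"{len(plan)} chunks, {total_kept} condensed activations kept")
```

Sampling the ratio per chunk (rather than fixing one globally) is what lets a single trained model serve different compression ratios at inference time, trading memory for fidelity per task.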