ParallelComp: Parallel Long-Context Compressor for Length Extrapolation

Authors: Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results show that PARALLELCOMP enables an 8B model (trained on 8K context) to achieve 91.17% of GPT-4's performance under ultra-long contexts, outperforming closed-source models such as Claude-2 and Kimi-Chat."
Researcher Affiliation | Collaboration | "1 The University of Hong Kong, 2 Nanjing University, 3 The Chinese University of Hong Kong, 4 The Ohio State University, 5 The University of California, Los Angeles, 6 Sun Yat-sen University, 7 Tencent, 8 Hong Kong Polytechnic University."
Pseudocode | No | The paper describes its methods in text and illustrates processes with diagrams (e.g., Figure 2), but includes no explicit pseudocode or algorithm blocks.
Open Source Code | Yes | "We release the code at https://github.com/menik1126/ParallelComp."
Open Datasets | Yes | "We compare our method with existing length extrapolation approaches... on LongBench (Bai et al., 2023) and InfiniteBench (Zhang et al., 2024)... We present the results of perplexity (PPL) calculations on the NarrativeQA (Kočiský et al., 2018) test set."
Dataset Splits | Yes | "We present the results of perplexity (PPL) calculations on the NarrativeQA (Kočiský et al., 2018) test set."
Hardware Specification | Yes | "enabling 8B-parameter LLMs to extrapolate from 8K to 128K tokens on a single A100 80GB GPU"
Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | "For the hyperparameter τ, on LongBench we retain 3 chunks from the priority queue, except for PRe, where we retain only 1 chunk. On InfiniteBench, we retain 1 chunk for retrieval tasks and 3 chunks for other tasks from the priority queue. In all datasets, the context length of each chunk, including the query, is the maximum pre-training length of the model. Rs is obtained from the first 100 tokens of the chunk, Rr from the last 100 tokens, and the remaining part of the chunk yields Rm."
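The chunking setup quoted above (each chunk capped at the model's pre-training length, Rs/Rr taken from the first/last 100 tokens of a chunk, Rm from the remainder, and τ chunks retained from a priority queue) can be sketched as follows. This is a minimal illustration, not the authors' released implementation: the function names, the `(score, chunk)` queue interface, and the use of `heapq` are all assumptions.

```python
import heapq

CHUNK_LEN = 8192   # maximum pre-training length of the model (8K here)
BOUNDARY = 100     # tokens assigned to the Rs and Rr regions

def split_regions(chunk_tokens):
    """Split one chunk into the three regions described in the paper:
    Rs = first 100 tokens, Rr = last 100 tokens, Rm = the remainder."""
    rs = chunk_tokens[:BOUNDARY]
    rr = chunk_tokens[-BOUNDARY:]
    rm = chunk_tokens[BOUNDARY:-BOUNDARY]
    return rs, rm, rr

def retain_top_chunks(scored_chunks, tau):
    """Keep the tau highest-scoring chunks from a list of (score, chunk)
    pairs; per the setup above, tau is 1 or 3 depending on the task."""
    return [chunk for _, chunk in
            heapq.nlargest(tau, scored_chunks, key=lambda pair: pair[0])]
```

For example, with τ = 1 (as used for the PRe dataset and InfiniteBench retrieval tasks), `retain_top_chunks` would keep only the single highest-scoring chunk.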