ParallelComp: Parallel Long-Context Compressor for Length Extrapolation

Authors: Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results show that PARALLELCOMP enables an 8B model (trained on 8K context) to achieve 91.17% of GPT-4's performance under ultra-long contexts, outperforming closed-source models such as Claude-2 and Kimi-Chat."
Researcher Affiliation | Collaboration | "1 The University of Hong Kong, 2 Nanjing University, 3 The Chinese University of Hong Kong, 4 The Ohio State University, 5 The University of California, Los Angeles, 6 Sun Yat-sen University, 7 Tencent, 8 Hong Kong Polytechnic University."
Pseudocode | No | The paper describes its methods in text and illustrates processes with diagrams (e.g., Figure 2), but includes no explicit pseudocode or algorithm blocks.
Open Source Code | Yes | "We release the code at https://github.com/menik1126/ParallelComp."
Open Datasets | Yes | "We compare our method with existing length extrapolation approaches... on LongBench (Bai et al., 2023) and InfiniteBench (Zhang et al., 2024)... We present the results of perplexity (PPL) calculations on the NarrativeQA (Kočiský et al., 2018) test set."
Dataset Splits | Yes | "We present the results of perplexity (PPL) calculations on the NarrativeQA (Kočiský et al., 2018) test set."
Hardware Specification | Yes | "enabling 8B-parameter LLMs to extrapolate from 8K to 128K tokens on a single A100 80GB GPU"
Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup | Yes | "For the hyperparameter τ, on LongBench we retain 3 chunks from the priority queue, except for PRe, where we retain only 1 chunk. On InfiniteBench, we retain 1 chunk for retrieval tasks and 3 chunks for other tasks from the priority queue. In all datasets, the context length of each chunk, including the query, is the maximum pre-training length of the model. Rs is obtained from the first 100 tokens of the chunk, Rr from the last 100 tokens, and the remaining part of the chunk yields Rm."
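The chunking setup quoted above (each chunk capped at the model's pre-training length, Rs/Rr taken from the first/last 100 tokens of a chunk, Rm from the remainder, and τ chunks retained from a priority queue) can be sketched as follows. This is a minimal illustration, not the authors' released implementation: the function names, the `(score, chunk)` queue interface, and the use of `heapq` are all assumptions.

```python
import heapq

CHUNK_LEN = 8192   # maximum pre-training length of the model (8K here)
BOUNDARY = 100     # tokens assigned to the Rs and Rr regions

def split_regions(chunk_tokens):
    """Split one chunk into the three regions described in the paper:
    Rs = first 100 tokens, Rr = last 100 tokens, Rm = the remainder."""
    rs = chunk_tokens[:BOUNDARY]
    rr = chunk_tokens[-BOUNDARY:]
    rm = chunk_tokens[BOUNDARY:-BOUNDARY]
    return rs, rm, rr

def retain_top_chunks(scored_chunks, tau):
    """Keep the tau highest-scoring chunks from a list of (score, chunk)
    pairs; per the setup above, tau is 1 or 3 depending on the task."""
    return [chunk for _, chunk in
            heapq.nlargest(tau, scored_chunks, key=lambda pair: pair[0])]
```

For example, with τ = 1 (as used for the PRe dataset and InfiniteBench retrieval tasks), `retain_top_chunks` would keep only the single highest-scoring chunk.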