What is Wrong with Perplexity for Long-context Language Modeling?
Authors: Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, Yisen Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across a diverse suite of LLMs and long-context benchmarks show that LongPPL computed on a natural language corpus exhibits a consistently strong correlation with the models' scores on various long-context tasks, e.g., a -0.96 Pearson correlation in Figure 1(b) (bottom). Additionally, we introduce the LongCE (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens, leading to consistent improvements across diverse benchmarks. |
| Researcher Affiliation | Collaboration | 1 State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University 2 MIT CSAIL 3 Alibaba Group 4 TUM CIT, MCML, MDSI 5 MIT EECS, CSAIL 6 Institute for Artificial Intelligence, Peking University |
| Pseudocode | No | The paper describes methods and algorithms using mathematical formulations and textual descriptions but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/PKU-ML/LongPPL. |
| Open Datasets | Yes | We calculate LongPPL on the GovReport dataset (Huang et al., 2021), which consists of long sequences from government reports. For all the experiments, we use LongBench (Bai et al., 2023b), LongEval (Li et al., 2023a), and RULER (Hsieh et al., 2024) as the long-context benchmarks. We use PG-19 (Rae et al., 2020), a book dataset sourced from a library, and Pile-arxiv (Gao et al., 2020), a dataset consisting of Arxiv papers, as the training datasets. |
| Dataset Splits | No | The paper describes using datasets like PG-19 and Pile-arxiv for training and LongBench, LongEval, and RULER for evaluation, and specifies prompt lengths for evaluation (e.g., 'restrict the prompt length to 32k tokens'). However, it does not provide explicit details about standard training/validation/test dataset splits (e.g., exact percentages or sample counts) for these datasets during the fine-tuning process or evaluation. |
| Hardware Specification | Yes | We perform the experiments with 8 Nvidia A100 80GB GPUs using PyTorch (Paszke et al., 2019). |
| Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' but does not specify a version number for the software dependency. |
| Experiment Setup | Yes | For EABF (Zhang et al., 2024c), we adopt the identical settings to the original paper, with a RoPE base of 500k. For PI (Chen et al., 2023), we set the scaling factor to 8 since we want to extend the context window from 4k to 32k. We use a learning rate of 2 × 10⁻⁵ for Llama and 1 × 10⁻⁶ for Mistral, with no weight decay and a linear warmup of 20 steps, along with AdamW (Loshchilov, 2017) with β1 = 0.9 and β2 = 0.95. We apply a global batch size of 64 on PG-19 and 8 on Pile-arxiv. For the calculation of LongCE, we set γ = 5 in Equation 7 and use the same sliding window approach as described in Section 4.1 to improve training efficiency. The context length of s_i is set to K = 4096. We set the hyperparameters as α = 2, β = 2, K = 4096. |
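The LongCE loss reported in the setup is a re-weighted cross-entropy that up-weights key tokens, with the weight capped at γ = 5. The following is a minimal sketch, not the authors' implementation: `scores` is a hypothetical stand-in for the paper's per-token key-token measure, and the weight rule `min(exp(score), γ)` illustrates the capped re-weighting.

```python
import math

GAMMA = 5.0  # weight cap, matching the reported setting γ = 5

def token_ce(logits, target):
    """Cross-entropy of one token: negative log-softmax of the target logit."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def long_ce(logits_seq, targets, scores):
    """Re-weighted CE: each token's loss is scaled by min(exp(score), GAMMA),
    then normalized by the total weight. `scores` is a hypothetical per-token
    key-token measure; with all-zero scores this reduces to plain mean CE."""
    weights = [min(math.exp(s), GAMMA) for s in scores]
    losses = [token_ce(l, t) for l, t in zip(logits_seq, targets)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

With uniform (zero) scores the loss equals the ordinary mean cross-entropy; a positive score shifts the average toward the loss of the up-weighted token.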
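The -0.96 figure quoted in the table is a Pearson correlation between per-model LongPPL values and long-context benchmark scores. A self-contained way to compute such a correlation (the numbers below are illustrative only, not the paper's data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical values: lower LongPPL pairing with higher benchmark
# scores yields a strongly negative correlation, as the paper reports.
long_ppl = [2.1, 2.5, 3.0, 3.8, 4.6]
bench = [62.0, 55.0, 48.0, 40.0, 30.0]
```

A negative coefficient near -1 indicates that models with lower LongPPL systematically score higher on the benchmarks.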