What is Wrong with Perplexity for Long-context Language Modeling?
Authors: Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, Yisen Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across a diverse suite of LLMs and long-context benchmarks show that LongPPL computed on a natural language corpus exhibits a consistently strong correlation with the models' scores on various long-context tasks, e.g., a -0.96 Pearson correlation in Figure 1(b) (bottom). Additionally, we introduce the LongCE (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens, leading to consistent improvements across diverse benchmarks. |
| Researcher Affiliation | Collaboration | 1 State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University 2 MIT CSAIL 3 Alibaba Group 4 TUM CIT, MCML, MDSI 5 MIT EECS, CSAIL 6 Institute for Artificial Intelligence, Peking University |
| Pseudocode | No | The paper describes methods and algorithms using mathematical formulations and textual descriptions but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/PKU-ML/LongPPL. |
| Open Datasets | Yes | We calculate LongPPL on the GovReport dataset (Huang et al., 2021), which consists of long sequences from government reports. For all the experiments, we use LongBench (Bai et al., 2023b), LongEval (Li et al., 2023a), and RULER (Hsieh et al., 2024) as the long-context benchmarks. We use PG-19 (Rae et al., 2020), a book dataset sourced from a library, and Pile-arxiv (Gao et al., 2020), a dataset consisting of Arxiv papers, as the training datasets. |
| Dataset Splits | No | The paper describes using datasets like PG-19 and Pile-arxiv for training and LongBench, LongEval, and RULER for evaluation, and specifies prompt lengths for evaluation (e.g., 'restrict the prompt length to 32k tokens'). However, it does not provide explicit details about standard training/validation/test dataset splits (e.g., exact percentages or sample counts) for these datasets during the fine-tuning process or evaluation. |
| Hardware Specification | Yes | We perform the experiments with 8 Nvidia A100 80GB GPUs using PyTorch (Paszke et al., 2019). |
| Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al., 2019)' but does not specify a version number for the software dependency. |
| Experiment Setup | Yes | For EABF (Zhang et al., 2024c), we adopt the identical settings to the original paper, with a RoPE base of 500k. For PI (Chen et al., 2023), we set the scaling factor to 8 since we want to extend the context window from 4k to 32k. We use a learning rate of 2 × 10⁻⁵ for Llama and 1 × 10⁻⁶ for Mistral, with no weight decay and a linear warmup of 20 steps, along with AdamW (Loshchilov, 2017) with β1 = 0.9 and β2 = 0.95. We apply a global batch size of 64 on PG-19 and 8 on Pile-arxiv. For the calculation of LongCE, we set γ = 5 in Equation 7 and use the same sliding window approach as described in Section 4.1 to improve training efficiency. The context length of s_i is set to K = 4096. We set the hyperparameters as α = 2, β = 2, K = 4096. |
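The LongCE loss reported in the setup is a re-weighted cross-entropy that up-weights key tokens, with the weight capped at γ = 5. The following is a minimal sketch, not the authors' implementation: `scores` is a hypothetical stand-in for the paper's per-token key-token measure, and the weight rule `min(exp(score), γ)` illustrates the capped re-weighting.

```python
import math

GAMMA = 5.0  # weight cap, matching the reported setting γ = 5

def token_ce(logits, target):
    """Cross-entropy of one token: negative log-softmax of the target logit."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def long_ce(logits_seq, targets, scores):
    """Re-weighted CE: each token's loss is scaled by min(exp(score), GAMMA),
    then normalized by the total weight. `scores` is a hypothetical per-token
    key-token measure; with all-zero scores this reduces to plain mean CE."""
    weights = [min(math.exp(s), GAMMA) for s in scores]
    losses = [token_ce(l, t) for l, t in zip(logits_seq, targets)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

With uniform (zero) scores the loss equals the ordinary mean cross-entropy; a positive score shifts the average toward the loss of the up-weighted token.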
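The -0.96 figure quoted in the table is a Pearson correlation between per-model LongPPL values and long-context benchmark scores. A self-contained way to compute such a correlation (the numbers below are illustrative only, not the paper's data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical values: lower LongPPL pairing with higher benchmark
# scores yields a strongly negative correlation, as the paper reports.
long_ppl = [2.1, 2.5, 3.0, 3.8, 4.6]
bench = [62.0, 55.0, 48.0, 40.0, 30.0]
```

A negative coefficient near -1 indicates that models with lower LongPPL systematically score higher on the benchmarks.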