Long-Short Alignment for Effective Long-Context Modeling in LLMs
Authors: Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate the effectiveness of our approach, offering new insights for achieving more effective long-context modeling in LLMs. Code is available at https://github.com/PKU-ML/LongShortAlignment. |
| Researcher Affiliation | Academia | 1. State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, China; 2. NUS, Singapore; 3. MIT CSAIL, USA; 4. Institute for Artificial Intelligence, Peking University, China. |
| Pseudocode | Yes | A detailed PyTorch-like algorithm is provided in Appendix E, and an overall illustration can be found in Figure 3. |
| Open Source Code | Yes | Code is available at https://github.com/PKU-ML/LongShortAlignment. |
| Open Datasets | Yes | For perplexity evaluation, we select a subset from the RedPajama-Book corpus (Computer, 2023), following the protocol in (Chen et al., 2024). LongBench-E is a multitask benchmark that comprehensively evaluates large language models' ability to understand long contexts, with task lengths averaging between 5k and 32k tokens. |
| Dataset Splits | No | The paper mentions using 'validation sets' (e.g., in Section 5.1 and 5.2) and describes sampling sequence lengths for training and testing, but it does not explicitly provide the specific percentages, sample counts, or methodology for splitting the core datasets (like Red Pajama-Book or PG19) into training, validation, and test sets. It mentions selecting 'a subset' for perplexity evaluation, which is vague. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper provides PyTorch-like pseudocode in Appendix E, but it does not specify version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | In our experiments, we use Llama2-7b (Touvron et al., 2023) as the base model and apply the CLEX (Chen et al., 2024) adjustment method. We use two datasets: the RedPajama-Book corpus (Computer, 2023) and PG19 (Rae et al., 2019). The experiments are conducted with a context length of 4,096, a batch size of 64, and a maximum of 200 training steps. For the regularization coefficient α, we test values of 0.1 and 0.5. |
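The setup row above describes a training objective with a regularization term weighted by α. As a rough illustration only: the paper's actual regularizer is given in its Appendix E, and the KL-divergence alignment term, function names, and tensor shapes below are our assumptions, not the authors' code. The sketch assumes the regularizer encourages the model's next-token distributions under a long context to match those under a shorter context.

```python
import torch
import torch.nn.functional as F


def alignment_regularizer(logits_long: torch.Tensor,
                          logits_short: torch.Tensor) -> torch.Tensor:
    """Assumed alignment term: KL divergence between the next-token
    distributions produced from long- and short-context forward passes.
    Both inputs have shape (batch, seq_len, vocab_size)."""
    log_p_long = F.log_softmax(logits_long, dim=-1)
    log_p_short = F.log_softmax(logits_short, dim=-1)
    # log_target=True lets kl_div take log-probabilities for both arguments.
    return F.kl_div(log_p_long, log_p_short,
                    reduction="batchmean", log_target=True)


def total_loss(logits_long: torch.Tensor,
               targets: torch.Tensor,
               logits_short: torch.Tensor,
               alpha: float = 0.1) -> torch.Tensor:
    """Standard next-token cross-entropy plus the alpha-weighted
    alignment regularizer (alpha in {0.1, 0.5} per the setup row)."""
    vocab = logits_long.size(-1)
    ce = F.cross_entropy(logits_long.view(-1, vocab), targets.view(-1))
    return ce + alpha * alignment_regularizer(logits_long, logits_short)
```

With α = 0 this reduces to plain language-modeling cross-entropy, which makes the ablation over α values straightforward.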