Long-Short Alignment for Effective Long-Context Modeling in LLMs

Authors: Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments validate the effectiveness of our approach, offering new insights for achieving more effective long-context modeling in LLMs. Code is available at https://github.com/PKU-ML/LongShortAlignment."
Researcher Affiliation | Academia | "1 State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, China; 2 NUS, Singapore; 3 MIT CSAIL, USA; 4 Institute for Artificial Intelligence, Peking University, China."
Pseudocode | Yes | "A detailed PyTorch-like algorithm is provided in Appendix E, and an overall illustration can be found in Figure 3."
Open Source Code | Yes | "Code is available at https://github.com/PKU-ML/LongShortAlignment."
Open Datasets | Yes | "For perplexity evaluation, we select a subset from the RedPajama-Book corpus (Computer, 2023), following the protocol in (Chen et al., 2024). LongBench-E is a multitask benchmark that comprehensively evaluates large language models' ability to understand long contexts, with task lengths averaging between 5k and 32k tokens."
Dataset Splits | No | The paper mentions using "validation sets" (e.g., in Sections 5.1 and 5.2) and describes sampling sequence lengths for training and testing, but it does not explicitly provide the percentages, sample counts, or methodology for splitting the core datasets (such as RedPajama-Book or PG19) into training, validation, and test sets. It mentions only selecting "a subset" for perplexity evaluation, which is vague.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper provides PyTorch-like pseudocode in Appendix E, but it does not specify version numbers for PyTorch or other software dependencies.
Experiment Setup | Yes | "In our experiments, we use Llama2-7b (Touvron et al., 2023) as the base model and apply the CLEX (Chen et al., 2024) adjustment method. We use two datasets: the RedPajama-Book corpus (Computer, 2023) and PG19 (Rae et al., 2019). The experiments are conducted with a context length of 4,096, a batch size of 64, and a maximum of 200 training steps. For the regularization coefficient α, we test values of 0.1 and 0.5."
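For concreteness, the hyperparameters quoted above can be collected into a single configuration object. This is a hypothetical sketch for readers attempting reproduction: the field names and the `describe` helper are illustrative and do not come from the paper's released code.

```python
# Hypothetical configuration assembled from the setup quoted in the paper.
# Field names are illustrative; the authors' actual code may differ.
experiment_config = {
    "base_model": "Llama2-7b",          # Touvron et al., 2023
    "context_extension": "CLEX",        # Chen et al., 2024
    "datasets": ["RedPajama-Book", "PG19"],
    "context_length": 4096,
    "batch_size": 64,
    "max_train_steps": 200,
    "alpha_values": [0.1, 0.5],         # regularization coefficient sweep
}

def describe(cfg: dict) -> str:
    """Render a one-line summary of a run configuration."""
    return (f"{cfg['base_model']} + {cfg['context_extension']}, "
            f"ctx={cfg['context_length']}, bs={cfg['batch_size']}, "
            f"steps={cfg['max_train_steps']}")
```

Collecting the setup this way makes the reported α sweep explicit: one run per value in `alpha_values`, all other fields held fixed.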