Long-Short Alignment for Effective Long-Context Modeling in LLMs

Authors: Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments validate the effectiveness of our approach, offering new insights for achieving more effective long-context modeling in LLMs. Code is available at https://github.com/PKU-ML/LongShortAlignment."
Researcher Affiliation | Academia | "1 State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, China; 2 NUS, Singapore; 3 MIT CSAIL, USA; 4 Institute for Artificial Intelligence, Peking University, China."
Pseudocode | Yes | "A detailed PyTorch-like algorithm is provided in Appendix E, and an overall illustration can be found in Figure 3."
Open Source Code | Yes | "Code is available at https://github.com/PKU-ML/LongShortAlignment."
Open Datasets | Yes | "For perplexity evaluation, we select a subset from the RedPajama-Book corpus (Computer, 2023), following the protocol in (Chen et al., 2024). LongBench-E is a multitask benchmark that comprehensively evaluates large language models' ability to understand long contexts, with task lengths averaging between 5k and 32k tokens."
Dataset Splits | No | The paper mentions using "validation sets" (e.g., in Sections 5.1 and 5.2) and describes sampling sequence lengths for training and testing, but it does not explicitly provide the percentages, sample counts, or methodology for splitting the core datasets (such as RedPajama-Book or PG19) into training, validation, and test sets. It mentions only selecting "a subset" for perplexity evaluation, which is vague.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper provides PyTorch-like pseudocode in Appendix E, but it does not specify version numbers for PyTorch or other software dependencies.
Experiment Setup | Yes | "In our experiments, we use Llama2-7b (Touvron et al., 2023) as the base model and apply the CLEX (Chen et al., 2024) adjustment method. We use two datasets: the RedPajama-Book corpus (Computer, 2023) and PG19 (Rae et al., 2019). The experiments are conducted with a context length of 4,096, a batch size of 64, and a maximum of 200 training steps. For the regularization coefficient α, we test values of 0.1 and 0.5."
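For concreteness, the hyperparameters quoted above can be collected into a single configuration object. This is a hypothetical sketch for readers attempting reproduction: the field names and the `describe` helper are illustrative and do not come from the paper's released code.

```python
# Hypothetical configuration assembled from the setup quoted in the paper.
# Field names are illustrative; the authors' actual code may differ.
experiment_config = {
    "base_model": "Llama2-7b",          # Touvron et al., 2023
    "context_extension": "CLEX",        # Chen et al., 2024
    "datasets": ["RedPajama-Book", "PG19"],
    "context_length": 4096,
    "batch_size": 64,
    "max_train_steps": 200,
    "alpha_values": [0.1, 0.5],         # regularization coefficient sweep
}

def describe(cfg: dict) -> str:
    """Render a one-line summary of a run configuration."""
    return (f"{cfg['base_model']} + {cfg['context_extension']}, "
            f"ctx={cfg['context_length']}, bs={cfg['batch_size']}, "
            f"steps={cfg['max_train_steps']}")
```

Collecting the setup this way makes the reported α sweep explicit: one run per value in `alpha_values`, all other fields held fixed.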