Low-Dimension-to-High-Dimension Generalization and Its Implications for Length Generalization
Authors: Yang Chen, Long Yang, Yitao Liang, Zhouchen Lin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare the length generalization performance of RPE and RPE-Square on two tasks: Unaligned Copy and URF Addition. In unaligned copy, the input is a string whose length is not aligned to a fixed length, and the target is one copy of the input string. An unaligned copy instance of scale n takes the form [BOS] x_0 ... x_{n-1} = x_0 ... x_{n-1} [EOS]. URF addition is illustrated in Example 1. To examine length generalization, the models are trained only on small-scale instances but evaluated on instances of larger scales. More details of the experiments are in Appendix D.2. The experiment results are presented in Figure 2. |
| Researcher Affiliation | Academia | 1State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University 2Institute for Artificial Intelligence, Peking University 3Beijing Institute for General Artificial Intelligence 4Pazhou Laboratory (Huangpu), Guangzhou, Guangdong, China. Correspondence to: Zhouchen Lin <EMAIL>, Yitao Liang <EMAIL>. |
| Pseudocode | No | No structured pseudocode or algorithm blocks are present. Methodological steps are described in prose and mathematical formulations. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a link to a code repository for the described methodology. |
| Open Datasets | No | The paper describes how synthetic datasets were generated for specific tasks (e.g., for 'URF n-addition training data, we first sample the lengths of two addends uniformly from {1, ..., n} × {1, ..., n}'). However, it does not provide concrete access information (link, DOI, repository, or citation) indicating that these datasets are publicly available. |
| Dataset Splits | Yes | For unaligned copy: 'We sample 2000 n-length instances for each n = 1, ..., 5 as the training data. In the evaluation, we examine the learned models on instances of length 1–10.' For URF Addition: 'we train with 10000 URF 4-addition samples.' These statements specify the data used for training and evaluation. |
| Hardware Specification | Yes | The experiments are run on a server with Ubuntu. The models are trained on two NVIDIA GeForce RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions using 'GPT-2 tokenizer' and that the 'implementation is adapted from Hugging Face (Wolf et al., 2020)'. However, specific version numbers for these software components or other libraries are not provided. |
| Experiment Setup | Yes | Two training setups are quoted, differing only in per-device batch size. First: 'The model is trained by AdamW with the cosine scheduler, where the initial learning rate is 0.0005, the weight decay is 1.0, the warmup ratio is 0.05, the gradient accumulation step is 2, and the per-device training batch size is 256.' Second: 'We train the models by AdamW, with the initial learning rate 0.0005, the weight decay 1.0, and the cosine scheduler. The warmup ratio is 0.05. The gradient accumulation step is 2. The per-device training batch size is 128.' |
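The unaligned-copy data generation quoted above (2000 instances per length n = 1, ..., 5, each of the form [BOS] x_0 ... x_{n-1} = x_0 ... x_{n-1} [EOS]) can be sketched as follows. This is a minimal reconstruction, not the authors' code; the token alphabet `VOCAB` is a hypothetical stand-in, since the paper excerpt does not specify one.

```python
import random

BOS, EOS = "[BOS]", "[EOS]"
VOCAB = list("abcdefghij")  # hypothetical alphabet; not specified in the excerpt


def make_copy_instance(n, rng):
    """One unaligned-copy instance of scale n:
    [BOS] x_0 ... x_{n-1} = x_0 ... x_{n-1} [EOS]."""
    xs = [rng.choice(VOCAB) for _ in range(n)]
    return [BOS] + xs + ["="] + xs + [EOS]


def make_training_set(rng, per_length=2000, max_len=5):
    """Sample `per_length` instances for each n = 1..max_len,
    matching the paper's reported 2000 instances per length."""
    return [
        make_copy_instance(n, rng)
        for n in range(1, max_len + 1)
        for _ in range(per_length)
    ]


rng = random.Random(0)
train_data = make_training_set(rng)  # 5 lengths x 2000 = 10000 instances
```

Evaluation data for the length-generalization test would be generated the same way with n ranging over 1 to 10.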
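The quoted optimizer setup (cosine scheduler with initial learning rate 0.0005 and warmup ratio 0.05) implies a learning-rate curve like the one below. This is a plain-Python sketch of the standard linear-warmup-then-cosine schedule under those hyperparameters; the paper adapts its implementation from Hugging Face, whose scheduler may differ in minor details.

```python
import math


def lr_at_step(step, total_steps, base_lr=5e-4, warmup_ratio=0.05):
    """Linear warmup followed by cosine decay, using the reported
    hyperparameters (initial lr 5e-4, warmup ratio 0.05)."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

At the end of warmup (5% of training) the rate peaks at 5e-4, then decays smoothly to 0 at the final step.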