Low-Dimension-to-High-Dimension Generalization and Its Implications for Length Generalization
Authors: Yang Chen, Long Yang, Yitao Liang, Zhouchen Lin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare the length generalization performance of RPE and RPE-Square on two tasks: Unaligned Copy and URF Addition. In unaligned copy, the input is a string whose length is not aligned to a fixed length, and the target is one copy of the input string. An unaligned copy instance of scale n takes the form [BOS] x_0 ... x_{n-1} = x_0 ... x_{n-1} [EOS]. URF addition is illustrated in Example 1. To examine length generalization, the models are trained only on small-scale instances but evaluated on instances of larger scales. More details of the experiments are in Appendix D.2. The experiment results are presented in Figure 2. |
| Researcher Affiliation | Academia | 1State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University 2Institute for Artificial Intelligence, Peking University 3Beijing Institute for General Artificial Intelligence 4Pazhou Laboratory (Huangpu), Guangzhou, Guangdong, China. Correspondence to: Zhouchen Lin <EMAIL>, Yitao Liang <EMAIL>. |
| Pseudocode | No | No structured pseudocode or algorithm blocks are present. Methodological steps are described in prose and mathematical formulations. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or provide a link to a code repository for the described methodology. |
| Open Datasets | No | The paper describes how synthetic datasets were generated for specific tasks (e.g., for 'URF n-addition training data, we first sample the lengths of two addends uniformly from {1, ..., n} × {1, ..., n}'). However, it does not provide concrete access information (link, DOI, repository, or citation) indicating that these datasets are publicly available. |
| Dataset Splits | Yes | For unaligned copy: 'We sample 2000 n-length instances for each n = 1, ..., 5 as the training data. In the evaluation, we examine the learned models on instances of length 1–10.' For URF Addition: 'we train with 10000 URF 4-addition samples.' These statements specify the data used for training and evaluation. |
| Hardware Specification | Yes | The experiments are run on a server with Ubuntu. The models are trained on two NVIDIA GeForce RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions using 'GPT-2 tokenizer' and that the 'implementation is adapted from Hugging Face (Wolf et al., 2020)'. However, specific version numbers for these software components or other libraries are not provided. |
| Experiment Setup | Yes | Two training setups are quoted, differing only in per-device batch size. First: 'The model is trained by AdamW with the cosine scheduler, where the initial learning rate is 0.0005, the weight decay is 1.0, the warmup ratio is 0.05, the gradient accumulation step is 2, and the per-device training batch size is 256.' Second: 'We train the models by AdamW, with the initial learning rate 0.0005, the weight decay 1.0, and the cosine scheduler. The warmup ratio is 0.05. The gradient accumulation step is 2. The per-device training batch size is 128.' |
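The unaligned-copy data generation quoted above (2000 instances per length n = 1, ..., 5, each of the form [BOS] x_0 ... x_{n-1} = x_0 ... x_{n-1} [EOS]) can be sketched as follows. This is a minimal reconstruction, not the authors' code; the token alphabet `VOCAB` is a hypothetical stand-in, since the paper excerpt does not specify one.

```python
import random

BOS, EOS = "[BOS]", "[EOS]"
VOCAB = list("abcdefghij")  # hypothetical alphabet; not specified in the excerpt


def make_copy_instance(n, rng):
    """One unaligned-copy instance of scale n:
    [BOS] x_0 ... x_{n-1} = x_0 ... x_{n-1} [EOS]."""
    xs = [rng.choice(VOCAB) for _ in range(n)]
    return [BOS] + xs + ["="] + xs + [EOS]


def make_training_set(rng, per_length=2000, max_len=5):
    """Sample `per_length` instances for each n = 1..max_len,
    matching the paper's reported 2000 instances per length."""
    return [
        make_copy_instance(n, rng)
        for n in range(1, max_len + 1)
        for _ in range(per_length)
    ]


rng = random.Random(0)
train_data = make_training_set(rng)  # 5 lengths x 2000 = 10000 instances
```

Evaluation data for the length-generalization test would be generated the same way with n ranging over 1 to 10.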
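The quoted optimizer setup (cosine scheduler with initial learning rate 0.0005 and warmup ratio 0.05) implies a learning-rate curve like the one below. This is a plain-Python sketch of the standard linear-warmup-then-cosine schedule under those hyperparameters; the paper adapts its implementation from Hugging Face, whose scheduler may differ in minor details.

```python
import math


def lr_at_step(step, total_steps, base_lr=5e-4, warmup_ratio=0.05):
    """Linear warmup followed by cosine decay, using the reported
    hyperparameters (initial lr 5e-4, warmup ratio 0.05)."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

At the end of warmup (5% of training) the rate peaks at 5e-4, then decays smoothly to 0 at the final step.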