Generalizing Reasoning Problems to Longer Lengths

Authors: Changnan Xiao, Bing Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the empirical study, we introduce the CoT schemes for reasoning problems like arithmetic, parity, addition, multiplication, and division to train a Transformer to achieve LG for these problems. Our experiments verify (1) for a CoT scheme of a problem, if it is (n, r)-consistent, it is solvable for LG, and (2) for the same problem, one CoT scheme may not be solvable for LG, but another may. Fig. 1 shows the problem achieves 100% accuracy for all test sets as the problem is (1, 17)-consistent.
Researcher Affiliation | Academia | Bing Liu, Department of Computer Science, University of Illinois Chicago, EMAIL
Pseudocode | No | The paper describes algorithms and methods but does not present any formal pseudocode blocks or algorithm sections. For instance, the proof sketch for Theorem 3.6 describes steps like "The 1st layer applies a local padding mask... The 1st feed-forward layer maps... The 2nd attention layer has no mask..." but these are descriptive, not pseudocode.
Open Source Code | Yes | The code of our system can be downloaded at https://openreview.net/forum?id=zpENPcQSj1.
Open Datasets | No | Every training or test set is generated independently. The training set and each test set are generated in the same way for each problem, except that for the training set we also generate CoT steps for each problem instance based on the individual CoT schemes, while for the test sets we do not. The paper describes a custom data generation process and does not refer to any pre-existing public datasets or provide public access links to its generated data.
Dataset Splits | Yes | We use 6 test sets to evaluate the model learned for each problem. The 5 columns marked LG Test i in Table 1 give the length ranges of the 5 test sets for each problem, where the maximum lengths of the test sets increase gradually. The first test set has the same length range as that of the training set and thus shares the Train Length column. Every test set consists of 1k questions (test problem instances), which are in sequence format with no CoT steps, e.g., 3 + 2 2. The training data for each task contains 12.8M CoT steps.
Hardware Specification | Yes | Each experiment is run on a machine with 8 CPU cores.
Software Dependencies | No | The optimizer is Adam and the learning rate is 0.0001. The paper names Adam as the optimizer but does not specify versions for any programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | The optimizer is Adam and the learning rate is 0.0001. The training data for each task contains 12.8M CoT steps. Due to the complexity of multiplication and division, they are additionally trained on 25.6M CoT steps with learning rate 0.000005.
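As a concrete illustration of the generation scheme quoted under Open Datasets and Dataset Splits (training instances carry CoT steps, test instances are bare sequences), the sketch below builds parity instances. The serialization format, the delimiters, and the function name `make_parity_instance` are hypothetical, since the paper does not specify them:

```python
import random

def make_parity_instance(length, with_cot):
    """Generate one parity problem instance.

    Hypothetical format: the question is the bit sequence; the CoT
    steps (training only) list the running parity after each bit.
    """
    bits = [random.randint(0, 1) for _ in range(length)]
    question = " ".join(map(str, bits))
    answer = sum(bits) % 2
    if not with_cot:
        # Test instances: question and answer only, no CoT steps.
        return f"{question} = {answer}"
    # Training instances: emit the running parity after each bit as CoT.
    running, steps = 0, []
    for b in bits:
        running ^= b
        steps.append(str(running))
    return f"{question} : {' '.join(steps)} = {answer}"

# Training sets would use with_cot=True; each of the test sets,
# drawn from a longer length range, would use with_cot=False.
train_example = make_parity_instance(5, with_cot=True)
test_example = make_parity_instance(20, with_cot=False)
```

This mirrors the quoted split design only in outline; the paper's actual CoT schemes for arithmetic, addition, multiplication, and division are problem-specific and not reproduced here.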
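The quoted hyperparameters (Adam, learning rate 0.0001, and 0.000005 for the additional multiplication/division training) can be collected into a minimal training configuration. PyTorch is an assumption here, as the paper does not name its framework, and the model dimensions are illustrative only:

```python
import torch
import torch.nn as nn

# Learning rates quoted in the paper; everything else is illustrative.
BASE_LR = 1e-4        # standard training, 12.8M CoT steps per task
HARD_TASK_LR = 5e-6   # additional 25.6M steps for multiplication/division

model = nn.Transformer(d_model=64, nhead=4)  # sizes not from the paper
optimizer = torch.optim.Adam(model.parameters(), lr=BASE_LR)

# For the additional multiplication/division training phase, the
# learning rate is lowered in place on the same optimizer.
for group in optimizer.param_groups:
    group["lr"] = HARD_TASK_LR
```

The two-phase schedule (base rate, then a much smaller rate for the harder tasks) follows the quoted setup; how the paper actually switches rates between phases is not specified.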