Position: General Intelligence Requires Reward-based Pretraining

Authors: Seungwook Han, Jyothish Pari, Samuel J. Gershman, Pulkit Agrawal

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments with algorithmic tasks in esoteric programming languages reveal that LLM reasoning overfits to the training data and is limited in its transferability. Our results in Section 3 show that state-of-the-art LLMs struggle to transfer their algorithmic understanding to coding in new programming syntaxes."
Researcher Affiliation | Academia | "1 Improbable AI Lab, MIT; 2 Department of Psychology and Center for Brain Science, Harvard University. Correspondence to: Seungwook Han <EMAIL>, Jyothish Pari <EMAIL>."
Pseudocode | Yes | "Algorithm 1: Curriculum-Guided Reasoning with External Memory"
Open Source Code | No | The paper does not provide an explicit statement or a link to source code for the methodology described. It mentions using existing libraries and models such as Google DeepMind's mctx library and Qwen 1.5B, but not its own implementation code.
Open Datasets | Yes | "We collect 80,824 professional 9×9 Go game trajectories from online sources such as Go Quest (Go Quest, 2024) and other research archives (Müller, 2024; Brouwer, 2024)"
Dataset Splits | Yes | "We created 100 training examples and 100 test examples." The number of examples used for each language-task evaluation is as follows: Brainf**k Copy: 100; Brainf**k Print: 676; Brainf**k Sort: 100; Befunge Print: 100; Befunge Fibonacci: 1; Befunge Factorial: 15. "We train the network for 10 epochs with a batch size of 1024, a learning rate of 1e-3, and weight decay of 1e-4."
Hardware Specification | Yes | "This training procedure takes approximately 14 days on 4 A100 GPUs."
Software Dependencies | No | The paper mentions using Google DeepMind's mctx library (DeepMind, 2024) but does not specify a version number for it or for any other key software component, such as programming languages or frameworks.
Experiment Setup | Yes | "We finetuned the Qwen/Qwen2.5-1.5B-Instruct model on 100 synthetic examples for 100 epochs using Low-Rank Adaptation (LoRA) (Hu et al., 2021) with rank=256, alpha=32, and a dropout rate of 0.05 applied to the query, key, and value matrices (Q, K, V). The training employed a cosine learning rate schedule with an initial learning rate of 5e-4, a batch size of 64, and 10 warmup steps. ... We train the network for 10 epochs with a batch size of 1024, a learning rate of 1e-3, and weight decay of 1e-4. ... We use a batch size of 1024, a starting learning rate of 1e-2 (with cosine decay over 200 total iterations), and weight decay of 1e-4. ... The RL hyperparameters included a batch size of 36, a single PPO epoch per iteration, and a KL coefficient of 0.5."
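The quoted setup mentions a "cosine learning rate schedule" with an initial learning rate of 5e-4 and 10 warmup steps. A minimal sketch of one common reading of that description (linear warmup followed by cosine decay; the function name, the zero floor, and the linear-warmup shape are our assumptions, not stated in the paper):

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-4, warmup_steps=10, min_lr=0.0):
    """Linear-warmup + cosine-decay schedule (an assumed standard form of the
    paper's 'cosine learning rate schedule with ... 10 warmup steps')."""
    if step < warmup_steps:
        # Linear warmup from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With `total_steps=200` (matching the "200 total iterations" quoted for the second training run), the rate rises linearly over the first 10 steps to 5e-4, then decays smoothly toward zero.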