Position: General Intelligence Requires Reward-based Pretraining

Authors: Seungwook Han, Jyothish Pari, Samuel J. Gershman, Pulkit Agrawal

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments with algorithmic tasks in esoteric programming languages reveal that LLM reasoning overfits to the training data and is limited in its transferability. Our results in Section 3 show that state-of-the-art LLMs struggle to transfer their algorithmic understanding to coding in new programming syntaxes."
Researcher Affiliation | Academia | "1 Improbable AI Lab, MIT; 2 Department of Psychology and Center for Brain Science, Harvard University. Correspondence to: Seungwook Han <EMAIL>, Jyothish Pari <EMAIL>."
Pseudocode | Yes | "Algorithm 1: Curriculum-Guided Reasoning with External Memory"
Open Source Code | No | The paper does not provide an explicit statement or a link to source code for the methodology described. It mentions using existing libraries and models such as Google DeepMind's mctx library and Qwen 1.5B, but not its own implementation code.
Open Datasets | Yes | "We collect 80,824 professional 9×9 Go game trajectories from online sources such as Go Quest (Go Quest, 2024) and other research archives (Müller, 2024; Brouwer, 2024)"
Dataset Splits | Yes | "We created 100 training examples and 100 test examples." The number of examples used for each language-task evaluation is as follows: Brainf**k Copy: 100; Brainf**k Print: 676; Brainf**k Sort: 100; Befunge Print: 100; Befunge Fibonacci: 1; Befunge Factorial: 15. "We train the network for 10 epochs with a batch size of 1024, a learning rate of 1e-3, and weight decay of 1e-4."
Hardware Specification | Yes | "This training procedure takes approximately 14 days on 4 A100 GPUs."
Software Dependencies | No | The paper mentions using Google DeepMind's mctx library (DeepMind, 2024) but does not specify a version number for it or for any other key software component, such as programming languages or frameworks.
Experiment Setup | Yes | "We finetuned the Qwen/Qwen2.5-1.5B-Instruct model on 100 synthetic examples for 100 epochs using Low-Rank Adaptation (LoRA) (Hu et al., 2021) with rank=256, alpha=32, and a dropout rate of 0.05 applied to the query, key, and value matrices (Q, K, V). The training employed a cosine learning rate schedule with an initial learning rate of 5e-4, a batch size of 64, and 10 warmup steps. ... We train the network for 10 epochs with a batch size of 1024, a learning rate of 1e-3, and weight decay of 1e-4. ... We use a batch size of 1024, a starting learning rate of 1e-2 (with cosine decay over 200 total iterations), and weight decay of 1e-4. ... The RL hyperparameters included a batch size of 36, a single PPO epoch per iteration, and a KL coefficient of 0.5."
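The quoted setup mentions a "cosine learning rate schedule" with an initial learning rate of 5e-4 and 10 warmup steps. A minimal sketch of one common reading of that description (linear warmup followed by cosine decay; the function name, the zero floor, and the linear-warmup shape are our assumptions, not stated in the paper):

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-4, warmup_steps=10, min_lr=0.0):
    """Linear-warmup + cosine-decay schedule (an assumed standard form of the
    paper's 'cosine learning rate schedule with ... 10 warmup steps')."""
    if step < warmup_steps:
        # Linear warmup from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With `total_steps=200` (matching the "200 total iterations" quoted for the second training run), the rate rises linearly over the first 10 steps to 5e-4, then decays smoothly toward zero.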