Position: General Intelligence Requires Reward-based Pretraining
Authors: Seungwook Han, Jyothish Pari, Samuel J. Gershman, Pulkit Agrawal
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments with algorithmic tasks in esoteric programming languages reveal that LLMs' reasoning overfits to the training data and is limited in its transferability. Our results in Section 3 show that state-of-the-art LLMs struggle to transfer their algorithmic understanding to coding in new programming syntaxes. |
| Researcher Affiliation | Academia | ¹Improbable AI Lab, MIT; ²Department of Psychology and Center for Brain Science, Harvard University. Correspondence to: Seungwook Han <EMAIL>, Jyothish Pari <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Curriculum-Guided Reasoning with External Memory |
| Open Source Code | No | The paper does not provide an explicit statement or a link to source code for the methodology described. It mentions using existing libraries or models like 'Google DeepMind's mctx library' and 'Qwen 1.5B', but not its own implementation code. |
| Open Datasets | Yes | We collect 80,824 professional 9×9 Go game trajectories from online sources such as Go Quest (Go Quest, 2024) and other research archives (Müller, 2024; Brouwer, 2024) |
| Dataset Splits | Yes | We created 100 training examples and 100 test examples. The number of examples used for each language-task evaluation are as follows: Brainf**k Copy: 100, Brainf**k Print: 676, Brainf**k Sort: 100, Befunge Print: 100, Befunge Fibonacci: 1, Befunge Factorial: 15. We train the network for 10 epochs with a batch size of 1024, a learning rate of 10⁻³, and weight decay of 10⁻⁴. |
| Hardware Specification | Yes | This training procedure takes approximately 14 days on 4 A100 GPUs. |
| Software Dependencies | No | The paper mentions using 'Google DeepMind's mctx library (DeepMind, 2024)' but does not specify a version number for this or any other key software components, such as programming languages or frameworks. |
| Experiment Setup | Yes | We finetuned the Qwen/Qwen2.5-1.5B-Instruct model on 100 synthetic examples for 100 epochs using Low-Rank Adaptation (LoRA) (Hu et al., 2021) with rank=256, alpha=32, and a dropout rate of 0.05 applied to the query, key, and value matrices (Q, K, V). The training employed a cosine learning rate schedule with an initial learning rate of 5e-4, a batch size of 64, and 10 warmup steps. ... We train the network for 10 epochs with a batch size of 1024, a learning rate of 10⁻³, and weight decay of 10⁻⁴. ... We use a batch size of 1024, a starting learning rate of 10⁻² (with cosine decay over 200 total iterations), and weight decay of 10⁻⁴. ... The RL hyperparameters included a batch size of 36, a single PPO epoch per iteration, and a KL coefficient of 0.5. |
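The LoRA finetuning setup quoted in the Experiment Setup row can be sketched in code. This is a minimal, hypothetical reconstruction assuming the Hugging Face `peft` and `transformers` APIs; the module names `q_proj`, `k_proj`, `v_proj` and the output directory are assumptions, as the paper does not publish its implementation.

```python
# Hypothetical sketch of the quoted LoRA finetuning configuration,
# assuming Hugging Face `peft` and `transformers`. Hyperparameter
# values are taken from the paper's quoted setup; module names are
# assumed from the Qwen2.5 architecture.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

lora_config = LoraConfig(
    r=256,                  # rank=256
    lora_alpha=32,          # alpha=32
    lora_dropout=0.05,      # dropout on the adapted projections
    target_modules=["q_proj", "k_proj", "v_proj"],  # assumed names for Q, K, V
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="lora-finetune",      # hypothetical path
    num_train_epochs=100,            # 100 epochs on 100 synthetic examples
    per_device_train_batch_size=64,  # batch size of 64
    learning_rate=5e-4,              # initial learning rate 5e-4
    lr_scheduler_type="cosine",      # cosine learning rate schedule
    warmup_steps=10,                 # 10 warmup steps
)
```

Passing `training_args` to a `Trainer` together with the wrapped model would complete the loop; the paper's PPO stage (batch size 36, one PPO epoch per iteration, KL coefficient 0.5) would require a separate RL training library and is not sketched here.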