CodeIO: Condensing Reasoning Patterns via Code Input-Output Prediction
Authors: Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, Junxian He
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate CODEI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. |
| Researcher Affiliation | Collaboration | 1DeepSeek-AI 2Shanghai Jiao Tong University 3HKUST. Correspondence to: Junlong Li <EMAIL>, Junxian He <EMAIL>. |
| Pseudocode | No | The paper includes Python code for `strict_check_size(obj)` in Appendix A, but it is presented as actual code rather than pseudocode and is not explicitly labeled as an algorithm block. |
| Open Source Code | Yes | Our data and models are available at https://github.com/hkust-nlp/CodeIO. |
| Open Datasets | Yes | Our data and models are available at https://github.com/hkust-nlp/CodeIO. We validate the effectiveness of CODEI/O and CODEI/O++ across four base models with parameter sizes ranging from 7B to 30B. Assessments across 14 different benchmarks show training on them enhances performance on a diverse range of reasoning tasks...Evaluation Benchmarks We evaluate all models on these benchmarks: DROP (Dua et al., 2019), WinoGrande (Sakaguchi et al., 2020), GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), MMLU-STEM (Hendrycks et al., 2021a), BBH (Suzgun et al., 2023), GPQA (Rein et al., 2024), CruxEval (Gu et al., 2024), ZebraGrid (Lin et al., 2025). |
| Dataset Splits | No | The paper describes the construction of the CODEI/O dataset and its total size (3.5M training samples), and notes that 'The distribution of input and output prediction instances is roughly balanced at 50%/50%.' It also mentions 'randomly sampling training instances' for scaling experiments. However, it does not provide explicit training, validation, and test splits for the CODEI/O dataset itself; the dataset serves as an intermediate training stage, with evaluation performed on external benchmarks. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running its experiments. It mentions the base models used (e.g., Qwen 2.5 Coder 7B, LLaMA 3.1 8B, DeepSeek-Coder-V2-Lite 16B, Gemma 2 27B) but not the hardware they were run on. |
| Software Dependencies | No | The paper mentions a 'python input generator' and uses 'import numpy as np' and 'from pympler import asizeof' in code examples within the appendix, implying the use of Python, NumPy, and Pympler. However, it does not specify version numbers for Python or any of these libraries, information that is needed for exact reproducibility. |
| Experiment Setup | Yes | During the first stage, we train for 1 epoch using a constant learning rate, which is set to 1e-5 for the three smaller models and 4e-6 for Gemma 2 27B. The batch size is 1024. In the second stage, we train for 700 steps with a batch size of 1024 as well, corresponding to about 3 epochs of the entire instruction-tuning dataset. The learning rate is set to 3e-5 for the three smaller models and 1e-5 for Gemma 2 27B, using a cosine scheduler decaying to 1e-6 and 3e-7, respectively. In both training stages, no warmup period is applied and the maximum sequence length is set to 4096. |
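To make the two-stage setup quoted above easier to scan, here is a minimal sketch of the reported hyperparameters collected into a plain-Python config. The structure and key names (`stage1`, `stage2`, `smaller_models`, `gemma2_27b`) are our own illustrative labels, not identifiers from the paper or its repository; the numeric values are taken directly from the Experiment Setup quote.

```python
# Hedged sketch: hyperparameters reported in the paper's experiment setup,
# grouped by training stage. Key names are illustrative, values are from the paper.
TRAIN_CONFIG = {
    "stage1": {                     # CODEI/O(++) training stage
        "epochs": 1,
        "batch_size": 1024,
        "lr_schedule": "constant",
        "lr": {"smaller_models": 1e-5, "gemma2_27b": 4e-6},
    },
    "stage2": {                     # instruction-tuning stage (~3 epochs)
        "steps": 700,
        "batch_size": 1024,
        "lr_schedule": "cosine",    # decays toward min_lr
        "lr": {"smaller_models": 3e-5, "gemma2_27b": 1e-5},
        "min_lr": {"smaller_models": 1e-6, "gemma2_27b": 3e-7},
    },
    "warmup_steps": 0,              # no warmup in either stage
    "max_seq_len": 4096,
}

def learning_rate_for(stage: str, model_group: str) -> float:
    """Look up the peak learning rate for a stage and model group."""
    return TRAIN_CONFIG[stage]["lr"][model_group]
```

A config laid out this way makes the reproducibility gap concrete: every value here is stated in the paper, while anything absent from it (optimizer, weight decay, hardware) would have to be guessed.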