CodeIO: Condensing Reasoning Patterns via Code Input-Output Prediction
Authors: Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, Junxian He
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate CODEI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. |
| Researcher Affiliation | Collaboration | 1DeepSeek-AI 2Shanghai Jiao Tong University 3HKUST. Correspondence to: Junlong Li <EMAIL>, Junxian He <EMAIL>. |
| Pseudocode | No | The paper includes Python code for `strict_check_size(obj)` in Appendix A, but it is presented as actual code rather than pseudocode and is not explicitly labeled as an algorithm block. |
| Open Source Code | Yes | Our data and models are available at https://github.com/hkust-nlp/CodeIO. |
| Open Datasets | Yes | Our data and models are available at https://github.com/hkust-nlp/CodeIO. We validate the effectiveness of CODEI/O and CODEI/O++ across four base models with parameter sizes ranging from 7B to 30B. Assessments across 14 different benchmarks show training on them enhances performance on a diverse range of reasoning tasks...Evaluation Benchmarks We evaluate all models on these benchmarks: DROP (Dua et al., 2019), WinoGrande (Sakaguchi et al., 2020), GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), MMLU-STEM (Hendrycks et al., 2021a), BBH (Suzgun et al., 2023), GPQA (Rein et al., 2024), CruxEval (Gu et al., 2024), ZebraGrid (Lin et al., 2025). |
| Dataset Splits | No | The paper describes the construction of the CODEI/O dataset and its total size (3.5M training samples), and notes that 'The distribution of input and output prediction instances is roughly balanced at 50%/50%.' It also mentions 'randomly sampling training instances' for scaling experiments. However, it does not provide explicit training, validation, and test splits for the CODEI/O dataset itself; the dataset serves as an intermediate training stage, with evaluation performed on external benchmarks. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running its experiments. It mentions the base models used (e.g., Qwen 2.5 Coder 7B, LLaMA 3.1 8B, DeepSeek-Coder-V2-Lite 16B, Gemma 2 27B) but not the hardware they were run on. |
| Software Dependencies | No | The paper mentions a 'python input generator' and uses 'import numpy as np' and 'from pympler import asizeof' in code examples within the appendix, implying the use of Python, NumPy, and Pympler. However, it does not specify version numbers for Python or any of these libraries, information that is needed for exact reproducibility. |
| Experiment Setup | Yes | During the first stage, we train for 1 epoch using a constant learning rate, which is set to 1e-5 for the three smaller models and 4e-6 for Gemma 2 27B. The batch size is 1024. In the second stage, we train for 700 steps with a batch size of 1024 as well, corresponding to about 3 epochs of the entire instruction-tuning dataset. The learning rate is set to 3e-5 for the three smaller models and 1e-5 for Gemma 2 27B, using a cosine scheduler decaying to 1e-6 and 3e-7, respectively. In both training stages, no warmup period is applied and the maximum sequence length is set to 4096. |
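To make the two-stage setup quoted above easier to scan, here is a minimal sketch of the reported hyperparameters collected into a plain-Python config. The structure and key names (`stage1`, `stage2`, `smaller_models`, `gemma2_27b`) are our own illustrative labels, not identifiers from the paper or its repository; the numeric values are taken directly from the Experiment Setup quote.

```python
# Hedged sketch: hyperparameters reported in the paper's experiment setup,
# grouped by training stage. Key names are illustrative, values are from the paper.
TRAIN_CONFIG = {
    "stage1": {                     # CODEI/O(++) training stage
        "epochs": 1,
        "batch_size": 1024,
        "lr_schedule": "constant",
        "lr": {"smaller_models": 1e-5, "gemma2_27b": 4e-6},
    },
    "stage2": {                     # instruction-tuning stage (~3 epochs)
        "steps": 700,
        "batch_size": 1024,
        "lr_schedule": "cosine",    # decays toward min_lr
        "lr": {"smaller_models": 3e-5, "gemma2_27b": 1e-5},
        "min_lr": {"smaller_models": 1e-6, "gemma2_27b": 3e-7},
    },
    "warmup_steps": 0,              # no warmup in either stage
    "max_seq_len": 4096,
}

def learning_rate_for(stage: str, model_group: str) -> float:
    """Look up the peak learning rate for a stage and model group."""
    return TRAIN_CONFIG[stage]["lr"][model_group]
```

A config laid out this way makes the reproducibility gap concrete: every value here is stated in the paper, while anything absent from it (optimizer, weight decay, hardware) would have to be guessed.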