Automatic Curriculum Expert Iteration for Reliable LLM Reasoning
Authors: Zirui Zhao, Hanze Dong, Amrita Saha, Caiming Xiong, Doyen Sahoo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare AUTO-CEI with various SOTA baselines across logical reasoning, mathematics, and planning tasks, where AUTO-CEI achieves superior alignment by effectively balancing assertiveness and conservativeness. We carried out comprehensive experiments in various reasoning benchmarks, and AUTO-CEI significantly outperforms the concurrent baseline methods, boosting precision by 10-24% while maintaining a relatively low refusal rate of 18-36% across diverse reasoning tasks in planning, logical and math reasoning. |
| Researcher Affiliation | Collaboration | Zirui Zhao1 Hanze Dong2 Amrita Saha2 Caiming Xiong2 Doyen Sahoo2 1National University of Singapore 2Salesforce AI Research |
| Pseudocode | Yes | Algorithm 1 INITIALISATION(D, π) ... Algorithm 2 EXPERTITER(D, πei, R, Dval, π) ... Algorithm 3 AUTO-CEI(D, π, R, f, Dval) |
| Open Source Code | Yes | The code is available at https://github.com/SalesforceAIResearch/Auto-CEI. |
| Open Datasets | Yes | To demonstrate the effectiveness of AUTO-CEI in reasoning tasks, we select BoardgameQA (Kazemi et al., 2024), MATH (Hendrycks et al., 2021), and Blocksworld (Valmeekam et al., 2023) as benchmarks, spanning logical and mathematical reasoning as well as planning. They cover various domains and complexities. We briefly introduce the benchmarks and report our detailed experimental settings in Appendix F. ... The dataset is downloaded from https://storage.googleapis.com/gresearch/BoardgameQA/BoardgameQA.zip ... We download the MATH dataset via Hugging Face, which automatically divides the dataset into training, validation, and test sets. ... The Blocksworld dataset can be generated using the code in the GitHub repository by Valmeekam et al. (2023). |
| Dataset Splits | Yes | For our case, we generate domains from 4 blocks to 6 blocks and randomly sample 500 data points for training. We randomly sample 500 data points each for the validation and testing sets, whose optimal solution length (i.e., ground-truth plan) is no longer than ten steps. We uniformly sample tasks according to the ground-truth lengths to form the testing set (i.e., 100 two-step tasks, 100 four-step tasks, ..., and 100 ten-step tasks). |
| Hardware Specification | Yes | The experiments are conducted in a server with 8 Nvidia A100 (40GB) GPUs. |
| Software Dependencies | No | The paper mentions using specific software components such as "Llama-3.1-8B-instruct", "LoRA", "DeepSpeed Stage 2", and the "Hugging Face SFT trainer", but it does not provide version numbers for these software dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | We use Llama-3.1-8B-instruct (Dubey et al., 2024) as the backbone model and use LoRA (r = 128, α = 64) to fine-tune. ... During the SFT stage, we choose batch size = 256; the epoch number is 5 for the initial SFT process and 2 for SFT in the R-Tuning and EI processes. The learning rate of SFT is 1 × 10⁻⁴. ... In practice, we use temperature = 1.0 and top-p = 0.95 to sample responses. ... c1 is initialised by the mean value of reasoning steps produced by the initial LLM policy on the validation set. c2 is computed by solving (1 − exp(−c2/(2σ))) / (1 + exp(−c2/(2σ))) = 0.9 ... λ ∈ [0, 1] is a hyperparameter to control the tradeoff between hallucination and laziness ... |
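
The condition used to set c2 in the experiment setup is garbled in the extracted text; assuming it reads (1 − exp(−c2/(2σ))) / (1 + exp(−c2/(2σ))) = 0.9 (our reconstruction), c2 has a closed form: substituting t = exp(−c2/(2σ)) gives t = (1 − 0.9)/(1 + 0.9) = 1/19, hence c2 = 2σ ln 19. A minimal sketch of that computation (the equation form and the example value of `sigma` are assumptions, not taken from the paper):

```python
import math

def solve_c2(sigma: float, target: float = 0.9) -> float:
    """Solve (1 - exp(-c2/(2*sigma))) / (1 + exp(-c2/(2*sigma))) = target.

    With t = exp(-c2/(2*sigma)), the condition becomes (1 - t)/(1 + t) = target,
    so t = (1 - target)/(1 + target) and c2 = -2*sigma*ln(t).
    """
    t = (1.0 - target) / (1.0 + target)
    return -2.0 * sigma * math.log(t)

def lhs(c2: float, sigma: float) -> float:
    """Evaluate the left-hand side, to sanity-check the closed form."""
    t = math.exp(-c2 / (2.0 * sigma))
    return (1.0 - t) / (1.0 + t)

if __name__ == "__main__":
    sigma = 3.0  # hypothetical value; the paper derives its constants from validation statistics
    c2 = solve_c2(sigma)
    print(c2, lhs(c2, sigma))
```

For target = 0.9 this reduces to c2 = 2σ ln 19 ≈ 5.89σ, so the threshold scales linearly with σ.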
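
The table names Algorithm 2 (EXPERTITER) but does not reproduce its body. A generic expert-iteration round, sketched with stand-in helpers (`sample_fn`, `reward_fn`, `fine_tune_fn` are our own names, not the paper's; the paper's curriculum and reward shaping are omitted), typically looks like:

```python
def expert_iteration(prompts, sample_fn, reward_fn, fine_tune_fn, k=8, rounds=1):
    """Sketch of a plain expert-iteration loop.

    Each round: sample k candidate responses per prompt, keep the
    highest-reward candidate if its reward is positive, then fine-tune
    the policy on the kept (prompt, response) pairs.
    """
    batch = []
    for _ in range(rounds):
        batch = []
        for x in prompts:
            candidates = [sample_fn(x) for _ in range(k)]
            best = max(candidates, key=lambda y: reward_fn(x, y))
            if reward_fn(x, best) > 0:  # discard prompts with no usable sample
                batch.append((x, best))
        fine_tune_fn(batch)  # e.g. an SFT step on the filtered pairs
    return batch
```

In the paper's setting, sampling would use temperature 1.0 and top-p 0.95 as reported, and the reward would be the curriculum-shaped reward R rather than a plain correctness check.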