Automatic Curriculum Expert Iteration for Reliable LLM Reasoning
Authors: Zirui Zhao, Hanze Dong, Amrita Saha, Caiming Xiong, Doyen Sahoo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare AUTO-CEI with various SOTA baselines across logical reasoning, mathematics, and planning tasks, where AUTO-CEI achieves superior alignment by effectively balancing assertiveness and conservativeness. We carried out comprehensive experiments in various reasoning benchmarks, and AUTO-CEI significantly outperforms the concurrent baseline methods, boosting precision by 10-24% while maintaining a relatively low refusal rate of 18-36% across diverse reasoning tasks in planning, logical and math reasoning. |
| Researcher Affiliation | Collaboration | Zirui Zhao1 Hanze Dong2 Amrita Saha2 Caiming Xiong2 Doyen Sahoo2 1National University of Singapore 2Salesforce AI Research |
| Pseudocode | Yes | Algorithm 1 INITIALISATION(D, π) ... Algorithm 2 EXPERTITER(D, πei, R, Dval, π) ... Algorithm 3 AUTO-CEI(D, π, R, f, Dval) |
| Open Source Code | Yes | The code is available at https://github.com/SalesforceAIResearch/Auto-CEI. |
| Open Datasets | Yes | To demonstrate the effectiveness of AUTO-CEI in reasoning tasks, we select BoardgameQA (Kazemi et al., 2024), MATH (Hendrycks et al., 2021), and Blocksworld (Valmeekam et al., 2023) as benchmarks, spanning logical and mathematical reasoning as well as planning. They cover various domains and complexities. We briefly introduce the benchmarks and report our detailed experimental settings in Appendix F. ... The dataset is downloaded from https://storage.googleapis.com/gresearch/BoardgameQA/BoardgameQA.zip ... We download the MATH dataset via Hugging Face, which automatically divides the dataset into training, validation, and test sets. ... The Blocksworld dataset can be generated using the code in the GitHub repository by Valmeekam et al. (2023). |
| Dataset Splits | Yes | For our case, we generate domains from 4 blocks to 6 blocks and randomly sample 500 data points for training. We randomly sample 500 data points each for the validation and testing sets, whose optimal solution length (i.e., ground-truth plan) is no longer than ten steps. We uniformly sample tasks according to the ground-truth lengths to form the testing set (i.e., 100 two-step tasks, 100 four-step tasks, ..., and 100 ten-step tasks). |
| Hardware Specification | Yes | The experiments are conducted in a server with 8 Nvidia A100 (40GB) GPUs. |
| Software Dependencies | No | The paper mentions using specific software components such as "Llama-3.1-8B-instruct", "LoRA", "DeepSpeed Stage 2", and the "Hugging Face SFT trainer", but it does not provide version numbers for these software dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | We use Llama-3.1-8B-instruct (Dubey et al., 2024) as the backbone model and use LoRA (r = 128, α = 64) to fine-tune. ... During the SFT stage, we choose batch size = 256; the epoch number is 5 for the initial SFT process and 2 for SFT in the R-Tuning and EI processes. The learning rate of SFT is 1 × 10⁻⁴. ... In practice, we use temperature = 1.0 and top-p = 0.95 to sample responses. ... c1 is initialised by the mean value of reasoning steps produced by the initial LLM policy on the validation set. c2 is computed by solving (1 − exp(−c2/(2σ))) / (1 + exp(−c2/(2σ))) = 0.9 ... λ ∈ [0, 1] is a hyperparameter to control the tradeoff between hallucination and laziness ... |
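
The condition used to set c2 in the experiment setup is garbled in the extracted text; assuming it reads (1 − exp(−c2/(2σ))) / (1 + exp(−c2/(2σ))) = 0.9 (our reconstruction), c2 has a closed form: substituting t = exp(−c2/(2σ)) gives t = (1 − 0.9)/(1 + 0.9) = 1/19, hence c2 = 2σ ln 19. A minimal sketch of that computation (the equation form and the example value of `sigma` are assumptions, not taken from the paper):

```python
import math

def solve_c2(sigma: float, target: float = 0.9) -> float:
    """Solve (1 - exp(-c2/(2*sigma))) / (1 + exp(-c2/(2*sigma))) = target.

    With t = exp(-c2/(2*sigma)), the condition becomes (1 - t)/(1 + t) = target,
    so t = (1 - target)/(1 + target) and c2 = -2*sigma*ln(t).
    """
    t = (1.0 - target) / (1.0 + target)
    return -2.0 * sigma * math.log(t)

def lhs(c2: float, sigma: float) -> float:
    """Evaluate the left-hand side, to sanity-check the closed form."""
    t = math.exp(-c2 / (2.0 * sigma))
    return (1.0 - t) / (1.0 + t)

if __name__ == "__main__":
    sigma = 3.0  # hypothetical value; the paper derives its constants from validation statistics
    c2 = solve_c2(sigma)
    print(c2, lhs(c2, sigma))
```

For target = 0.9 this reduces to c2 = 2σ ln 19 ≈ 5.89σ, so the threshold scales linearly with σ.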
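
The table names Algorithm 2 (EXPERTITER) but does not reproduce its body. A generic expert-iteration round, sketched with stand-in helpers (`sample_fn`, `reward_fn`, `fine_tune_fn` are our own names, not the paper's; the paper's curriculum and reward shaping are omitted), typically looks like:

```python
def expert_iteration(prompts, sample_fn, reward_fn, fine_tune_fn, k=8, rounds=1):
    """Sketch of a plain expert-iteration loop.

    Each round: sample k candidate responses per prompt, keep the
    highest-reward candidate if its reward is positive, then fine-tune
    the policy on the kept (prompt, response) pairs.
    """
    batch = []
    for _ in range(rounds):
        batch = []
        for x in prompts:
            candidates = [sample_fn(x) for _ in range(k)]
            best = max(candidates, key=lambda y: reward_fn(x, y))
            if reward_fn(x, best) > 0:  # discard prompts with no usable sample
                batch.append((x, best))
        fine_tune_fn(batch)  # e.g. an SFT step on the filtered pairs
    return batch
```

In the paper's setting, sampling would use temperature 1.0 and top-p 0.95 as reported, and the reward would be the curriculum-shaped reward R rather than a plain correctness check.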