ReGenesis: LLMs can Grow into Reasoning Generalists via Self-Improvement
Authors: Xiangyu Peng, Congying Xia, Xinyi Yang, Caiming Xiong, Chien-Sheng Wu, Chen Xing
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that ReGenesis achieves superior performance on all in-domain and OOD settings tested compared to existing methods. For six OOD tasks specifically, while previous methods exhibited an average performance decrease of approximately 4.6% after post-training, ReGenesis delivers around 6.1% performance improvement. We also conduct in-depth analysis of our framework and show ReGenesis is effective across various LLMs and design choices. ... In this section, we assess the effectiveness of ReGenesis and existing methods in improving LLM’s reasoning capabilities for both in-domain tasks and OOD tasks. |
| Researcher Affiliation | Industry | Xiangyu Peng, Congying Xia, Xinyi Yang, Caiming Xiong, Chien-Sheng Wu, Chen Xing — Salesforce AI Research |
| Pseudocode | No | The paper describes the method's steps in paragraph text and uses a diagram (Figure 1) to illustrate the process, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository. |
| Open Datasets | Yes | We selected the following five datasets for training, including mathematic, logical and common sense reasoning tasks. The training sets of these datasets are used to generate self-synthetic reasoning paths and to fine-tune the language models. The test sets are employed to evaluate the in-domain performance of the fine-tuned models. Mathematic reasoning: GSM8K math problem dataset (Cobbe et al., 2021) and NumGLUE dataset (Mishra et al., 2022). ... Logical reasoning: We use logical reasoning dataset ReClor (Yu et al., 2020). Commonsense reasoning: AI2 Reasoning Challenge (ARC) (Clark et al., 2018) and StrategyQA (Geva et al., 2021) dataset. For ARC, we specifically use only the Challenge subset (ARC-c). ... Mathematic reasoning: We use ASDIV (Miao et al., 2020), SVAMP (Patel et al., 2021) and the AQUA-RAT (Algebra Question Answering with Rationales) (Ling et al., 2017) datasets. Logical reasoning: BIG-Bench Hard (BBH) (Suzgun et al., 2023) dataset, a subset of BIG-Bench. Natural Language Inference (NLI): We utilize the Adversarial NLI (ANLI) (Mihaylov et al., 2018b) subsets ANLI-A2 and ANLI-A3. ... Commonsense Reasoning: We use OpenBookQA (Mihaylov et al., 2018a). |
| Dataset Splits | Yes | We selected the following five datasets for training... The training sets of these datasets are used to generate self-synthetic reasoning paths and to fine-tune the language models. The test sets are employed to evaluate the in-domain performance of the fine-tuned models. ... Finally, we randomly select up to p = 5 reasoning paths per question as target outputs. ... To integrate the datasets of two, three, and four tasks, we randomly choose 3,500, 2,333, and 1,750 instruction-reasoning-paths-answer pairs from two, three, and four of these five datasets, respectively. |
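The sampling scheme quoted above (up to p = 5 reasoning paths per question, and equal-share mixing of 3,500 / 2,333 / 1,750 pairs when combining two, three, or four datasets) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `select_paths` and `mix_datasets` are hypothetical, and the paper's actual pipeline details are not shown in the excerpt.

```python
import random

def select_paths(paths_per_question, p=5, rng=None):
    # Randomly keep up to p reasoning paths per question (the paper uses p = 5).
    rng = rng or random.Random(0)
    return {q: rng.sample(paths, min(p, len(paths)))
            for q, paths in paths_per_question.items()}

def mix_datasets(datasets, total=7000, rng=None):
    # Draw an equal share of instruction-reasoning-paths-answer pairs from
    # each dataset: 3,500 each for 2 datasets, 2,333 for 3, 1,750 for 4
    # (i.e. total // k per dataset, matching the figures quoted above).
    rng = rng or random.Random(0)
    per_dataset = total // len(datasets)
    mixed = []
    for ds in datasets:
        mixed.extend(rng.sample(ds, min(per_dataset, len(ds))))
    return mixed
```

With three datasets, `mix_datasets` draws 7000 // 3 = 2,333 pairs from each, consistent with the reported split sizes.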
| Hardware Specification | Yes | utilizing an 8-GPU node of A100 GPUs, each with 40GB of memory. |
| Software Dependencies | No | The paper mentions specific models like Mistral-7B-Instruct-v0.3 and Meta-Llama-3-8B-Instruct, and frameworks like vLLM and Llamafactory, but it does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The final step involves fine-tuning the Mistral-7B-Instruct-v0.3 model using the generated reasoning paths, with a learning rate of 1e-6 and training for 3 epochs and a batch size of 16, utilizing an 8-GPU node of A100 GPUs, each with 40GB of memory. More details can be found in Appendix A.5.4. ... Table 21: Fine-Tuning with Ours: Training Parameters. learning rate 1e-6, epochs 3, batch size 16, weight decay 0.1, lr scheduler type cosine, warmup ratio 0.03. |
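The fine-tuning hyperparameters reported in the excerpt (and in the paper's Table 21) can be collected into a single configuration for reference. This is a plain-dict sketch only; how these values are wired into the actual training framework (the paper mentions Llamafactory) is not shown in the excerpt and is left as an assumption.

```python
# Hedged sketch: hyperparameters as reported in the paper's Table 21.
# The dict layout and key names are illustrative, not the paper's config format.
finetune_config = {
    "model": "Mistral-7B-Instruct-v0.3",
    "learning_rate": 1e-6,
    "num_epochs": 3,
    "batch_size": 16,
    "weight_decay": 0.1,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    # Hardware reported: one 8-GPU node of A100s, 40GB memory each.
}
```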