ReGenesis: LLMs can Grow into Reasoning Generalists via Self-Improvement
Authors: Xiangyu Peng, Congying Xia, Xinyi Yang, Caiming Xiong, Chien-Sheng Wu, Chen Xing
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that ReGenesis achieves superior performance on all in-domain and OOD settings tested compared to existing methods. For six OOD tasks specifically, while previous methods exhibited an average performance decrease of approximately 4.6% after post-training, ReGenesis delivers around 6.1% performance improvement. We also conduct in-depth analysis of our framework and show ReGenesis is effective across various LLMs and design choices. ... In this section, we assess the effectiveness of ReGenesis and existing methods in improving LLM’s reasoning capabilities for both in-domain tasks and OOD tasks. |
| Researcher Affiliation | Industry | Xiangyu Peng, Congying Xia, Xinyi Yang, Caiming Xiong, Chien-Sheng Wu, Chen Xing — Salesforce AI Research |
| Pseudocode | No | The paper describes the method's steps in paragraph text and uses a diagram (Figure 1) to illustrate the process, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository. |
| Open Datasets | Yes | We selected the following five datasets for training, including mathematic, logical and common sense reasoning tasks. The training sets of these datasets are used to generate self-synthetic reasoning paths and to fine-tune the language models. The test sets are employed to evaluate the in-domain performance of the fine-tuned models. Mathematic reasoning: GSM8K math problem dataset (Cobbe et al., 2021) and NumGLUE dataset (Mishra et al., 2022). ... Logical reasoning: We use logical reasoning dataset ReClor (Yu et al., 2020). Commonsense reasoning: AI2 Reasoning Challenge (ARC) (Clark et al., 2018) and StrategyQA (Geva et al., 2021) dataset. For ARC, we specifically use only the Challenge subset (ARC-c). ... Mathematic reasoning: We use ASDIV (Miao et al., 2020), SVAMP (Patel et al., 2021) and the AQUA-RAT (Algebra Question Answering with Rationales) (Ling et al., 2017) datasets. Logical reasoning: BIG-Bench Hard (BBH) (Suzgun et al., 2023) dataset, a subset of BIG-Bench. Natural Language Inference (NLI): We utilize the Adversarial NLI (ANLI) (Mihaylov et al., 2018b) subsets ANLI-A2 and ANLI-A3. ... Commonsense Reasoning: We use OpenBookQA (Mihaylov et al., 2018a). |
| Dataset Splits | Yes | We selected the following five datasets for training... The training sets of these datasets are used to generate self-synthetic reasoning paths and to fine-tune the language models. The test sets are employed to evaluate the in-domain performance of the fine-tuned models. ... Finally, we randomly select up to p = 5 reasoning paths per question as target outputs. ... To integrate the datasets of two, three, and four tasks, we randomly choose 3,500, 2,333, and 1,750 instruction-reasoning-paths-answer pairs from two, three, and four of these five datasets, respectively. |
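The sampling scheme quoted above (up to p = 5 reasoning paths per question, and equal-share mixing of 3,500 / 2,333 / 1,750 pairs when combining two, three, or four datasets) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `select_paths` and `mix_datasets` are hypothetical, and the paper's actual pipeline details are not shown in the excerpt.

```python
import random

def select_paths(paths_per_question, p=5, rng=None):
    # Randomly keep up to p reasoning paths per question (the paper uses p = 5).
    rng = rng or random.Random(0)
    return {q: rng.sample(paths, min(p, len(paths)))
            for q, paths in paths_per_question.items()}

def mix_datasets(datasets, total=7000, rng=None):
    # Draw an equal share of instruction-reasoning-paths-answer pairs from
    # each dataset: 3,500 each for 2 datasets, 2,333 for 3, 1,750 for 4
    # (i.e. total // k per dataset, matching the figures quoted above).
    rng = rng or random.Random(0)
    per_dataset = total // len(datasets)
    mixed = []
    for ds in datasets:
        mixed.extend(rng.sample(ds, min(per_dataset, len(ds))))
    return mixed
```

With three datasets, `mix_datasets` draws 7000 // 3 = 2,333 pairs from each, consistent with the reported split sizes.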
| Hardware Specification | Yes | utilizing an 8-GPU node of A100 GPUs, each with 40GB of memory. |
| Software Dependencies | No | The paper mentions specific models like Mistral-7B-Instruct-v0.3 and Meta-Llama-3-8B-Instruct, and frameworks like vLLM and Llamafactory, but it does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The final step involves fine-tuning the Mistral-7B-Instruct-v0.3 model using the generated reasoning paths, with a learning rate of 1e-6 and training for 3 epochs and a batch size of 16, utilizing an 8-GPU node of A100 GPUs, each with 40GB of memory. More details can be found in Appendix A.5.4. ... Table 21: Fine-Tuning with Ours: Training Parameters. learning rate 1e-6, epochs 3, batch size 16, weight decay 0.1, lr scheduler type cosine, warmup ratio 0.03. |
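The fine-tuning hyperparameters reported in the excerpt (and in the paper's Table 21) can be collected into a single configuration for reference. This is a plain-dict sketch only; how these values are wired into the actual training framework (the paper mentions Llamafactory) is not shown in the excerpt and is left as an assumption.

```python
# Hedged sketch: hyperparameters as reported in the paper's Table 21.
# The dict layout and key names are illustrative, not the paper's config format.
finetune_config = {
    "model": "Mistral-7B-Instruct-v0.3",
    "learning_rate": 1e-6,
    "num_epochs": 3,
    "batch_size": 16,
    "weight_decay": 0.1,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    # Hardware reported: one 8-GPU node of A100s, 40GB memory each.
}
```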