Importance Weighting Can Help Large Language Models Self-Improve

Authors: Chunyang Jiang, Chi-Min Chan, Wei Xue, Qifeng Liu, Yike Guo

AAAI 2025

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | Experiments show that with only a tiny valid set (at most 5% of the training set's size) to compute the DS weight, our approach can notably improve the reasoning ability of current LLM self-improvement methods. The resulting performance is on par with methods that rely on external supervision from pre-trained reward models. ... We conduct experiments on six datasets across three types of tasks. Arithmetic Reasoning: GSM8K (Cobbe et al. 2021) and SVAMP (Patel, Bhattamishra, and Goyal 2021). Natural Language Inference: Adversarial NLI (Nie et al. 2020); the ANLI-A1 and ANLI-A2 subsets are used. Commonsense Reasoning: OpenBookQA (Mihaylov et al. 2018) and StrategyQA (Geva et al. 2021). ... The main comparison results are shown in Table 1. The evaluation metric is accuracy (%) and all results are derived by greedy decoding.
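The row above mentions computing a "DS weight" on a small valid set but does not define it. As a generic illustration of the underlying idea only (importance weighting: up-weight self-generated training samples whose features look like the trusted valid set), here is a minimal pure-Python sketch. The function names, the histogram density estimate, and the discrete feature representation are all assumptions for illustration, not the paper's actual DS weight:

```python
from collections import Counter

def density(samples, bins):
    """Histogram density estimate over discrete feature bins (illustrative only)."""
    counts = Counter(samples)
    total = len(samples)
    return {b: counts.get(b, 0) / total for b in bins}

def importance_weights(train_feats, valid_feats, eps=1e-8):
    """w(x) = p_valid(x) / p_train(x): samples that are common in the small
    trusted valid set but rare in the self-generated set get larger weight."""
    bins = set(train_feats) | set(valid_feats)
    p_train = density(train_feats, bins)
    p_valid = density(valid_feats, bins)
    return [p_valid[x] / (p_train[x] + eps) for x in train_feats]

def weighted_loss(losses, weights):
    """Importance-weighted mean of per-sample training losses."""
    total_w = sum(weights)
    return sum(l * w for l, w in zip(losses, weights)) / total_w
```

With equal per-sample losses the weighted mean reduces to the plain mean, so the weights only change training emphasis, not its scale.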
Researcher Affiliation | Academia | Chunyang Jiang, Chi-Min Chan, Wei Xue, Qifeng Liu, Yike Guo; Hong Kong University of Science and Technology; EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology and workflow in descriptive text and a figure (Figure 1), but contains no dedicated pseudocode block or algorithm section with structured, code-like steps.
Open Source Code | Yes | The source code and supplementary materials are available at https://github.com/rubickkcibur/IWSI.
Open Datasets | Yes | We conduct experiments on six datasets across three types of tasks. Arithmetic Reasoning: GSM8K (Cobbe et al. 2021) and SVAMP (Patel, Bhattamishra, and Goyal 2021). Natural Language Inference: Adversarial NLI (Nie et al. 2020); the ANLI-A1 and ANLI-A2 subsets are used. Commonsense Reasoning: OpenBookQA (Mihaylov et al. 2018) and StrategyQA (Geva et al. 2021).
Dataset Splits | Yes | The size of the valid set varies across datasets, but none exceeds 5% of the corresponding training set's size. Appendix A provides more details on the splits and statistics of all datasets.
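The split constraint quoted above (valid set at most 5% of the training set) can be sketched as follows; the helper name, seed handling, and the epsilon guard against float rounding are illustrative assumptions, not details from the paper:

```python
import random

def split_valid(examples, valid_frac=0.05, seed=0):
    """Hold out a small valid set so that |valid| <= valid_frac * |train|,
    matching the '<= 5% of the training set' constraint in the review."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    # Solve n_valid <= valid_frac * (N - n_valid); the 1e-9 absorbs float error.
    n_valid = max(1, int(len(examples) * valid_frac / (1 + valid_frac) + 1e-9))
    valid = [examples[i] for i in idx[:n_valid]]
    train = [examples[i] for i in idx[n_valid:]]
    return train, valid
```

For 105 examples this yields a 100/5 split, i.e. the valid set is exactly 5% of the training set.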
Hardware Specification | Yes | All training is performed on eight RTX-4090 GPUs.
Software Dependencies | No | We use LoRA (Hu et al. 2022) for fine-tuning, with the AdamW (Loshchilov and Hutter 2019) optimizer and a learning rate of 3e-4. The paper names specific tools (LoRA, AdamW) but provides no version numbers for the programming language (e.g., Python), deep learning framework (e.g., PyTorch), or CUDA.
Experiment Setup | Yes | We select Llama3-8B as our base model (Touvron et al. 2023). For each question, we generate 15 candidates with temperature T = 1.1. ... The per-device training batch size is set to 1 and gradient accumulation is 4 steps. We use LoRA (Hu et al. 2022) for fine-tuning and the AdamW (Loshchilov and Hutter 2019) optimizer with learning rate 3e-4. ... For fairness, we universally set the filtering percentage k = 80 for IWSI, Entropy-filter, Self-filter, and RM-filter.
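The row above fixes a filtering percentage k = 80 across all compared filters. Assuming k is the percentage of candidates retained after ranking by their per-sample score or weight (the review does not define it, so this interpretation and the function name are assumptions), a minimal sketch:

```python
def filter_top_k_percent(samples, weights, k=80):
    """Keep the top k% of candidate samples ranked by weight,
    returned in their original order."""
    n_keep = max(1, int(len(samples) * k / 100))
    # Rank (weight, index) pairs descending, keep the n_keep best indices.
    ranked = sorted(zip(weights, range(len(samples))), reverse=True)
    keep_idx = sorted(i for _, i in ranked[:n_keep])
    return [samples[i] for i in keep_idx]
```

With 15 generated candidates per question and k = 80, this would retain 12 candidates; the same cutoff would then be applied whether the weights come from entropy, self-evaluation, a reward model, or the DS weight.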