Importance Weighting Can Help Large Language Models Self-Improve

Authors: Chunyang Jiang, Chi-Min Chan, Wei Xue, Qifeng Liu, Yike Guo

AAAI 2025

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | Experiments show that with only a tiny valid set (at most 5% of the training set's size) to compute the DS weight, our approach can notably improve the reasoning ability of current LLM self-improvement methods. The resulting performance is on par with methods that rely on external supervision from pre-trained reward models. ... We conduct experiments on six datasets across three types of tasks. Arithmetic Reasoning: GSM8K (Cobbe et al. 2021) and SVAMP (Patel, Bhattamishra, and Goyal 2021). Natural Language Inference: Adversarial NLI (Nie et al. 2020); the ANLI-A1 and ANLI-A2 subsets are used. Commonsense Reasoning: OpenBookQA (Mihaylov et al. 2018) and StrategyQA (Geva et al. 2021). ... The main comparison results are shown in Table 1. The evaluation metric is accuracy (%) and all results are derived by greedy decoding.
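The row above mentions computing a "DS weight" on a small valid set but does not define it. As a generic illustration of the underlying idea only (importance weighting: up-weight self-generated training samples whose features look like the trusted valid set), here is a minimal pure-Python sketch. The function names, the histogram density estimate, and the discrete feature representation are all assumptions for illustration, not the paper's actual DS weight:

```python
from collections import Counter

def density(samples, bins):
    """Histogram density estimate over discrete feature bins (illustrative only)."""
    counts = Counter(samples)
    total = len(samples)
    return {b: counts.get(b, 0) / total for b in bins}

def importance_weights(train_feats, valid_feats, eps=1e-8):
    """w(x) = p_valid(x) / p_train(x): samples that are common in the small
    trusted valid set but rare in the self-generated set get larger weight."""
    bins = set(train_feats) | set(valid_feats)
    p_train = density(train_feats, bins)
    p_valid = density(valid_feats, bins)
    return [p_valid[x] / (p_train[x] + eps) for x in train_feats]

def weighted_loss(losses, weights):
    """Importance-weighted mean of per-sample training losses."""
    total_w = sum(weights)
    return sum(l * w for l, w in zip(losses, weights)) / total_w
```

With equal per-sample losses the weighted mean reduces to the plain mean, so the weights only change training emphasis, not its scale.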
Researcher Affiliation | Academia | Chunyang Jiang, Chi-Min Chan, Wei Xue, Qifeng Liu, Yike Guo; Hong Kong University of Science and Technology; EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology and workflow in descriptive text and a figure (Figure 1), but contains no dedicated pseudocode block or algorithm section with structured, code-like steps.
Open Source Code | Yes | The source code and supplementary materials are available at https://github.com/rubickkcibur/IWSI.
Open Datasets | Yes | We conduct experiments on six datasets across three types of tasks. Arithmetic Reasoning: GSM8K (Cobbe et al. 2021) and SVAMP (Patel, Bhattamishra, and Goyal 2021). Natural Language Inference: Adversarial NLI (Nie et al. 2020); the ANLI-A1 and ANLI-A2 subsets are used. Commonsense Reasoning: OpenBookQA (Mihaylov et al. 2018) and StrategyQA (Geva et al. 2021).
Dataset Splits | Yes | The size of the valid set varies across datasets, but none exceeds 5% of the corresponding training set's size. Appendix A provides more details on the splits and statistics of all datasets.
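The split constraint quoted above (valid set at most 5% of the training set) can be sketched as follows; the helper name, seed handling, and the epsilon guard against float rounding are illustrative assumptions, not details from the paper:

```python
import random

def split_valid(examples, valid_frac=0.05, seed=0):
    """Hold out a small valid set so that |valid| <= valid_frac * |train|,
    matching the '<= 5% of the training set' constraint in the review."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    # Solve n_valid <= valid_frac * (N - n_valid); the 1e-9 absorbs float error.
    n_valid = max(1, int(len(examples) * valid_frac / (1 + valid_frac) + 1e-9))
    valid = [examples[i] for i in idx[:n_valid]]
    train = [examples[i] for i in idx[n_valid:]]
    return train, valid
```

For 105 examples this yields a 100/5 split, i.e. the valid set is exactly 5% of the training set.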
Hardware Specification | Yes | All training is performed on eight RTX-4090 GPUs.
Software Dependencies | No | We use LoRA (Hu et al. 2022) for fine-tuning, with the AdamW (Loshchilov and Hutter 2019) optimizer and a learning rate of 3e-4. The paper names specific tools (LoRA, AdamW) but provides no version numbers for the programming language (e.g., Python), deep learning framework (e.g., PyTorch), or CUDA.
Experiment Setup | Yes | We select Llama3-8B as our base model (Touvron et al. 2023). For each question, we generate 15 candidates with temperature T = 1.1. ... The per-device training batch size is set to 1 and gradient accumulation is 4 steps. We use LoRA (Hu et al. 2022) for fine-tuning and the AdamW (Loshchilov and Hutter 2019) optimizer with learning rate 3e-4. ... For fairness, we universally set the filtering percentage k = 80 for IWSI, Entropy-filter, Self-filter, and RM-filter.
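The row above fixes a filtering percentage k = 80 across all compared filters. Assuming k is the percentage of candidates retained after ranking by their per-sample score or weight (the review does not define it, so this interpretation and the function name are assumptions), a minimal sketch:

```python
def filter_top_k_percent(samples, weights, k=80):
    """Keep the top k% of candidate samples ranked by weight,
    returned in their original order."""
    n_keep = max(1, int(len(samples) * k / 100))
    # Rank (weight, index) pairs descending, keep the n_keep best indices.
    ranked = sorted(zip(weights, range(len(samples))), reverse=True)
    keep_idx = sorted(i for _, i in ranked[:n_keep])
    return [samples[i] for i in keep_idx]
```

With 15 generated candidates per question and k = 80, this would retain 12 candidates; the same cutoff would then be applied whether the weights come from entropy, self-evaluation, a reward model, or the DS weight.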