Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Preserving Diversity in Supervised Fine-Tuning of Large Language Models
Authors: Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Zhi-Quan Luo, Ruoyu Sun
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present experiment results validating the effectiveness of our proposed framework, demonstrating that GEM matches the downstream performance of CE while offering two distinct advantages: (1) its diversity preservation enables a wider range of outputs, enhancing test-time scaling performance, and (2) it mitigates forgetting, effectively reducing the alignment tax. Extended results are available in Appendix F. |
| Researcher Affiliation | Academia | 1The Chinese University of Hong Kong, Shenzhen 2Shenzhen Research Institute of Big Data 3Nanjing University 4Hong Kong University of Science and Technology 5University of Pennsylvania 6Shenzhen International Center for Industrial and Applied Mathematics EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | The overall algorithm is outlined in Algorithm 1, which incorporates two key features: single-model optimization and variance-reduced gradient estimation. ... Algorithm 1 GEM ... Algorithm 2 GEM for Sequential Data |
| Open Source Code | Yes | Code is available at https://github.com/liziniu/GEM. |
| Open Datasets | Yes | We fine-tune the pre-trained Llama-3.1-8B model with the UltraFeedback dataset (Cui et al., 2024). We assess the model's chat ability using the best-of-N sampling strategy. We prompt the trained models to answer 805 questions from the AlpacaEval dataset (Li et al., 2023). We consider the HumanEval (Chen et al., 2021) benchmark, in which the model is asked to generate Python code for 164 questions, and the executor judges their correctness. |
| Dataset Splits | Yes | We fine-tune the pre-trained Llama-3.1-8B model with the UltraFeedback dataset (Cui et al., 2024). ...The dataset contains 61,135 training samples and 1,000 test samples. For the chatting task, we use the 805 test questions from the AlpacaEval dataset (Li et al., 2023). For the code generation task, there are 164 test questions for HumanEval. |
| Hardware Specification | Yes | All experiments are conducted using A800-80GB GPUs with the DeepSpeed distributed training framework, utilizing ZeRO-2 and gradient checkpointing without offloading. |
| Software Dependencies | No | All experiments are conducted using A800-80GB GPUs with the DeepSpeed distributed training framework, utilizing ZeRO-2 and gradient checkpointing without offloading. We use flash-attention-2 with deterministic backward for reproducibility. ...using Adam as the optimizer... |
| Experiment Setup | Yes | Set-up. We fine-tune the pre-trained Llama-3.1-8B model with the UltraFeedback dataset (Cui et al., 2024). ...we set the learning rate to 2×10⁻⁵, employing a cosine learning rate decay schedule, and use a macro batch size of 128. The maximum sequence length, encompassing both the prompt and response, is set to 2,048 tokens. Models are trained for three epochs. Detailed experimental settings are described in Appendix E. Appendix E: All experiments are conducted using A800-80GB GPUs... The experiments are based on the pretrained Llama3-8B model, using Adam as the optimizer with a global batch size of 128. Following (Yu et al., 2023; Liu et al., 2023; Cui et al., 2024), the learning rate is set to 2×10⁻⁵, with a warm-up ratio of 0.03 and cosine learning rate decay. Training is performed over 3 epochs. |
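The quoted setup (peak learning rate 2×10⁻⁵, warm-up ratio 0.03, cosine decay, global batch size 128, 3 epochs over 61,135 training samples) implies a learning-rate schedule along the lines of the following pure-Python sketch. The step count, function name, and linear-warmup shape are illustrative assumptions for this report, not details confirmed by the paper.

```python
import math

# Hyperparameters quoted in the paper's setup (Appendix E).
PEAK_LR = 2e-5
WARMUP_RATIO = 0.03
# ~61,135 samples / batch 128, rounded up, times 3 epochs -- an assumed step count.
TOTAL_STEPS = 3 * math.ceil(61_135 / 128)
WARMUP_STEPS = int(WARMUP_RATIO * TOTAL_STEPS)

def lr_at(step: int) -> float:
    """Assumed schedule: linear warmup over the first 3% of steps,
    then cosine decay from PEAK_LR down to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate rises from 0 to 2×10⁻⁵ during warmup, then decays monotonically to zero by the final step; frameworks such as DeepSpeed or HuggingFace Trainer implement equivalent warmup-plus-cosine schedules.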