Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Preserving Diversity in Supervised Fine-Tuning of Large Language Models
Authors: Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Zhi-Quan Luo, Ruoyu Sun
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present experiment results validating the effectiveness of our proposed framework, demonstrating that GEM matches the downstream performance of CE while offering two distinct advantages: (1) its diversity preservation enables a wider range of outputs, enhancing test-time scaling performance, and (2) it mitigates forgetting, effectively reducing the alignment tax. Extended results are available in Appendix F. |
| Researcher Affiliation | Academia | 1The Chinese University of Hong Kong, Shenzhen 2Shenzhen Research Institute of Big Data 3Nanjing University 4Hong Kong University of Science and Technology 5University of Pennsylvania 6Shenzhen International Center for Industrial and Applied Mathematics EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | The overall algorithm is outlined in Algorithm 1, which incorporates two key features: single-model optimization and variance-reduced gradient estimation. ... Algorithm 1 GEM ... Algorithm 2 GEM for Sequential Data |
| Open Source Code | Yes | Code is available at https://github.com/liziniu/GEM. |
| Open Datasets | Yes | We fine-tune the pre-trained Llama-3.1-8B model with the UltraFeedback dataset (Cui et al., 2024). We assess the model's chat ability using the best-of-N sampling strategy. We prompt the trained models to answer 805 questions from the AlpacaEval dataset (Li et al., 2023). We consider the HumanEval (Chen et al., 2021) benchmark, in which the model is asked to generate Python code for 164 questions, and the executor judges their correctness. |
| Dataset Splits | Yes | We fine-tune the pre-trained Llama-3.1-8B model with the UltraFeedback dataset (Cui et al., 2024). ...The dataset contains 61,135 training samples and 1,000 test samples. For the chatting task, we use the 805 test questions from the AlpacaEval dataset (Li et al., 2023). For the code generation task, there are 164 test questions for HumanEval. |
| Hardware Specification | Yes | All experiments are conducted using A800-80GB GPUs with the DeepSpeed distributed training framework, utilizing ZeRO-2 and gradient checkpointing without offloading. |
| Software Dependencies | No | All experiments are conducted using A800-80GB GPUs with the DeepSpeed distributed training framework, utilizing ZeRO-2 and gradient checkpointing without offloading. We use flash-attention-2 with deterministic backward for reproducibility. ...using Adam as the optimizer... |
| Experiment Setup | Yes | Set-up. We fine-tune the pre-trained Llama-3.1-8B model with the UltraFeedback dataset (Cui et al., 2024). ...we set the learning rate to 2×10⁻⁵, employing a cosine learning rate decay schedule, and use a macro batch size of 128. The maximum sequence length, encompassing both the prompt and response, is set to 2,048 tokens. Models are trained for three epochs. Detailed experimental settings are described in Appendix E. Appendix E: All experiments are conducted using A800-80GB GPUs... The experiments are based on the pretrained Llama3-8B model, using Adam as the optimizer with a global batch size of 128. Following (Yu et al., 2023; Liu et al., 2023; Cui et al., 2024), the learning rate is set to 2×10⁻⁵, with a warm-up ratio of 0.03 and cosine learning rate decay. Training is performed over 3 epochs. |
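The quoted setup (peak learning rate 2×10⁻⁵, warm-up ratio 0.03, cosine decay, global batch size 128, 3 epochs over 61,135 training samples) implies a learning-rate schedule along the lines of the following pure-Python sketch. The step count, function name, and linear-warmup shape are illustrative assumptions for this report, not details confirmed by the paper.

```python
import math

# Hyperparameters quoted in the paper's setup (Appendix E).
PEAK_LR = 2e-5
WARMUP_RATIO = 0.03
# ~61,135 samples / batch 128, rounded up, times 3 epochs -- an assumed step count.
TOTAL_STEPS = 3 * math.ceil(61_135 / 128)
WARMUP_STEPS = int(WARMUP_RATIO * TOTAL_STEPS)

def lr_at(step: int) -> float:
    """Assumed schedule: linear warmup over the first 3% of steps,
    then cosine decay from PEAK_LR down to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate rises from 0 to 2×10⁻⁵ during warmup, then decays monotonically to zero by the final step; frameworks such as DeepSpeed or HuggingFace Trainer implement equivalent warmup-plus-cosine schedules.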