Aligning Language Models Using Follow-up Likelihood as Reward Signal

Authors: Chen Zhang, Dading Chong, Feng Jiang, Chengguang Tang, Anningzhe Gao, Guohua Tang, Haizhou Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section presents the results and analysis of the reward modeling and alignment experiments. Main Benchmark Results: Table 3 presents the accuracy scores (%) and Pearson correlations of various reward models on 8 pairwise preference datasets and 4 rating-based helpfulness datasets, respectively. First, we observe that FLR performs significantly better than directly using the response likelihood. For instance, the average pairwise preference accuracy of FLR (Llama-3-8B-Instruct) is 66.44%, compared to 56.26% for Direct (Llama-3-8B-Instruct), a gap of roughly 10 percentage points. Additionally, the average Pearson correlation achieved by FLR (Llama-3-8B-Instruct) is 0.327 points higher than that of Direct (Llama-3-8B-Instruct).
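The comparison in this row (FLR vs. "Direct") can be sketched in a few lines: FLR scores a response by how likely the model finds a canned positive follow-up utterance, rather than by the likelihood of the response itself. All names and the follow-up template below are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of follow-up-likelihood reward (FLR) vs. direct likelihood.
FOLLOW_UP = "Thanks, that was really helpful!"  # assumed positive follow-up template

def mean_logprob(token_logprobs):
    """Length-normalised log-likelihood of a token sequence."""
    return sum(token_logprobs) / len(token_logprobs)

def flr_reward(followup_token_logprobs):
    # FLR: score a response by the likelihood the model assigns to a
    # positive follow-up utterance, conditioned on (prompt, response).
    return mean_logprob(followup_token_logprobs)

def direct_reward(response_token_logprobs):
    # Baseline ("Direct"): likelihood of the response itself given the prompt.
    return mean_logprob(response_token_logprobs)

# Toy per-token log-probs of FOLLOW_UP after two candidate responses:
after_good_response = [-0.2, -0.1, -0.3]  # follow-up is likely -> high reward
after_bad_response = [-1.5, -2.0, -1.8]   # follow-up is unlikely -> low reward
assert flr_reward(after_good_response) > flr_reward(after_bad_response)
```

The per-token log-probs here are toy numbers; in practice they would come from a forward pass of the reward model over the concatenated (prompt, response, follow-up) sequence.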
Researcher Affiliation | Collaboration | Chen Zhang (1), Dading Chong (2), Feng Jiang (3,4,5)*, Chengguang Tang (6), Anningzhe Gao (4), Guohua Tang (6), Haizhou Li (1,3,4). Affiliations: (1) National University of Singapore, Singapore; (2) Peking University, China; (3) The Chinese University of Hong Kong, Shenzhen, China; (4) Shenzhen Research Institute of Big Data, China; (5) University of Science and Technology of China, China; (6) Tencent AI Lab, China. EMAIL, EMAIL
Pseudocode | No | The paper describes its methodology using mathematical formulations and textual descriptions, but it does not include any clearly labeled pseudocode or algorithm blocks. For example, Section 3 'Methodology' outlines the FLR procedure and alignment process in paragraph form.
Open Source Code | Yes | Repository available at https://github.com/e0397123/FLR.
Open Datasets | Yes | For M, we experiment with Llama-3-8B-Instruct and Qwen2-7B-Instruct. We utilize the UltraFeedback (Cui et al. 2023) dataset for rewriting, which is then employed to fine-tune M. ... We adopt the Nectar dataset (Zhu et al. 2023) as the prompt source. The 183K prompts in Nectar are a mixture of diverse sources, including lmsys-chat-1M (Zheng et al. 2023b), ShareGPT, Anthropic/HH-RLHF (Bai et al. 2022), UltraFeedback, Evol-Instruct (Xu et al. 2023b), and Flan (Longpre et al. 2023). ... We assess the reward models using eight pairwise preference benchmarks and four rating-based single-response benchmarks, with their statistics in Table 2. Accuracy is reported for the pairwise benchmarks and Pearson correlation for the rating-based benchmarks. Additionally, we examine the reward models' ability to rank different LLMs by providing system-level correlations between the reward model scores and real-user ELO ratings from the LMSys Chatbot Arena (Chiang et al. 2024). We obtain the Arena dataset from Lin et al. (2024); it consists of 865 data instances per LLM for a total of 30 LLMs, including GPT-4-turbo (OpenAI 2023), Meta-Llama-3-70B-Instruct, and Mixtral-8x7B-Instruct (Jiang et al. 2024). To evaluate FLR's contribution to helpfulness alignment, we use well-established benchmarks including Alpaca-Eval V2 (Li et al. 2023), WildBench V2 (Lin et al. 2024), and FLASK (Ye et al. 2024).
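The two evaluation protocols in this row, accuracy on pairwise preference benchmarks and Pearson correlation on rating-based benchmarks, can be sketched with plain Python. The data values below are toy numbers for illustration, not benchmark results.

```python
import math

def pairwise_accuracy(pairs):
    """Fraction of (chosen, rejected) reward pairs the model ranks correctly."""
    hits = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return hits / len(pairs)

def pearson(xs, ys):
    """Pearson correlation between reward-model scores and human ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy reward pairs: (score of chosen response, score of rejected response).
pairs = [(0.9, 0.2), (0.4, 0.7), (0.8, 0.1), (0.6, 0.5)]
print(pairwise_accuracy(pairs))  # 3 of 4 pairs ranked correctly -> 0.75

# Toy rating-based benchmark: model rewards vs. human helpfulness ratings.
rewards = [0.1, 0.4, 0.5, 0.9]
ratings = [1, 2, 4, 5]
print(round(pearson(rewards, ratings), 3))
```

The system-level correlations against Chatbot Arena ELO mentioned above follow the same pattern, with one aggregated reward score per LLM instead of per response.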
Dataset Splits | Yes | UltraFeedback contains 255,548 instances of (p_i, r_i, l_i) and we randomly sample 100K for fine-tuning. ... For our experiments, we randomly sampled 100K instruction prompts from Nectar. ... We assess the reward models using eight pairwise preference benchmarks and four rating-based single-response benchmarks, with their statistics in Table 2. ... Alpaca-Eval V2, WildBench V2, and FLASK contain 805, 1024, and 1700 instruction prompts respectively. WildBench V2 involves pairwise comparisons with the base policy model as the reference. For Alpaca-Eval V2, we report the length-controlled win rate against GPT-4 Preview (11/06) as per standard protocol, and the alpaca_eval_gpt4_turbo_fn annotator config is adopted. For FLASK, the GPT-4 evaluator is adopted to rate the quality of the model responses according to specific prompts.
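The 100K random subsample from UltraFeedback's 255,548 triples described in this row can be reproduced with stdlib Python. The seed below is an assumption for illustration; the paper does not state one.

```python
import random

# Hedged sketch of the 100K subsampling step; the seed is illustrative only.
random.seed(42)  # assumed seed, not stated in the paper
corpus = list(range(255_548))                  # stand-in for the full dataset
train_subset = random.sample(corpus, 100_000)  # 100K instances for fine-tuning
assert len(train_subset) == 100_000
assert len(set(train_subset)) == 100_000       # sampling without replacement
```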
Hardware Specification | Yes | LoRA (Hu et al. 2022) is applied to all fine-tuning experiments on a single NVIDIA A100 80GB GPU.
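A back-of-the-envelope calculation shows why LoRA makes single-GPU fine-tuning feasible: low-rank adapter factors replace full weight updates. The hidden size below matches Llama-3-8B; the rank r=8 is an assumed, typical default, not a value stated in the paper.

```python
# Trainable-parameter count per weight matrix: full update vs. LoRA adapters.
# LoRA trains B (d_out x r) and A (r x d_in) instead of W (d_out x d_in).
d_out, d_in, r = 4096, 4096, 8  # Llama-3-8B hidden size; rank r=8 is assumed
full = d_out * d_in
lora = r * (d_out + d_in)
print(f"trainable params: {lora:,} vs {full:,} "
      f"({100 * lora / full:.2f}% of full fine-tuning)")
```

Summed over all adapted matrices, this reduction is what lets the optimizer states and gradients fit alongside the frozen 8B base model in 80GB of memory.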
Software Dependencies | No | The paper mentions specific models such as Llama-3-8B-Instruct and Qwen2-7B-Instruct, frameworks such as LLaMA-Factory (Zheng et al. 2024), and techniques such as DPO and KTO fine-tuning. However, it does not specify version numbers for these software components or other libraries (e.g., Python or PyTorch versions). Although LLaMA-Factory is mentioned, its version is not provided.
Experiment Setup | No | The experimental settings for DPO and KTO fine-tuning follow the implementations from LLaMA-Factory (Zheng et al. 2024). ... For our experiments, we randomly sampled 100K instruction prompts from Nectar. Full details on reproducibility can be found in Appendix E. The paper refers to an external implementation for experimental settings and mentions that full details are in an appendix, but it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations within the main text.
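For context on the DPO fine-tuning referenced in this row, the per-pair objective can be written in a few lines. This is the standard DPO loss (Rafailov et al. 2023), sketched here from sequence log-likelihoods; it is not LLaMA-Factory's implementation, and beta=0.1 is a common default, not a value reported by the paper.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence
    log-likelihoods under the policy (pi_*) and the frozen reference (ref_*)."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With zero margin the loss is log(2); it shrinks as the policy prefers the
# chosen response more strongly than the reference model does.
assert abs(dpo_loss(-10, -12, -10, -12) - math.log(2)) < 1e-9
assert dpo_loss(-8, -14, -10, -12) < dpo_loss(-10, -12, -10, -12)
```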