Taming Overconfidence in LLMs: Reward Calibration in RLHF
Authors: Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our methods on both Llama3-8B and Mistral-7B across six diverse datasets including multiple-choice and open-ended generation. Experimental results demonstrate that both of our methods can reduce calibration error and maintain performance comparable to standard PPO. |
| Researcher Affiliation | Academia | 1Carnegie Mellon University, 2Washington University in St. Louis, 3UC Berkeley |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes methodologies using mathematical equations and diagrams, such as Figure 5, but not in a pseudocode format. |
| Open Source Code | Yes | Our code is publicly released. https://github.com/SeanLeng1/Reward-Calibration |
| Open Datasets | Yes | We use six datasets for evaluation: GSM8K (Cobbe et al., 2021), CommonsenseQA (Talmor et al., 2019), SciQ (Welbl et al., 2017), Object Counting from BIG-Bench (Srivastava et al., 2022), four Professional Knowledge datasets in MMLU (Hendrycks et al., 2020), and TruthfulQA (Lin et al., 2021). |
| Dataset Splits | Yes | We use the test split, which contains 1,319 samples. (GSM8K) We use the test split, containing 1,221 samples. (CommonsenseQA) We use the test split for evaluation, which includes 1,000 examples. (SciQ) For BIG-Bench, we focus on one subset, Object Counting, which includes 1,000 samples. For the Professional Knowledge category, we combine the test sets from four subsets: Professional Accounting, Professional Law, Professional Medicine, and Professional Teaching. We randomly select 20,480 prompts and integrate a confidence-query system prompt into 25% of single-turn prompts. |
| Hardware Specification | Yes | All training experiments are conducted on four A100 GPUs, and evaluations are carried out on one A100 GPU. |
| Software Dependencies | No | The paper mentions using 'Open RLHF' for training and 'gpt-4o-2024-08-06' for parsing, but it does not specify version numbers for general software dependencies like programming languages, frameworks (e.g., PyTorch), or other libraries used in their implementation. |
| Experiment Setup | Yes | We employ Open RLHF (Hu et al., 2024) for reward model and RLHF training. All training experiments are conducted on four A100 GPUs, and evaluations are carried out on one A100 GPU. Table 6: Hyperparameters for Reward Modeling. Table 7: Hyperparameters for Calibrating Llama3-8B-crm and Mistral-7B-crm. Table 8: Hyperparameters for PPO Training. Table 9: Hyperparameters for DPO and CDPO Training. |
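The headline result above is a reduction in calibration error. As a point of reference, a common way to measure this is Expected Calibration Error (ECE): bin predictions by stated confidence and compare each bin's average confidence against its empirical accuracy. The sketch below is a minimal, self-contained illustration of that metric; it is a hypothetical reconstruction, not the paper's evaluation code, and the paper may use a different binning scheme or metric variant.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE: confidences are floats in [0, 1], correct are bools.

    Returns a weighted average of |avg confidence - accuracy| over
    equal-width confidence bins. Lower is better-calibrated.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; the first bin also includes exactly 0.0.
        in_bin = [i for i, c in enumerate(confidences)
                  if (c > lo or (b == 0 and c == lo)) and c <= hi]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        acc = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(avg_conf - acc)
    return ece
```

For example, a model that always says 90% confident but is right only half the time gets an ECE of 0.4, which is the kind of overconfidence gap the paper's calibrated reward methods aim to shrink.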