Taming Overconfidence in LLMs: Reward Calibration in RLHF

Authors: Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our methods on both Llama3-8B and Mistral-7B across six diverse datasets, covering multiple-choice and open-ended generation tasks. Experimental results demonstrate that both of our methods can reduce calibration error and maintain performance comparable to standard PPO.
Researcher Affiliation | Academia | 1Carnegie Mellon University, 2Washington University in St. Louis, 3UC Berkeley
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes methodologies using mathematical equations and diagrams, such as Figure 5, but not in pseudocode form.
Open Source Code | Yes | Our code is publicly released at https://github.com/SeanLeng1/Reward-Calibration
Open Datasets | Yes | We use six datasets for evaluation: GSM8K (Cobbe et al., 2021), CommonsenseQA (Talmor et al., 2019), SciQ (Welbl et al., 2017), Object Counting from BIG-Bench (Srivastava et al., 2022), four Professional Knowledge datasets in MMLU (Hendrycks et al., 2020), and TruthfulQA (Lin et al., 2021).
Dataset Splits | Yes | GSM8K: we use the test split, which contains 1,319 samples. CommonsenseQA: we use the test split, containing 1,221 samples. SciQ: we use the test split for evaluation, which includes 1,000 examples. Object Counting: we focus on one subset, Object Counting, which includes 1,000 samples. Professional Knowledge: we combine the test sets from four subsets: Professional Accounting, Professional Law, Professional Medicine, and Professional Teaching. For training, we randomly select 20,480 prompts and integrate a confidence-query system prompt into 25% of single-turn prompts.
Hardware Specification | Yes | All training experiments are conducted on four A100 GPUs, and evaluations are carried out on one A100 GPU.
Software Dependencies | No | The paper mentions using OpenRLHF for training and gpt-4o-2024-08-06 for parsing, but it does not specify version numbers for general software dependencies such as programming languages, frameworks (e.g., PyTorch), or other libraries used in the implementation.
Experiment Setup | Yes | We employ OpenRLHF (Hu et al., 2024) for reward model and RLHF training. All training experiments are conducted on four A100 GPUs, and evaluations are carried out on one A100 GPU. Hyperparameters are listed in Table 6 (reward modeling), Table 7 (calibrating Llama3-8B-crm and Mistral-7B-crm), Table 8 (PPO training), and Table 9 (DPO and CDPO training).
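The "reduce calibration error" claim quoted above is typically measured with Expected Calibration Error (ECE). The paper's exact evaluation code is in its repository; the sketch below is a minimal, generic ECE implementation (equal-width bins, size-weighted |accuracy − confidence| gap), not the authors' own code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: partition predictions into equal-width confidence bins and
    take the size-weighted average of |bin accuracy - bin mean confidence|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are half-open (lo, hi]; a confidence of exactly 0.0 is ignored.
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if idx:
            acc = sum(correct[i] for i in idx) / len(idx)
            conf = sum(confidences[i] for i in idx) / len(idx)
            ece += len(idx) / n * abs(acc - conf)
    return ece

# Ten answers all stated at 0.9 confidence with 9 correct are
# perfectly calibrated, so ECE is 0.
print(expected_calibration_error([0.9] * 10, [1] * 9 + [0]))
```

An overconfident model (high stated confidence, lower accuracy) pushes ECE toward the confidence–accuracy gap, which is the quantity the paper's calibrated reward models aim to shrink.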
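The Dataset Splits row describes selecting 20,480 prompts and attaching a confidence-query system prompt to 25% of single-turn prompts. A minimal sketch of that mixing step is below; `CONFIDENCE_QUERY` is a hypothetical stand-in for the paper's actual system prompt, whose wording is not reproduced here.

```python
import random

# Hypothetical placeholder; the paper's exact confidence-query wording differs.
CONFIDENCE_QUERY = "After answering, state your confidence on a scale of 0 to 10."

def build_training_prompts(prompt_pool, n_select=20480, query_frac=0.25, seed=0):
    """Randomly select n_select prompts and attach the confidence-query
    system prompt to query_frac of them, per the quoted setup."""
    rng = random.Random(seed)
    selected = rng.sample(prompt_pool, min(n_select, len(prompt_pool)))
    out = []
    for p in selected:
        system = CONFIDENCE_QUERY if rng.random() < query_frac else ""
        out.append({"system": system, "user": p})
    return out
```

Keeping 75% of prompts free of the confidence query preserves ordinary reward-model behavior while still exposing training to confidence-stated responses.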
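The Experiment Setup row points to hyperparameter tables that are not reproduced in this report. As a shape-only illustration, a PPO configuration for an OpenRLHF-style run might be organized as below; every value is a hypothetical placeholder, and the real settings are in Tables 6-9 of the paper.

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    """All values are hypothetical placeholders, not the paper's settings
    (those are in Tables 6-9)."""
    actor_lr: float = 1e-6        # policy learning rate
    critic_lr: float = 1e-5      # value-head learning rate
    kl_coef: float = 0.05        # KL penalty against the reference policy
    rollout_batch_size: int = 1024
    micro_batch_size: int = 8
    max_new_tokens: int = 1024

cfg = PPOConfig()
```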