Taming Overconfidence in LLMs: Reward Calibration in RLHF
Authors: Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our methods on both Llama3-8B and Mistral-7B across six diverse datasets including multiple-choice and open-ended generation. Experimental results demonstrate that both of our methods can reduce calibration error and maintain performance comparable to standard PPO. |
| Researcher Affiliation | Academia | 1Carnegie Mellon University, 2Washington University in St. Louis, 3UC Berkeley |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes methodologies using mathematical equations and diagrams, such as Figure 5, but not in a pseudocode format. |
| Open Source Code | Yes | Our code is publicly released. https://github.com/SeanLeng1/Reward-Calibration |
| Open Datasets | Yes | We use six datasets for evaluation: GSM8K (Cobbe et al., 2021), CommonsenseQA (Talmor et al., 2019), SciQ (Welbl et al., 2017), Object Counting from BIG-Bench (Srivastava et al., 2022), four Professional Knowledge datasets in MMLU (Hendrycks et al., 2020), and TruthfulQA (Lin et al., 2021). |
| Dataset Splits | Yes | We use the test split, which contains 1,319 samples. (GSM8K) We use the test split, containing 1,221 samples. (CommonsenseQA) We use the test split for evaluation, which includes 1,000 examples. (SciQ) For BIG-Bench, we focus on one subset, Object Counting, which includes 1,000 samples. For the Professional Knowledge category, we combine the test sets from four subsets: Professional Accounting, Professional Law, Professional Medicine, and Professional Teaching. We randomly select 20,480 prompts and integrate a confidence-query system prompt into 25% of single-turn prompts. |
| Hardware Specification | Yes | All training experiments are conducted on four A100 GPUs, and evaluations are carried out on one A100 GPU. |
| Software Dependencies | No | The paper mentions using 'Open RLHF' for training and 'gpt-4o-2024-08-06' for parsing, but it does not specify version numbers for general software dependencies like programming languages, frameworks (e.g., PyTorch), or other libraries used in their implementation. |
| Experiment Setup | Yes | We employ Open RLHF (Hu et al., 2024) for reward model and RLHF training. All training experiments are conducted on four A100 GPUs, and evaluations are carried out on one A100 GPU. Table 6: Hyperparameters for Reward Modeling. Table 7: Hyperparameters for Calibrating Llama3-8B-crm and Mistral-7B-crm. Table 8: Hyperparameters for PPO Training. Table 9: Hyperparameters for DPO and CDPO Training. |
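The headline result above is a reduction in calibration error. As a point of reference, a common way to measure this is Expected Calibration Error (ECE): bin predictions by stated confidence and compare each bin's average confidence against its empirical accuracy. The sketch below is a minimal, self-contained illustration of that metric; it is a hypothetical reconstruction, not the paper's evaluation code, and the paper may use a different binning scheme or metric variant.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE: confidences are floats in [0, 1], correct are bools.

    Returns a weighted average of |avg confidence - accuracy| over
    equal-width confidence bins. Lower is better-calibrated.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; the first bin also includes exactly 0.0.
        in_bin = [i for i, c in enumerate(confidences)
                  if (c > lo or (b == 0 and c == lo)) and c <= hi]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        acc = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(avg_conf - acc)
    return ece
```

For example, a model that always says 90% confident but is right only half the time gets an ECE of 0.4, which is the kind of overconfidence gap the paper's calibrated reward methods aim to shrink.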