On Calibration of LLM-based Guard Models for Reliable Content Moderation
Authors: Hongfu Liu, Hengguan Huang, Xiangming Gu, Hao Wang, Ye Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we empirically conduct comprehensive investigations of confidence calibration for 9 existing LLM-based guard models on 12 benchmarks in both user input and model output classification. Our findings reveal that current LLM-based guard models tend to 1) produce overconfident predictions, 2) exhibit significant miscalibration when subjected to jailbreak attacks, and 3) demonstrate limited robustness to the outputs generated by different types of response models. Additionally, we assess the effectiveness of post-hoc calibration methods to mitigate miscalibration. We demonstrate the efficacy of temperature scaling and, for the first time, highlight the benefits of contextual calibration for confidence calibration of guard models, particularly in the absence of validation sets. |
| Researcher Affiliation | Academia | Hongfu Liu¹, Hengguan Huang², Xiangming Gu¹, Hao Wang³, Ye Wang¹ (¹National University of Singapore, ²University of Copenhagen, ³Rutgers University) |
| Pseudocode | No | The paper describes methods in text but does not include any clearly labeled pseudocode or algorithm blocks. For example, Section 5.1 describes "Calibration Techniques" but does not format them as pseudocode. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/Waffle-Liu/calibration_guard_model |
| Open Datasets | Yes | To assess calibration in the context of binary prompt classification, we evaluate performance using a range of public benchmarks, including OpenAI Moderation (Markov et al., 2023), ToxicChat Test (Lin et al., 2023), Aegis Safety Test (Ghosh et al., 2024), SimpleSafetyTests (Vidgen et al., 2023), XSTest (Röttger et al., 2023), HarmBench Prompt (Mazeika et al., 2024) and WildGuardMix Test Prompt (Han et al., 2024). For the response classification, we utilize datasets containing BeaverTails Test (Ji et al., 2024), SafeRLHF Test (Dai et al., 2023), HarmBench Response (Mazeika et al., 2024), and WildGuardMix Test Response (Han et al., 2024). |
| Dataset Splits | Yes | For temperature scaling, we utilize the XSTest set as the validation set to optimize the temperature due to its relatively small size. This optimized temperature value is then applied across all other datasets, as individual validation sets are not available for all examined datasets. Additional experiments using in-domain validation sets can be found in Appendix B.1. HarmBench Response (Mazeika et al., 2024): this dataset refers to a variant of the validation set used for fine-tuning the Llama2-variant from HarmBench, which consists of 602 responses generated by various models and jailbreak attacks. We use the pairs of their vanilla prompts and model responses with human labeling for response classification, resulting in a set of 596 pairs. BeaverTails Test (Ji et al., 2024): we utilize the test split of this dataset with 33.4k prompt-response pairs... We use a subset of 2k size randomly sampled from the original test split to reduce the evaluation cost. |
| Hardware Specification | Yes | We run all evaluations on a single NVIDIA A40 GPU (48G). |
| Software Dependencies | No | We use PyTorch and Hugging Face Transformers in our implementation. Specific version numbers for PyTorch or Transformers are not provided. |
| Experiment Setup | Yes | We use M = 15 bins as in Guo et al. (2017) for all our ECE evaluations. For temperature scaling, we optimize T within the range (0, 5]. For batch calibration, we set the batch size as the size of the entire test set by default, following Zhou et al. (2023a). For prompt classification, we keep the original prompt lengths for most datasets except OpenAI Moderation, where we truncate a few samples with extremely long lengths to avoid out-of-memory errors. We keep the maximum length as 1800. For response classification, we keep the original prompt length for all datasets and set the maximum response length as 500. |
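The two quantities the Experiment Setup row pins down, ECE computed with M = 15 equal-width bins (Guo et al., 2017) and a temperature T fitted on a validation set within (0, 5], can be sketched as below. This is a minimal NumPy illustration, not the authors' code; the paper does not specify the optimizer for T, so a grid search minimizing validation NLL is assumed here, and all function names are hypothetical.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE with M equal-width bins as in Guo et al. (2017):
    sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi]; a confidence of exactly 0 is ignored here.
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        ece += mask.sum() / n * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def fit_temperature(logits, labels, grid=np.linspace(0.05, 5.0, 100)):
    """Pick T in (0, 5] minimizing negative log-likelihood on a held-out
    validation set (XSTest in the paper). Grid search is an assumption;
    any 1-D optimizer over the same range would do."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        z = logits / t
        z = z - z.max(axis=1, keepdims=True)          # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        nll = -log_probs[np.arange(len(labels)), labels].mean()
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t
```

A perfectly calibrated bin (confidence 0.8, accuracy 0.8) contributes zero ECE, while overconfident logits on a validation set with errors push the fitted temperature above 1, which is the softening effect the paper relies on.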