On Calibration of LLM-based Guard Models for Reliable Content Moderation

Authors: Hongfu Liu, Hengguan Huang, Xiangming Gu, Hao Wang, Ye Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we empirically conduct comprehensive investigations of confidence calibration for 9 existing LLM-based guard models on 12 benchmarks in both user input and model output classification. Our findings reveal that current LLM-based guard models tend to 1) produce overconfident predictions, 2) exhibit significant miscalibration when subjected to jailbreak attacks, and 3) demonstrate limited robustness to the outputs generated by different types of response models. Additionally, we assess the effectiveness of post-hoc calibration methods to mitigate miscalibration. We demonstrate the efficacy of temperature scaling and, for the first time, highlight the benefits of contextual calibration for confidence calibration of guard models, particularly in the absence of validation sets.
Researcher Affiliation | Academia | Hongfu Liu (1), Hengguan Huang (2), Xiangming Gu (1), Hao Wang (3), Ye Wang (1); 1: National University of Singapore, 2: University of Copenhagen, 3: Rutgers University
Pseudocode | No | The paper describes methods in text but does not include any clearly labeled pseudocode or algorithm blocks. For example, Section 5.1 describes "Calibration Techniques" but does not format them as pseudocode.
Open Source Code | Yes | Our code is publicly available at https://github.com/Waffle-Liu/calibration_guard_model
Open Datasets | Yes | To assess calibration in the context of binary prompt classification, we evaluate performance using a range of public benchmarks, including OpenAI Moderation (Markov et al., 2023), ToxicChat Test (Lin et al., 2023), Aegis Safety Test (Ghosh et al., 2024), SimpleSafetyTests (Vidgen et al., 2023), XSTest (Röttger et al., 2023), HarmBench Prompt (Mazeika et al., 2024), and WildGuardMix Test Prompt (Han et al., 2024). For response classification, we utilize datasets comprising BeaverTails Test (Ji et al., 2024), SafeRLHF Test (Dai et al., 2023), HarmBench Response (Mazeika et al., 2024), and WildGuardMix Test Response (Han et al., 2024).
Dataset Splits | Yes | For temperature scaling, we utilize the XSTest set as the validation set to optimize the temperature due to its relatively small size. This optimized temperature value is then applied across all other datasets, as individual validation sets are not available for all examined datasets. Additional experiments using in-domain validation sets can be found in Appendix B.1. HarmBench Response (Mazeika et al., 2024): this dataset is a variant of the validation set used for fine-tuning the Llama2 variant from HarmBench, consisting of 602 responses generated by various models and jailbreak attacks. We use the pairs of their vanilla prompts and model responses with human labeling for response classification, resulting in a set of 596 pairs. BeaverTails Test (Ji et al., 2024): we utilize the test split of this dataset with 33.4k prompt-response pairs... We use a subset of 2k examples randomly sampled from the original test split to reduce the evaluation cost.
Hardware Specification | Yes | We run all evaluations on a single NVIDIA A40 GPU (48G).
Software Dependencies | No | We use PyTorch and Hugging Face Transformers in our implementation. Specific version numbers for PyTorch or Transformers are not provided.
Experiment Setup | Yes | We use M = 15 bins as in Guo et al. (2017) for all our ECE evaluations. For temperature scaling, we optimize T within the range (0, 5]. For batch calibration, we set the batch size to the size of the entire test set by default, following Zhou et al. (2023a). For prompt classification, we keep the original prompt lengths for most datasets except OpenAI Moderation, where we truncate a few samples with extremely long lengths to avoid out-of-memory errors; we cap the maximum length at 1800. For response classification, we keep the original prompt length for all datasets and set the maximum response length to 500.
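The setup described in the last row can be sketched concretely. Below is a minimal NumPy sketch (function names and the grid-search formulation are ours, not taken from the paper's released code) of the equal-width 15-bin ECE from Guo et al. (2017) and a temperature search over (0, 5] that minimizes validation NLL:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width-bin ECE as in Guo et al. (2017): a weighted average of
    |accuracy - confidence| gaps over the bins each prediction falls into."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

def fit_temperature(logits, labels, grid=np.linspace(0.01, 5.0, 500)):
    """Grid-search a scalar temperature T in (0, 5] that minimizes
    negative log-likelihood on a validation set (e.g. XSTest)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        z = logits / t
        z = z - z.max(axis=1, keepdims=True)  # stabilize the softmax
        probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t
```

The fitted temperature would then be applied to test-set logits before taking the softmax, as the paper does when transferring the XSTest-optimized temperature to the other benchmarks.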