Certifying Counterfactual Bias in LLMs
Authors: Isha Chaudhary, Qian Hu, Manoj Kumar, Morteza Ziyadi, Rahul Gupta, Gagandeep Singh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): We used 2 A100 GPUs, each with 40 GB of VRAM. The queries on which the specifications for the 3 prefix distributions of Section 4 are pivoted are derived from two popular fairness and bias assessment datasets, BOLD (Dhamala et al., 2021) and Decoding Trust (Wang et al., 2024). |
| Researcher Affiliation | Collaboration | 1 UIUC, 2 Amazon, 3 Oracle Health |
| Pseudocode | Yes | Algorithm 1: Prefix specification. Input: L, Q; Output: C( , D, L) ... Algorithm 2: Make random prefix ... Algorithm 3: Make mixture of jailbreak prefix ... Algorithm 4: Make soft prefix |
| Open Source Code | Yes | Our implementation is available at https://github.com/uiuc-focal-lab/LLMCert-B, and we provide guidelines for practitioners on using our framework in Appendix A. |
| Open Datasets | Yes | The queries on which the specifications for the 3 prefix distributions of Section 4 are pivoted are derived from two popular fairness and bias assessment datasets, BOLD (Dhamala et al., 2021) and Decoding Trust (Wang et al., 2024). |
| Dataset Splits | Yes | BOLD setup. BOLD is a dataset of partial sentences designed to demonstrate bias in LLM generations in common situations. We randomly pick a test set of 250 samples from BOLD's profession partition and demonstrate binary gender bias specifications and certificates on it. ... Decoding Trust setup. ... We make specifications from all 48 statements in the stereotypes partition for the demographic groups corresponding to race (black/white). |
| Hardware Specification | Yes | We used 2 A100 GPUs, each with 40 GB of VRAM. |
| Software Dependencies | No | The paper mentions open-sourcing its implementation and refers to general LLMs like GPT-4, Llama-2-chat, etc., but does not provide specific version numbers for software components like Python, PyTorch, or CUDA used in their own implementation of LLMCert-B. |
| Experiment Setup | Yes | The values of the certification parameters used in our experiments are given in Table 2 (Appendix E). We study their effect on the certification results with an ablation study in Appendix E. We generate the certification bounds with 95% confidence and 50 samples. |
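The "95% confidence and 50 samples" setting can be illustrated with a minimal sketch. It makes two assumptions that the table does not state:

- Each of the 50 samples yields one binary outcome: whether the LLM's responses to one sampled, prefixed counterfactual prompt set were judged unbiased.
- The certification bounds are an exact binomial (Clopper-Pearson) interval on the probability of an unbiased outcome.

The names `clopper_pearson`, `certify`, and `binom_cdf` are hypothetical and are not taken from the LLMCert-B repository.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, iters=60):
    """Bisection for the root of an increasing function f on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, confidence=0.95):
    """Exact two-sided binomial interval for k successes out of n trials."""
    alpha = 1 - confidence
    # Lower bound: the p at which P(X >= k | p) = alpha / 2 (increasing in p).
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper bound: the p at which P(X <= k | p) = alpha / 2 (decreasing in p, so negate).
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper


def certify(unbiased_flags, confidence=0.95):
    """Hypothetical helper: bound P(unbiased) from per-sample judge outcomes."""
    n = len(unbiased_flags)
    k = sum(1 for flag in unbiased_flags if flag)
    return clopper_pearson(k, n, confidence)
```

For example, `certify([True] * 45 + [False] * 5)` returns a `(lower, upper)` pair that brackets the empirical rate 0.9. The sketch solves for each bound by bisection on the binomial CDF so that it needs only the standard library. With SciPy available, `scipy.stats.binomtest(k, n).proportion_ci(confidence_level=0.95, method="exact")` computes the same interval.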