Conformal Tail Risk Control for Large Language Model Alignment
Authors: Catherine Chen, Jingyan Shen, Zhun Deng, Lihua Lei
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we perform experiments to investigate the issue of human-machine misalignment by implementing our conformal distortion risk control method to mitigate toxicity of LLM-generated responses. This is a critical application, as toxic outputs may have severely negative impacts on impressionable populations and, moreover, may propagate across wide audiences, leading to misinformation and harm. 4.1. Experimental setup. Datasets and models. We randomly draw 10K prompts from the REALTOXICITYPROMPTS dataset (Gehman et al., 2020). For each selected prompt x_i, we generate 40 responses y_j(x_i) using the LLaMA2-7B model (Touvron et al., 2023). Given the initial responses, we apply the sequential algorithm described in Algorithm 2 to construct the candidate response sets C(x_i), ensuring the quality of the selected responses. Specifically, we use perplexity (PPL) to evaluate response quality, ROUGE-L to assess similarity between responses, and restrict the maximum set size to 32 as a stopping criterion. More details can be found in Appendix A.1. 4.2. Results. Realized risk and average cost analysis. Fig. 4 shows the realized CVaR_β of human scores and the average sampling cost with ρ = 0.57 on the held-out dataset as functions of α for β ∈ {0.5, 0.75, 0.9} with \|D\| = 6000. The panels in the first row show that all methods control the risk at the target level and that CDRC-L is the least conservative, as discussed in Section 3.4. As a result, it incurs the smallest deployment cost among all three methods. Moreover, as β increases, the advantage of CDRC-L becomes more prominent. Although CDRC-BJ improves upon CDRC-DKW due to its tighter bounds for extreme quantiles, it still underperforms CDRC-L. |
| Researcher Affiliation | Academia | ¹Institute for Computational and Mathematical Engineering, Stanford University, CA; ²Department of Industrial Engineering and Operations Research, Columbia University, NY; ³Department of Computer Science, UNC Chapel Hill, NC; ⁴Graduate School of Business, Stanford University, CA; ⁵Department of Statistics (by courtesy), Stanford University, CA. Correspondence to: Catherine Chen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Conformal distortion risk control; Algorithm 2: Generation of the candidate set C(x) |
| Open Source Code | Yes | Code repository can be found at https://github.com/jy-evangeline/DRC. |
| Open Datasets | Yes | We randomly draw 10K prompts from the REALTOXICITYPROMPTS dataset (Gehman et al., 2020). For each selected prompt x_i, we generate 40 responses y_j(x_i) using the LLaMA2-7B model (Touvron et al., 2023). Given the initial responses, we apply the sequential algorithm described in Algorithm 2 to construct the candidate response sets C(x_i), ensuring the quality of the selected responses. Specifically, we use perplexity (PPL) to evaluate response quality, ROUGE-L to assess similarity between responses, and restrict the maximum set size to 32 as a stopping criterion. More details can be found in Appendix A.1. To apply our method, we need a human toxicity score function r(·) and a machine toxicity score function r_m(·). Human-annotated data can be costly and time-consuming to acquire. To evaluate our method, we create a cheap semi-synthetic benchmark using an existing machine scoring model as the human annotator, and a biased model as the machine assessor. Specifically, we use the Detoxify model (Hanu & Unitary team, 2020) for r(·) and retrain the Detoxify model for r_m(·) on a biased subset of the Jigsaw Unintended Bias in Toxicity Classification dataset (cjadams et al., 2019) that consists of the c% most and least toxic instances. |
| Dataset Splits | Yes | Evaluation. We randomly split the prompts, using \|D\| ∈ {50, 100, 200, 1000, 6000} calibration prompts to determine the optimal threshold λ̂ and the remaining prompts as a held-out test dataset. For each method, after selecting λ̂, we deploy the calibrated model on the held-out dataset. |
| Hardware Specification | Yes | We use the original Detoxify model as a proxy for human annotator scores, then finetune this base model with various sample sizes, a learning rate of 0.0001, a batch size of 16, and a weight decay of 3 × 10⁻⁶ on a single Nvidia A40 GPU. |
| Software Dependencies | No | The paper mentions using the LLaMA2-7B model, Detoxify model, and Adam optimizer. However, it does not specify version numbers for any software libraries (e.g., PyTorch, TensorFlow, scikit-learn) or the specific version of Detoxify or LLaMA models used beyond their names and publication years. |
| Experiment Setup | Yes | For each selected prompt x_i, we generate 40 responses y_j(x_i) using the LLaMA2-7B model (Touvron et al., 2023). Given the initial responses, we apply the sequential algorithm described in Algorithm 2 to construct the candidate response sets C(x_i), ensuring the quality of the selected responses. Specifically, we use perplexity (PPL) to evaluate response quality, ROUGE-L to assess similarity between responses, and restrict the maximum set size to 32 as a stopping criterion. More details can be found in Appendix A.1. Toxicity scores. To apply our method, we need a human toxicity score function r(·) and a machine toxicity score function r_m(·). Human-annotated data can be costly and time-consuming to acquire. To evaluate our method, we create a cheap semi-synthetic benchmark using an existing machine scoring model as the human annotator, and a biased model as the machine assessor. Specifically, we use the Detoxify model (Hanu & Unitary team, 2020) for r(·) and retrain the Detoxify model for r_m(·) on a biased subset of the Jigsaw Unintended Bias in Toxicity Classification dataset (cjadams et al., 2019) that consists of the c% most and least toxic instances. The goal is to design r_m(·) with varying degrees of misalignment from r(·). In particular, we train three models for r_m(·) with c% ∈ {15%, 30%, 70%}. The Spearman correlation coefficients are 0.57, 0.68, 0.78, respectively. Choices of parameters. We consider both CVaR_β and VaR_β control with β ∈ {0.5, 0.75, 0.9}. We fix the confidence parameter 1 − δ = 0.95. To determine a reasonable target level α, we compute the empirical CVaR_q on human scores of all candidate responses with q ∈ {1%, 5%, 10%, 15%, 20%}; see Appendix A.3. This suggests a range of reasonable target levels. In particular, we consider α ∈ {0.15, 0.2, 0.25, 0.3, 0.35}. In Appendix A.1: Following the original implementation of the LLAMA-2-7B-HF model, we set the generation temperature at 0.8 and the top-p parameter at 0.95. In Appendix A.2: We finetune this base model with various sample sizes, a learning rate of 0.0001, a batch size of 16, and a weight decay of 3 × 10⁻⁶ on a single Nvidia A40 GPU. The Adam optimizer was employed with β₁ = 0.9, β₂ = 0.999, and ϵ = 10⁻⁸. |
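The extracted setup repeatedly references the empirical CVaR_β of toxicity scores (the quantity controlled in Fig. 4 and used to pick target levels α). As a minimal sketch of what that quantity is, here is a plain NumPy computation of empirical CVaR_β as the mean of the worst (1 − β) tail of scores, with higher scores meaning more toxic; the function name and the sample data are illustrative, not from the paper.

```python
import numpy as np

def empirical_cvar(scores, beta):
    """Empirical CVaR_beta: the mean of the scores at or above the
    empirical beta-quantile (the worst 1 - beta fraction).

    Higher scores are worse (more toxic), so CVaR averages the upper tail.
    """
    scores = np.asarray(scores, dtype=float)
    var = np.quantile(scores, beta)      # empirical VaR_beta (the quantile)
    return scores[scores >= var].mean()  # average of the tail beyond VaR

# Illustrative toxicity scores in [0, 1]
scores = [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.9, 0.95]
for beta in (0.5, 0.75, 0.9):
    print(beta, empirical_cvar(scores, beta))
```

Because CVaR averages an ever-smaller upper tail as β grows, it is non-decreasing in β, which matches the paper's observation that risk control is harder (and CDRC-L's advantage larger) at β = 0.9.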
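The setup describes a sequential candidate-set construction (the paper's Algorithm 2) using perplexity for quality, ROUGE-L for similarity, and a maximum size of 32 as a stopping criterion. The following is a hedged sketch of that kind of procedure, not the paper's exact algorithm: the greedy order, the `sim_thresh` deduplication cutoff, and the function names are all assumptions, and `similarity` stands in for a real ROUGE-L scorer.

```python
def build_candidate_set(responses, ppl, similarity, max_size=32, sim_thresh=0.7):
    """Hypothetical sequential candidate-set builder.

    responses:  list of generated response strings
    ppl:        dict mapping response -> perplexity (lower is better)
    similarity: fn(a, b) -> similarity score in [0, 1] (ROUGE-L-like)
    """
    selected = []
    # Consider responses in order of quality (lowest perplexity first).
    for r in sorted(responses, key=lambda r: ppl[r]):
        # Skip near-duplicates of anything already selected.
        if all(similarity(r, s) < sim_thresh for s in selected):
            selected.append(r)
        if len(selected) == max_size:  # stopping criterion from the setup
            break
    return selected
```

A cheap word-overlap score can stand in for ROUGE-L when trying this out; the paper's actual criteria are in its Appendix A.1.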
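The dataset-splits row describes randomly splitting the 10K prompts into a calibration set of size \|D\| (for choosing the threshold λ̂) and a held-out test set. A minimal sketch of that split, with an assumed fixed seed for reproducibility (the paper does not state one):

```python
import numpy as np

def calibration_split(n_prompts, cal_size, seed=0):
    """Randomly partition prompt indices into a calibration set of
    size cal_size and a disjoint held-out test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_prompts)
    return idx[:cal_size], idx[cal_size:]

# e.g. the largest split in the setup: |D| = 6000 of 10K prompts
cal_idx, test_idx = calibration_split(10_000, 6000)
```

The same helper covers the other reported sizes \|D\| ∈ {50, 100, 200, 1000} by changing `cal_size`.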