reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Statistical Hypothesis Testing for Auditing Robustness in Language Models

Authors: Paulius Rauba, Qiyao Wei, Mihaela Van Der Schaar

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate the usefulness of the framework across multiple case studies, showing how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models.
Researcher Affiliation	Academia	University of Cambridge. Correspondence to: Paulius Rauba <EMAIL>.
Pseudocode	Yes	Algorithm 1 Permutation Testing for Distribution-based Perturbation Analysis
Open Source Code	Yes	Code can be found at https://github.com/vanderschaarlab/dbpa
Open Datasets	No	The paper describes creating healthcare prompts with varying patient features and using LLMs to generate responses, but does not provide concrete access information (link, DOI, or specific citation) for a publicly available, pre-existing dataset used in its experiments.
Dataset Splits	No	The paper uses Monte Carlo sampling to generate outputs for analysis and permutation testing, rather than traditional dataset splits (training, testing, validation) for model training or evaluation.
Hardware Specification	No	The paper does not specify the hardware (e.g., specific GPU or CPU models, memory, or cloud instance types) used to run the experiments.
Software Dependencies	No	The paper mentions using 'ada-002 for most experiments' and 'Open AI embedding models' but does not provide a list of specific software dependencies with version numbers (e.g., Python version, library names with version numbers) required to replicate the experiments.
Experiment Setup	Yes	By default, we run the experiment over 5 seeds, and report the mean and standard deviation of the measurements. We calculate the distance measure ω, computed as the JSD distance between the null and alternative distributions, and the p-values. We define the finite sample approximations of the output distributions for an input $x \in X$ and its perturbation $x'$ as: $\hat{D}_x = \{y_i\}_{i=1}^{k},\ y_i \overset{\text{i.i.d.}}{\sim} S(x)$ and $\hat{D}_{x'} = \{y'_i\}_{i=1}^{k},\ y'_i \overset{\text{i.i.d.}}{\sim} S(x')$, where $k$ is the sample size. Algorithm 1... Require: Pooled vector $Z = (z_1, \ldots, z_{2k})$, similarity function s, discrepancy measure ω, number of permutations B.
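
The Experiment Setup entry names the ingredients of the permutation test (a pooled vector Z of 2k outputs, a similarity function s, a discrepancy measure ω, and B permutations) but elides the steps of Algorithm 1. The following is a minimal sketch of how such a test could be assembled. It is not the authors' implementation, which lives in the linked repository. The function names, the histogram binning, the use of cosine similarity for s, the toy 2-D vectors standing in for response embeddings, and the "+1" correction in the p-value are all assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def histogram(values, bins=10, lo=-1.0, hi=1.0):
    """Normalised histogram of values over [lo, hi] (binning is an assumption)."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        counts[max(0, min(idx, bins - 1))] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return math.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0))


def omega(z, k, s=cosine):
    """Discrepancy between within-group and cross-group similarity distributions.

    The first k entries of z are treated as samples for x, the last k as
    samples for the perturbed input (hypothetical layout of the pooled vector).
    """
    first, second = z[:k], z[k:]
    null_sims = [s(first[i], first[j]) for i in range(k) for j in range(i + 1, k)]
    alt_sims = [s(a, b) for a in first for b in second]
    return js_distance(histogram(null_sims), histogram(alt_sims))


def permutation_test(z, k, num_permutations=200, seed=0):
    """Return (observed omega, permutation p-value) for the pooled vector z."""
    rng = random.Random(seed)
    observed = omega(z, k)
    exceed = 0
    for _ in range(num_permutations):
        shuffled = z[:]
        rng.shuffle(shuffled)  # break the link between samples and inputs
        if omega(shuffled, k) >= observed:
            exceed += 1
    # "+1" correction keeps the p-value strictly positive (an assumption here)
    return observed, (1 + exceed) / (1 + num_permutations)


# Toy usage: two clearly separated groups of 2-D "embeddings".
k = 8
group_x = [[1.0, 0.01 * i] for i in range(1, k + 1)]
group_x_prime = [[0.01 * i, 1.0] for i in range(1, k + 1)]
omega_obs, p_value = permutation_test(group_x + group_x_prime, k)
print(omega_obs, p_value)
```

In practice the toy vectors would be replaced by embeddings of the k sampled responses for each input, and the hand-written `js_distance` could be swapped for `scipy.spatial.distance.jensenshannon` with `base=2`. Repeating the procedure over several seeds, as the entry describes, would then give a mean and standard deviation for ω and the p-value.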