Confidential Guardian: Cryptographically Prohibiting the Abuse of Model Abstention

Authors: Stephan Rabanser, Ali Shahin Shamsabadi, Olive Franzese, Xiao Wang, Adrian Weller, Nicolas Papernot

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We empirically validate the following key contributions: effectiveness of Mirage in inducing uncertainty; effectiveness of Confidential Guardian in detecting dishonest, artificially induced uncertainty; efficiency of Confidential Guardian in proving the ZK ECE constraint. Experiments cover the following datasets: synthetic Gaussian mixture; image classification (CIFAR-100, UTKFace); tabular data (Credit, Adult).
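As a rough illustration of how a Mirage-style fine-tuning target could reduce confidence only in a designated uncertainty region, the sketch below builds a near-uniform soft label with a small margin ε toward the true class. This is an assumption for illustration (the function name and exact target construction are not from the paper; Mirage's actual loss may differ):

```python
import numpy as np

def mirage_target(num_classes, true_class, eps=0.15):
    """Hypothetical low-confidence soft label for a sample inside the
    chosen uncertainty region: near-uniform over classes, with the true
    class raised by a margin eps so the prediction stays correct but
    carries artificially low confidence. Illustrative only."""
    target = np.full(num_classes, (1.0 - eps) / num_classes)
    target[true_class] += eps
    return target
```

Fine-tuning against such a target (e.g., with a cross-entropy or KL loss) would lower the model's reported confidence on the region while leaving behavior elsewhere intact, matching the threat model the review describes.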
Researcher Affiliation | Collaboration | Stephan Rabanser (1,2), Ali Shahin Shamsabadi (3), Olive Franzese (2), Xiao Wang (4), Adrian Weller (5,6), Nicolas Papernot (1,2). Affiliations: 1. University of Toronto; 2. Vector Institute; 3. Brave Software; 4. Northwestern University; 5. University of Cambridge; 6. The Alan Turing Institute. Correspondence to: Stephan Rabanser <EMAIL>.
Pseudocode | Yes | Algorithm 1: Zero-Knowledge Proof of Well-Calibratedness.
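For intuition, the quantity such a proof attests to is a standard binned expected calibration error (ECE). The sketch below computes it in the clear; the paper's Algorithm 1 establishes the same kind of bound inside a zero-knowledge protocol, so this plaintext version is only an assumed analogue, not the paper's circuit:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: average |confidence - accuracy| over equal-width
    confidence bins, weighted by the fraction of samples per bin."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # First bin is closed on the left so confidence 0.0 is counted.
        mask = (conf >= lo) & (conf <= hi) if i == 0 else (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - corr[mask].mean())
    return ece
```

A verifier-side check would then amount to asserting `expected_calibration_error(...) <= tau` on the reference dataset for some agreed threshold.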
Open Source Code | Yes | We make our code available at https://github.com/cleverhans-lab/confidential-guardian.
Open Datasets | Yes | Image classification (Figure 4): extending beyond the synthetic experiments, results are reported on CIFAR-100 (Krizhevsky et al., 2009) and UTKFace (Zhang et al., 2017). Tabular data (Figure 5): Mirage and Confidential Guardian are also tested on Credit (Hofmann, 1994) and Adult (Becker & Kohavi, 1996; Ding et al., 2021).
Dataset Splits | No | The paper mentions a 'full test set' for accuracy evaluation and a 'reference dataset D_ref' for calibration checks. For the synthetic Gaussian mixture it specifies 1,000 samples each from classes 1 and 2 and 100 samples from class 3. However, it provides no explicit train/validation/test split percentages or sample counts for CIFAR-100, UTKFace, Credit, or Adult in the main text, nor does it cite standard splits for all of them.
Hardware Specification | No | ZKP benchmarks are run by locally simulating the prover and verifier on a MacBook Pro with an M1 chip. This specifies hardware only for the ZKP benchmarks, not for the main model training and evaluation experiments; the paper otherwise refers generally to 'compute infrastructure' without concrete models or configurations.
Software Dependencies | No | The ZK protocol is implemented in emp-toolkit (Wang et al., 2016), and performance on the image classification datasets is estimated with a combination of emp-toolkit and Mystique (Weng et al., 2021b); the authors report that Confidential Guardian achieves low runtime and communication costs. While these tools are named and cited, no version numbers are provided for emp-toolkit, Mystique, or any other core software libraries (e.g., Python, PyTorch/TensorFlow).
Experiment Setup | No | The model owner first trains a baseline model f_θ by minimizing the cross-entropy loss L_CE on the entire dataset, disregarding the uncertainty region, then calibrates the model with temperature scaling (Guo et al., 2017) so that its predictions are reliable. The model is subsequently fine-tuned with Mirage using a particular ε to reduce confidence only in the chosen uncertainty region; the authors report that ε ∈ [0.1, 0.2] delivers good results. While this describes the overall training strategy and some model architectures (ResNet-18, ResNet-50, shallow neural networks), it lacks specific hyperparameters such as learning rates, batch sizes, optimizers, and epoch counts needed for reproducibility.
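The temperature-scaling calibration step referenced above (Guo et al., 2017) can be sketched as fitting a single scalar T on held-out logits. The grid-search fit below is a simplified stand-in for the LBFGS optimization typically used; the function names and grid are assumptions for illustration:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature minimizing negative log-likelihood on a
    held-out set; T > 1 softens overconfident predictions."""
    labels = np.asarray(labels)
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        probs = softmax(logits, T)
        nll = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T
```

At inference the calibrated probabilities are `softmax(logits, best_T)`; the fitted T leaves the argmax (and hence accuracy) unchanged while adjusting confidence.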