Robust Conformal Outlier Detection under Contaminated Reference Data
Authors: Meshi Bashari, Matteo Sesia, Yaniv Romano
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, in Section 4, we empirically validate our theory and proposed data-cleaning approach through comprehensive experiments on real-world datasets. The experiments confirm that conformal inference with contaminated data tends to be conservative. Furthermore, they demonstrate that our method significantly boosts power, particularly when the target type-I error rate is low and the number of outliers in the contaminated set is small. |
| Researcher Affiliation | Academia | 1Department of Electrical and Computer Engineering, Technion IIT, Haifa, Israel 2Department of Data Sciences and Operations, University of Southern California, Los Angeles, California, USA 3Department of Computer Science, University of Southern California, Los Angeles, California, USA 4Department of Computer Science, Technion IIT, Haifa, Israel. Correspondence to: Meshi Bashari <EMAIL>. |
| Pseudocode | Yes | Algorithms 1 and 2 summarize this procedure, which intuitively offers advantages over both the standard method for computing p̂_{n+1} in (2), by potentially increasing power, and the Naive-Trim approach, by mitigating the risk of over-correcting p̂_{n+1}. A schematic illustration of the construction of D^LT_cal by Algorithm 1 is shown in Figure 2. Algorithm 1 (label-trim calibration, construction phase): 1: Input: labeling budget m; contaminated calibration set D_cal = {X_i}_{i=1}^{n}; score function s(·), obtained from a pre-trained outlier detection model. 2: Compute the calibration scores S_i = s(X_i) for all i ∈ D_cal. 3: Sort the calibration scores such that S_{π(1)} ≤ … ≤ S_{π(n)}, where π : [n] → [n] is the corresponding permutation of the indices. 4: Annotate the m largest scores, D_labeled := {(S_{π(i)}, Y_{π(i)}) : i > n − m}, with Y_{π(i)} = 0 if X_{π(i)} is an inlier and Y_{π(i)} = 1 otherwise. 5: Construct the trimmed calibration set D^LT_cal = {π(i) : i ≤ n − m} ∪ {j : j ∈ D_labeled and Y_j = 0}. 6: Output: trimmed calibration set D^LT_cal. Algorithm 2 (label-trim calibration, testing phase): 1: Input: test point X_{n+1}; score function s; trimmed calibration set D^LT_cal; type-I error level α. 2: Compute the conformal p-value p̂^LT_{n+1} according to (4). 3: Output: reject the null hypothesis H_0 if p̂^LT_{n+1} ≤ α, classifying X_{n+1} as an outlier. |
| Open Source Code | Yes | A software that implements the proposed method is available at https://github.com/Meshiba/robust-conformal-od. |
| Open Datasets | Yes | We turn to evaluate the performance of conformal outlier detection methods under contaminated data. The experiments presented in this section are conducted on nine benchmark datasets: three tabular datasets, listed in Section 4.1, and six visual datasets, listed in Section 4.2. ... Specifically, the outliers are drawn from (1) MNIST (Deng, 2012), (2) SVHN (Netzer et al., 2011), (3) Texture (Cimpoi et al., 2014), (4) Places365 (Cimpoi et al., 2014), (5) Tiny ImageNet (Torralba et al., 2008), and (6) CIFAR100 (Krizhevsky et al., 2009). ... KDD Cup 1999 Data Set. https://www.kaggle.com/mlg-ulb/creditcardfraud. ... Credit Card Fraud Detection Data Set. https://www.kaggle.com/mlg-ulb/creditcardfraud. ... Statlog (Shuttle) Data Set. http://odds.cs.stonybrook.edu/shuttle-dataset. |
| Dataset Splits | Yes | In all experiments, we randomly split a given dataset into disjoint training (D_train), calibration (D_cal), and test sets of inliers (D^inlier_test) and outliers (D^outlier_test). To simulate a realistic setting, we construct the training and contaminated calibration sets with the same contamination rate of r%. The inlier and outlier test sets, D^inlier_test and D^outlier_test, are used to compute the type-I error and power of the outlier detection model, respectively. To ensure fair comparisons, all conformal methods use the same outlier detection model, trained on D_train. Performance metrics are evaluated across 100 random splits of the data. The size of each dataset, along with the details of how D_train, D_cal, and D_test are constructed, are provided in Appendix B.1. ... Specifically, for the Shuttle and KDDCup99 datasets, the train set contains 5,000 samples, and the calibration set contains 2,500 samples, both with a contamination rate of r = 3%, unless stated otherwise. The inlier and outlier test sets consist of 950 and 50 samples, respectively. For the credit-card dataset, the train set contains 2,000 samples, while the calibration and test sets follow the same setup as the Shuttle and KDDCup99 datasets. ... The sizes of the train and calibration sets are 2,000 and 3,000, respectively, with the same contamination rate. The inlier and outlier test sets consist of 950 and 50 samples, respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or cloud computing resources used for running its experiments. |
| Software Dependencies | No | For all conformal methods, we use Isolation Forest (Liu et al., 2008) as the base outlier detection model, implemented using scikit-learn with default hyperparameters (Buitinck et al., 2013). ... We also include an additional result (Corollary A.3) in Appendix A, which reaches qualitatively similar conclusions by adopting an approach more closely aligned with Theorem 1 from Sesia et al. (2024). ... For all datasets, we use the outlier detection model proposed by Sun et al. (2021), ReAct, which operates on feature representations extracted by a pre-trained ResNet-18 model. ... ReAct with VGG-19: Same as above, but with a pre-trained VGG-19 backbone (Chen) instead of ResNet-18. ... SCALE with ResNet-18: SCALE (Xu et al., 2024) operates on feature representations extracted from a pre-trained ResNet-18 model (Zhang et al., 2024; He et al., 2016). ... Here, we consider two additional outlier detection models: Local Outlier Factor (LOF) with 100 estimators and One-Class Support Vector Machine (OC-SVM) with an RBF kernel, both implemented via scikit-learn. While various software tools and libraries are mentioned, specific version numbers (e.g., for scikit-learn or Python) are not provided. |
| Experiment Setup | Yes | For all conformal methods, we use Isolation Forest (Liu et al., 2008) as the base outlier detection model, implemented using scikit-learn with default hyperparameters (Buitinck et al., 2013). ... Label-Trim: Our proposed reliable data-cleaning method from Section 3.3, applied with a labeling budget of m = 50 annotations to label the m data points with the largest non-conformity scores from D_cal. ... ReAct with ResNet-18: ... The model applies a percentile-based threshold (set to 90%) to truncate activations, where the threshold is computed on the contaminated train set. ... SCALE with ResNet-18: ... The model rescales the activations using a sample-specific factor, defined as the sum of all activations divided by the sum of activations below a certain percentile (set to 65%). ... Local Outlier Factor (LOF) with 100 estimators and One-Class Support Vector Machine (OC-SVM) with an RBF kernel, both implemented via scikit-learn. |
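The label-trim procedure quoted in the Pseudocode row (Algorithms 1 and 2) can be sketched end to end in Python with the paper's own base model, Isolation Forest from scikit-learn. This is a minimal illustration under stated assumptions, not the authors' released code (their reference implementation is in the linked repository): the `y_cal` oracle labels stand in for the m human annotations that Algorithm 1 would request, the synthetic Gaussian inliers/outliers are invented for the demo, and the p-value uses the standard conformal formula p̂ = (1 + #{i : S_i ≥ S_test}) / (n + 1).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def label_trim(scores, labels, m):
    """Algorithm 1 sketch: drop annotated outliers among the m largest scores.

    scores : non-conformity scores of the contaminated calibration set.
    labels : 0/1 annotations (1 = outlier); in practice only the m
             top-scoring points would be sent to an annotator.
    m      : labeling budget.
    Returns the trimmed calibration scores.
    """
    order = np.argsort(scores)                 # pi: ascending sort of scores
    keep = list(order[: len(scores) - m])      # low-score points, kept as-is
    for i in order[len(scores) - m:]:          # annotate the m largest
        if labels[i] == 0:                     # keep only confirmed inliers
            keep.append(i)
    return scores[np.asarray(keep)]

def conformal_pvalue(cal_scores, test_score):
    """Standard conformal p-value: (1 + #{S_i >= S_test}) / (n + 1)."""
    return (1 + np.sum(cal_scores >= test_score)) / (len(cal_scores) + 1)

rng = np.random.default_rng(0)
# Synthetic data: inliers ~ N(0, I), outliers shifted to mean 4.
# Both train and calibration sets are contaminated at 3%, as in the paper.
X_train = np.vstack([rng.normal(0, 1, (970, 2)), rng.normal(4, 1, (30, 2))])
X_cal = np.vstack([rng.normal(0, 1, (485, 2)), rng.normal(4, 1, (15, 2))])
y_cal = np.array([0] * 485 + [1] * 15)   # oracle annotations for this sketch

model = IsolationForest(random_state=0).fit(X_train)
s = lambda X: -model.score_samples(X)     # higher score = more outlying
cal_scores = s(X_cal)

trimmed = label_trim(cal_scores, y_cal, m=50)
p_out = conformal_pvalue(trimmed, s(np.array([[4.0, 4.0]]))[0])
p_in = conformal_pvalue(trimmed, s(np.array([[0.0, 0.0]]))[0])
# Algorithm 2: reject H0 (declare an outlier) when the p-value <= alpha.
print(p_out, p_in)
```

Trimming only the annotated top-m scores is what distinguishes this from a naive approach that removes all m largest points: confirmed inliers among the top scores are returned to the calibration set, which protects against over-correcting the p-value.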