Conformal Prediction with Cellwise Outliers: A Detect-then-Impute Approach
Authors: Qian Peng, Yajie Bao, Haojie Ren, Zhaojun Wang, Changliang Zou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 6. Simulation: We write N(µ, σ²) for the normal distribution with mean µ and variance σ², SN(µ, σ², α) for the skewed normal with skewness parameter α, t(k) for the t-distribution with k degrees of freedom, and Bern(p) for the Bernoulli distribution with success probability p. Given any x ∈ ℝ^d, define f(x) = E(Y_i \| X_i = x) and η_i = Y_i − f(X_i). We consider three data generation settings in Lei et al. (2018): ... 7. Application on real data. 7.1. Airfoil data: We apply the proposed method to the airfoil dataset from the UCI Machine Learning Repository (Dua & Graff, 2019), where the response Y and covariates X (with 5 dimensions) are described in Appendix E.4. |
| Researcher Affiliation | Academia | 1School of Statistics and Data Sciences, LPMC, KLMDASR and LEBPS, Nankai University, Tianjin, China 2School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China. Correspondence to: Yajie Bao <EMAIL>, Haojie Ren <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 PDI-CP. Input: calibration set {(X_i, Y_i)}_{i=n₀}^{n}, test feature X_{n+1}, prediction model μ̂, detection procedure D, imputation procedure I, miscoverage level α. 1: Ô_{n+1} ← D(X_{n+1}); 2: X̌^DI_{n+1} ← I(X_{n+1}, Ô_{n+1}); 3: for i = n₀, …, n do; 4: Ô_i ← D(X_i); 5: X̌_i ← I(X_i, Ô_i ∪ Ô_{n+1}); 6: Ř_i ← \|Y_i − μ̂(X̌_i)\|; 7: end for; 8: Ĉ^PDI(X_{n+1}) ← μ̂(X̌^DI_{n+1}) ± q̂⁺_α({Ř_i}_{i=n₀}^{n}). Output: Ĉ^PDI(X_{n+1}) |
| Open Source Code | No | No explicit statement about code release or a link to a code repository is provided in the paper. The 'Impact Statement' section only discusses the general applicability of the tools introduced. |
| Open Datasets | Yes | 7.1. Airfoil data: We apply the proposed method to the airfoil dataset from the UCI Machine Learning Repository (Dua & Graff, 2019), where the response Y and covariates X (with 5 dimensions) are described in Appendix E.4. 7.2. Wind direction data: Another example involves the hourly wind direction data from a meteorological station in the Central-West region of Brazil (https://tempo.inmet.gov.br/TabelaEstacoes/A001). 7.3. Riboflavin data: To further demonstrate robustness, we test our method on the gene expression dataset for riboflavin production provided by DSM (Kaiseraugst, Switzerland), which was offered by Bühlmann & Mandozzi (2014) and confirmed to have cellwise outliers by Liu et al. (2022). |
| Dataset Splits | Yes | All simulation results in the following are averages over 200 trials with 200 labeled data and 100 test data. 7.1. Airfoil data: We select 1000 labeled data and 500 test data in 100 trials. Since it is unknown which cells are outliers in reality, we artificially introduce outliers with ϵ = 0.02 to construct test features with both genuine and artificial cellwise outliers. The details of the experiment are presented in Appendix E.4. Creating training data, test data, and covariate shift: We repeated an experiment for 200 trials, and for each trial we randomly partition the data {(X_i, Y_i)}_{i=1}^{1000} into two equally sized subsets D_t and D_c, and construct a test set D_test containing cellwise outliers with the following steps. |
| Hardware Specification | No | The paper does not explicitly describe any specific hardware used for running its experiments, such as GPU models, CPU models, or cloud computing specifications. It only mentions funding sources in the 'Acknowledgements' section, which are not hardware specifications. |
| Software Dependencies | No | The paper mentions software components like 'random forests approach' (implicitly a machine learning library), 'k-Nearest Neighbour', 'Multivariate Imputation by Chained Equations', 'one-class SVM classifier', and 'DDC method'. However, it does not specify any version numbers for these software packages or libraries, which are necessary for reproducibility. |
| Experiment Setup | Yes | The nominal coverage level is set to 1 − α = 90% and d = 15. ...our method is still able to achieve target 1 − α coverage. The empirical TPR (true positive rate) and FDR (false discovery rate) of detection methods are given in Appendix E.2. 6.1. Combinations with other detection methods: This experiment verifies the validity of our methods under other plausible cellwise detection methods besides DDC. Here we consider two procedures: the one-class SVM classifier method (Bates et al., 2023a) with τ_j = 0.2 and the cellMCD estimate method (Raymaekers & Rousseeuw, 2024a) with τ_j = q_{χ²(1), 0.99}, where τ_j is determined to control the FDR (false discovery rate). 6.3. Performance under different contaminated ratios: Here we explore the effect of contamination levels ϵ on our method. We set D as DDC, and the parameter p in the detection threshold τ_j = q_{χ²(1), p} (adjusting p) corresponding to ϵ ∈ {0.1, 0.15, 0.2} to control FDR. Table 1 (empirical TPR and FDR of DDC with different thresholds under Setting A when ϵ = 0.1): for p = 0.99 / 0.9 / 0.7 / 0.5, TPR = 0.987 / 0.992 / 0.995 / 1; FDR = 0.035 / 0.340 / 0.669 / 0.793; PDI coverage = 0.902 / 0.901 / 0.907 / 0.909; JDI coverage = 0.895 / 0.899 / 0.904 / 0.904. |
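The quoted Algorithm 1 (PDI-CP) can be sketched as a split-conformal routine. This is a minimal illustration, not the authors' code: `detect` and `impute` are placeholder callables standing in for a cellwise detection procedure D (e.g. DDC) and an imputation procedure I (e.g. kNN imputation), and the function name `pdi_cp_interval` is invented for this sketch.

```python
import numpy as np

def pdi_cp_interval(X_cal, y_cal, x_test, mu_hat, detect, impute, alpha=0.1):
    """Sketch of PDI-CP: detect-then-impute split conformal prediction.

    detect(x)    -> boolean mask of cells flagged as outlying (placeholder for D)
    impute(x, m) -> x with cells where m is True imputed (placeholder for I)
    mu_hat(x)    -> point prediction from a pre-fitted model
    """
    # Steps 1-2: flag and impute the cells of the test feature.
    o_test = detect(x_test)
    x_test_imp = impute(x_test, o_test)
    # Steps 3-7: impute each calibration point over the union of its own
    # flagged cells and the test point's flagged cells, then score it.
    scores = []
    for x_i, y_i in zip(X_cal, y_cal):
        o_i = detect(x_i)
        x_i_imp = impute(x_i, o_i | o_test)
        scores.append(abs(y_i - mu_hat(x_i_imp)))
    # Step 8: split-conformal quantile q_hat^+_alpha of the residuals.
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    center = mu_hat(x_test_imp)
    return center - q, center + q
```

A toy call with a linear `mu_hat`, a crude |x| > 3 cell flag, and zero imputation (all illustrative choices) returns a finite interval `[lo, hi]` around the imputed test prediction.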