Detecting Errors in a Numerical Response via any Regression Model

Authors: Hang Zhou, Jonas Mueller, Mayank Kumar, Jane-Ling Wang, Jing Lei

DMLR 2024

Reproducibility Assessment (each entry lists the variable, the result, and the supporting excerpt from the paper)
Research Type: Experimental
Evidence: "We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches."
Researcher Affiliation: Collaboration
Evidence: Hang Zhou (EMAIL), Department of Statistics, University of California, Davis, CA 95618, USA; Jonas Mueller (EMAIL), Cleanlab, San Francisco, CA 94110, USA; Mayank Kumar (EMAIL), Cleanlab, San Francisco, CA 94110, USA; Jane-Ling Wang (EMAIL), Department of Statistics, University of California, Davis, CA 95618, USA; Jing Lei (EMAIL), Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Pseudocode: Yes
Evidence: "Algorithm 1 (Filtering procedure to reduce the amount of erroneous data). Input: Dataset D; a regression model A; the maximum proportion of corrupted data K_err. ... Algorithm 2 (Conformal Outlier Detection). Input: Training set D_train, calibration set D_cal, and testing set D_test; a model A; a conformal score s(x, y); a target FDR level α."
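Algorithm 1's filtering step can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it scores each datapoint by its out-of-sample absolute residual (the "veracity score" computed via K-fold cross-validation, as described in the paper's split description) and drops the K_err fraction with the worst scores. The ordinary-least-squares fit is a stand-in for the arbitrary regression model A, and the single-pass filtering and helper names are assumptions.

```python
import numpy as np

def out_of_sample_residuals(X, y, n_folds=5, seed=0):
    """Veracity scores via K-fold cross-validation: each datapoint is
    scored by the absolute residual of a model that never saw it.
    An OLS fit stands in for the generic regression model A."""
    n = len(y)
    rng = np.random.default_rng(seed)
    fold = rng.permutation(n) % n_folds          # random fold assignment
    resid = np.empty(n)
    Xb = np.column_stack([np.ones(n), X])        # add an intercept column
    for k in range(n_folds):
        train, test = fold != k, fold == k
        beta, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)
        resid[test] = np.abs(y[test] - Xb[test] @ beta)
    return resid

def filter_dataset(X, y, k_err=0.1):
    """Drop the k_err fraction of datapoints with the largest
    out-of-sample residuals (those most likely to be erroneous)."""
    scores = out_of_sample_residuals(X, y)
    keep = scores <= np.quantile(scores, 1.0 - k_err)
    return X[keep], y[keep]
```

With corrupted responses that deviate strongly from the regression function, the corrupted points receive the largest out-of-sample residuals and are removed first.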
Open Source Code: Yes
Evidence: "Code to run our method: https://github.com/cleanlab/cleanlab. Code to reproduce the paper: https://github.com/cleanlab/regression-label-error-benchmark"
Open Datasets: Yes
Evidence: "Here, we evaluate the performance of our proposed methods using five publicly available datasets... Detailed information regarding these datasets can be found in Section B of the Supplement."
- Air Quality: a subset of the benchmark dataset provided by the UCI repository at https://archive.ics.uci.edu/ml/datasets/air+quality
- Metaphor Novelty: derived from data provided by http://hilt.cse.unt.edu/resources.html
- Stanford Politeness Dataset (Stack edition): derived from data provided by https://convokit.cornell.edu/documentation/stack_politeness.html
- Stanford Politeness Dataset (Wikipedia edition): derived from data provided by https://convokit.cornell.edu/documentation/wiki_politeness.html
- qPCR Telomere: a subset of the dataset generated by an R script provided by https://zenodo.org/record/2615735#.ZBpLES-B30p
Dataset Splits: Yes
Evidence: "The splitting conformal method requires a training set to fit the model, a calibration set to evaluate the rank of the scores, and a testing set to assess performance. For each setting, the training and calibration sets are generated based on the aforementioned settings without errors. For the testing set, 10% of the data are designated as errors with a corruption strength a = -3, -2, -1, 1, 2, 3, while the remaining 90% are benign datapoints, having the same distributions as those in the training and calibration sets... the sample size n = 200 is the same for D_train, D_cal, and D_test. ... Fit model A via K-fold cross-validation over the whole dataset, and compute veracity scores for each datapoint via out-of-sample predictions."
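The split conformal procedure described above can be sketched as follows. This is an illustrative sketch, not the paper's exact Algorithm 2: it converts calibration-set scores into conformal p-values for the test points and then applies the Benjamini-Hochberg rule at level α, which is the standard recipe for FDR-controlled conformal outlier detection. All function names here are hypothetical, and larger scores are assumed to indicate more outlying points.

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """Conformal p-value of each test point against the calibration set:
    (1 + #{calibration scores >= test score}) / (n_cal + 1).
    Larger scores are assumed to mean more outlying."""
    cal = np.asarray(cal_scores)
    return np.array(
        [(1 + np.sum(cal >= s)) / (len(cal) + 1) for s in test_scores]
    )

def bh_reject(pvals, alpha=0.1):
    """Benjamini-Hochberg: flag the largest set of test points whose
    sorted p-values lie below the BH line, targeting FDR <= alpha."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    thresh = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(pvals[order] <= thresh)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size:
        reject[order[: below[-1] + 1]] = True  # reject up to the last crossing
    return reject
```

Here the scores would come from the conformal score s(x, y) evaluated with a model A fit on the training set; the calibration set supplies the reference ranks, exactly as the split described in the excerpt requires.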
Hardware Specification: No
Evidence: The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies: No
Evidence: "All regression models fit in this paper (including the weighted ensemble) were implemented via the AutoGluon AutoML package (Erickson et al., 2020), which automatically provides good hyperparameter settings and manages the training of each model. ... When applying RANSAC, we used its default hyperparameter settings in the scikit-learn package. ... We use numerical covariates obtained by embedding each text example via a pretrained Transformer network from the Sentence Transformers package (Reimers and Gurevych, 2019)." The paper mentions software packages such as AutoGluon, LightGBM, scikit-learn, and Sentence Transformers but does not provide specific version numbers for any of them.
Experiment Setup: No
Evidence: "All regression models fit in this paper (including the weighted ensemble) were implemented via the AutoGluon AutoML package (Erickson et al., 2020), which automatically provides good hyperparameter settings and manages the training of each model."