Detecting Errors in a Numerical Response via any Regression Model

Authors: Hang Zhou, Jonas Mueller, Mayank Kumar, Jane-Ling Wang, Jing Lei

DMLR 2024

Reproducibility Assessment (each entry lists the variable, the result, and the supporting excerpt from the paper)
Research Type: Experimental
Evidence: "We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches."
Researcher Affiliation: Collaboration
Evidence: Hang Zhou (EMAIL), Department of Statistics, University of California, Davis, CA 95618, USA; Jonas Mueller (EMAIL), Cleanlab, San Francisco, CA 94110, USA; Mayank Kumar (EMAIL), Cleanlab, San Francisco, CA 94110, USA; Jane-Ling Wang (EMAIL), Department of Statistics, University of California, Davis, CA 95618, USA; Jing Lei (EMAIL), Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Pseudocode: Yes
Evidence: "Algorithm 1 (Filtering procedure to reduce the amount of erroneous data). Input: Dataset D; a regression model A; the maximum proportion of corrupted data K_err. ... Algorithm 2 (Conformal Outlier Detection). Input: Training set D_train, calibration set D_cal, and testing set D_test; a model A; a conformal score s(x, y); a target FDR level α."
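Algorithm 1's filtering step can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it scores each datapoint by its out-of-sample absolute residual (the "veracity score" computed via K-fold cross-validation, as described in the paper's split description) and drops the K_err fraction with the worst scores. The ordinary-least-squares fit is a stand-in for the arbitrary regression model A, and the single-pass filtering and helper names are assumptions.

```python
import numpy as np

def out_of_sample_residuals(X, y, n_folds=5, seed=0):
    """Veracity scores via K-fold cross-validation: each datapoint is
    scored by the absolute residual of a model that never saw it.
    An OLS fit stands in for the generic regression model A."""
    n = len(y)
    rng = np.random.default_rng(seed)
    fold = rng.permutation(n) % n_folds          # random fold assignment
    resid = np.empty(n)
    Xb = np.column_stack([np.ones(n), X])        # add an intercept column
    for k in range(n_folds):
        train, test = fold != k, fold == k
        beta, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)
        resid[test] = np.abs(y[test] - Xb[test] @ beta)
    return resid

def filter_dataset(X, y, k_err=0.1):
    """Drop the k_err fraction of datapoints with the largest
    out-of-sample residuals (those most likely to be erroneous)."""
    scores = out_of_sample_residuals(X, y)
    keep = scores <= np.quantile(scores, 1.0 - k_err)
    return X[keep], y[keep]
```

With corrupted responses that deviate strongly from the regression function, the corrupted points receive the largest out-of-sample residuals and are removed first.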
Open Source Code: Yes
Evidence: "Code to run our method: https://github.com/cleanlab/cleanlab. Code to reproduce the paper: https://github.com/cleanlab/regression-label-error-benchmark"
Open Datasets: Yes
Evidence: "Here, we evaluate the performance of our proposed methods using five publicly available datasets... Detailed information regarding these datasets can be found in Section B of the Supplement."
- Air Quality: a subset of the benchmark dataset provided by the UCI repository at https://archive.ics.uci.edu/ml/datasets/air+quality
- Metaphor Novelty: derived from data provided by http://hilt.cse.unt.edu/resources.html
- Stanford Politeness Dataset (Stack edition): derived from data provided by https://convokit.cornell.edu/documentation/stack_politeness.html
- Stanford Politeness Dataset (Wikipedia edition): derived from data provided by https://convokit.cornell.edu/documentation/wiki_politeness.html
- qPCR Telomere: a subset of the dataset generated by an R script provided by https://zenodo.org/record/2615735#.ZBpLES-B30p
Dataset Splits: Yes
Evidence: "The splitting conformal method requires a training set to fit the model, a calibration set to evaluate the rank of the scores, and a testing set to assess performance. For each setting, the training and calibration sets are generated based on the aforementioned settings without errors. For the testing set, 10% of the data are designated as errors with a corruption strength a = -3, -2, -1, 1, 2, 3, while the remaining 90% are benign datapoints, having the same distributions as those in the training and calibration sets... the sample size n = 200 is the same for D_train, D_cal, and D_test. ... Fit model A via K-fold cross-validation over the whole dataset, and compute veracity scores for each datapoint via out-of-sample predictions."
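The split conformal procedure described above can be sketched as follows. This is an illustrative sketch, not the paper's exact Algorithm 2: it converts calibration-set scores into conformal p-values for the test points and then applies the Benjamini-Hochberg rule at level α, which is the standard recipe for FDR-controlled conformal outlier detection. All function names here are hypothetical, and larger scores are assumed to indicate more outlying points.

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """Conformal p-value of each test point against the calibration set:
    (1 + #{calibration scores >= test score}) / (n_cal + 1).
    Larger scores are assumed to mean more outlying."""
    cal = np.asarray(cal_scores)
    return np.array(
        [(1 + np.sum(cal >= s)) / (len(cal) + 1) for s in test_scores]
    )

def bh_reject(pvals, alpha=0.1):
    """Benjamini-Hochberg: flag the largest set of test points whose
    sorted p-values lie below the BH line, targeting FDR <= alpha."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    thresh = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(pvals[order] <= thresh)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size:
        reject[order[: below[-1] + 1]] = True  # reject up to the last crossing
    return reject
```

Here the scores would come from the conformal score s(x, y) evaluated with a model A fit on the training set; the calibration set supplies the reference ranks, exactly as the split described in the excerpt requires.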
Hardware Specification: No
Evidence: The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies: No
Evidence: "All regression models fit in this paper (including the weighted ensemble) were implemented via the AutoGluon AutoML package (Erickson et al., 2020), which automatically provides good hyperparameter settings and manages the training of each model. ... When applying RANSAC, we used its default hyperparameter settings in the scikit-learn package. ... We use numerical covariates obtained by embedding each text example via a pretrained Transformer network from the Sentence Transformers package (Reimers and Gurevych, 2019)." The paper mentions software packages such as AutoGluon, LightGBM, scikit-learn, and Sentence Transformers but does not provide specific version numbers for any of them.
Experiment Setup: No
Evidence: "All regression models fit in this paper (including the weighted ensemble) were implemented via the AutoGluon AutoML package (Erickson et al., 2020), which automatically provides good hyperparameter settings and manages the training of each model."