Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles
Authors: Jiefeng Chen, Frederick Liu, Besim Avci, Xi Wu, Yingyu Liang, Somesh Jha
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on 59 tasks over five dataset categories including image classification and sentiment classification datasets show that our method achieves state-of-the-art on both accuracy estimation and error detection (Section 7). |
| Researcher Affiliation | Collaboration | Jiefeng Chen, Department of Computer Science, University of Wisconsin-Madison, Madison, WI 53706, EMAIL; Frederick Liu, Google, Seattle, WA 98103, EMAIL; Besim Avci, Google, Seattle, WA 98103, EMAIL; Xi Wu, Google, Madison, WI 53703, EMAIL; Yingyu Liang, Department of Computer Science, University of Wisconsin-Madison, Madison, WI 53706, EMAIL; Somesh Jha, Department of Computer Science, University of Wisconsin-Madison, Madison, WI 53706, EMAIL |
| Pseudocode | Yes | Framework 1 Error Detection and Unsupervised Accuracy Estimation via Self-Training Ensembles (Page 3) and Algorithms 1, 2, and 3 (Pages 5-6) provide structured pseudocode. (A hedged sketch of the core disagreement idea follows the table.) |
| Open Source Code | Yes | Our code is available at: https://github.com/jfc43/self-training-ensembles. |
| Open Datasets | Yes | We use the following dataset categories: Digits (including MNIST [26], MNIST-M [12], SVHN [29], USPS [19]), Office-31 [33], CIFAR10-C [24], iWildCam [1], and Amazon Review [2]. |
| Dataset Splits | Yes | For all image datasets, we use a random split: 80% of the training data for training and 20% for validation. (A split sketch follows the table.) |
| Hardware Specification | Yes | For training, we use NVIDIA GPUs (e.g., V100 or A100). |
| Software Dependencies | No | The paper does not explicitly list software dependencies with version numbers, such as specific Python, PyTorch, or TensorFlow versions. |
| Experiment Setup | Yes | We train all models for 100 epochs with the Adam optimizer, an initial learning rate of 1e-3, learning rate decay by 0.5 every 20 epochs, and batch size 64. ... In our experiments, we set T = 5 and N = 5 by considering the computational cost (on Amazon Review, we set N = 20). We set γ = 0.1 and set α following the domain adaptation methods. (A training-loop sketch using these hyperparameters follows the table.) |
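
To make the reported split concrete, below is a minimal PyTorch sketch of the 80%/20% train/validation split described in the table. The dataset choice (MNIST via torchvision, standing in for any of the paper's image datasets) and the fixed seed are our assumptions, not details from the paper.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# MNIST stands in for any of the paper's image datasets (assumption).
train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())

# Random split: 80% of the training data for training, 20% for validation.
n_train = int(0.8 * len(train_set))
train_subset, val_subset = random_split(
    train_set,
    [n_train, len(train_set) - n_train],
    generator=torch.Generator().manual_seed(0),  # fixed seed is an assumption
)
```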
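The experiment-setup row can likewise be turned into a hedged training-loop sketch: Adam with an initial learning rate of 1e-3, decay by 0.5 every 20 epochs, batch size 64, and 100 epochs, reusing `train_subset` from the split sketch above. The linear classifier is a placeholder of ours; the paper's actual architectures vary by dataset.

```python
from torch import nn, optim
from torch.utils.data import DataLoader

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder model
loader = DataLoader(train_subset, batch_size=64, shuffle=True)

optimizer = optim.Adam(model.parameters(), lr=1e-3)
# StepLR reproduces "learning rate decay by 0.5 every 20 epochs".
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```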
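Finally, the pseudocode row refers to Framework 1 (error detection and unsupervised accuracy estimation via self-training ensembles). The snippet below is only a minimal sketch of the underlying disagreement idea, not the paper's full self-training procedure: it takes precomputed predictions from an ensemble on unlabeled target data, uses the ensemble's majority vote as a pseudo-label, flags points where the evaluated model disagrees as likely errors, and reports the agreement rate as the accuracy estimate. All names are illustrative.

```python
import numpy as np

def estimate_accuracy_and_errors(model_preds, ensemble_preds):
    """model_preds: (n,) labels predicted by the model under evaluation.
    ensemble_preds: (n, T) labels from T ensemble members on the same
    unlabeled data. Returns (estimated accuracy, per-point error flags)."""
    # Majority vote of the ensemble acts as a pseudo-label for each point.
    pseudo_labels = np.array([np.bincount(row).argmax()
                              for row in ensemble_preds])
    # Disagreement with the ensemble marks a point as a likely error.
    error_flags = model_preds != pseudo_labels
    return 1.0 - error_flags.mean(), error_flags

# Toy usage with random labels; T = 5 matches the paper's setting.
rng = np.random.default_rng(0)
ensemble_preds = rng.integers(0, 10, size=(1000, 5))
model_preds = ensemble_preds[:, 0]
acc_est, errors = estimate_accuracy_and_errors(model_preds, ensemble_preds)
```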