Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles
Authors: Jiefeng Chen, Frederick Liu, Besim Avci, Xi Wu, Yingyu Liang, Somesh Jha
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on 59 tasks over five dataset categories including image classification and sentiment classification datasets show that our method achieves state-of-the-art on both accuracy estimation and error detection (Section 7). |
| Researcher Affiliation | Collaboration | Jiefeng Chen, Department of Computer Science, University of Wisconsin-Madison, Madison, WI 53706, EMAIL; Frederick Liu, Google, Seattle, WA 98103, EMAIL; Besim Avci, Google, Seattle, WA 98103, EMAIL; Xi Wu, Google, Madison, WI 53703, EMAIL; Yingyu Liang, Department of Computer Science, University of Wisconsin-Madison, Madison, WI 53706, EMAIL; Somesh Jha, Department of Computer Science, University of Wisconsin-Madison, Madison, WI 53706, EMAIL |
| Pseudocode | Yes | Framework 1 Error Detection and Unsupervised Accuracy Estimation via Self-Training Ensembles (Page 3) and Algorithms 1, 2, and 3 (Pages 5-6) provide structured pseudocode. (A hedged sketch of the core disagreement idea follows the table.) |
| Open Source Code | Yes | Our code is available at: https://github.com/jfc43/self-training-ensembles. |
| Open Datasets | Yes | We use the following dataset categories: Digits (including MNIST [26], MNIST-M [12], SVHN [29], USPS [19]), Office-31 [33], CIFAR10-C [24], iWildCam [1], and Amazon Review [2]. |
| Dataset Splits | Yes | For all image datasets, we use a random split: 80% of the training data for training and 20% for validation. (A split sketch follows the table.) |
| Hardware Specification | Yes | For training, we use NVIDIA GPUs (e.g., V100 or A100). |
| Software Dependencies | No | The paper does not explicitly list software dependencies with version numbers, such as specific Python, PyTorch, or TensorFlow versions. |
| Experiment Setup | Yes | We train all models for 100 epochs with the Adam optimizer, an initial learning rate of 1e-3, learning rate decay by 0.5 every 20 epochs, and batch size 64. ... In our experiments, we set T = 5 and N = 5 by considering the computational cost (on Amazon Review, we set N = 20). We set γ = 0.1 and set α following the domain adaptation methods. (A training-loop sketch using these hyperparameters follows the table.) |
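
To make the reported split concrete, below is a minimal PyTorch sketch of the 80%/20% train/validation split described in the table. The dataset choice (MNIST via torchvision, standing in for any of the paper's image datasets) and the fixed seed are our assumptions, not details from the paper.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# MNIST stands in for any of the paper's image datasets (assumption).
train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())

# Random split: 80% of the training data for training, 20% for validation.
n_train = int(0.8 * len(train_set))
train_subset, val_subset = random_split(
    train_set,
    [n_train, len(train_set) - n_train],
    generator=torch.Generator().manual_seed(0),  # fixed seed is an assumption
)
```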
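The experiment-setup row can likewise be turned into a hedged training-loop sketch: Adam with an initial learning rate of 1e-3, decay by 0.5 every 20 epochs, batch size 64, and 100 epochs, reusing `train_subset` from the split sketch above. The linear classifier is a placeholder of ours; the paper's actual architectures vary by dataset.

```python
from torch import nn, optim
from torch.utils.data import DataLoader

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder model
loader = DataLoader(train_subset, batch_size=64, shuffle=True)

optimizer = optim.Adam(model.parameters(), lr=1e-3)
# StepLR reproduces "learning rate decay by 0.5 every 20 epochs".
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```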
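Finally, the pseudocode row refers to Framework 1 (error detection and unsupervised accuracy estimation via self-training ensembles). The snippet below is only a minimal sketch of the underlying disagreement idea, not the paper's full self-training procedure: it takes precomputed predictions from an ensemble on unlabeled target data, uses the ensemble's majority vote as a pseudo-label, flags points where the evaluated model disagrees as likely errors, and reports the agreement rate as the accuracy estimate. All names are illustrative.

```python
import numpy as np

def estimate_accuracy_and_errors(model_preds, ensemble_preds):
    """model_preds: (n,) labels predicted by the model under evaluation.
    ensemble_preds: (n, T) labels from T ensemble members on the same
    unlabeled data. Returns (estimated accuracy, per-point error flags)."""
    # Majority vote of the ensemble acts as a pseudo-label for each point.
    pseudo_labels = np.array([np.bincount(row).argmax()
                              for row in ensemble_preds])
    # Disagreement with the ensemble marks a point as a likely error.
    error_flags = model_preds != pseudo_labels
    return 1.0 - error_flags.mean(), error_flags

# Toy usage with random labels; T = 5 matches the paper's setting.
rng = np.random.default_rng(0)
ensemble_preds = rng.integers(0, 10, size=(1000, 5))
model_preds = ensemble_preds[:, 0]
acc_est, errors = estimate_accuracy_and_errors(model_preds, ensemble_preds)
```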