Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Mitigating Confirmation Bias in Semi-supervised Learning via Efficient Bayesian Model Averaging
Authors: Charlotte Loh, Rumen Dangovski, Shivchander Sudalairaj, Seungwook Han, Ligong Han, Leonid Karlinsky, Marin Soljačić, Akash Srivastava
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that BaM- effectively improves model calibration, resulting in better performance on standard benchmarks like CIFAR-10 and CIFAR-100, notably giving up to 16% gains in test accuracy. |
| Researcher Affiliation | Collaboration | Charlotte Loh (MIT EECS, MIT-IBM Watson AI Lab); Rumen Dangovski (MIT EECS); Shivchander Sudalairaj (MIT-IBM Watson AI Lab); Seungwook Han (MIT EECS, MIT-IBM Watson AI Lab); Ligong Han (Rutgers University, MIT-IBM Watson AI Lab); Leonid Karlinsky (MIT-IBM Watson AI Lab); Marin Soljačić (MIT Physics); Akash Srivastava (MIT-IBM Watson AI Lab) |
| Pseudocode | Yes | Algorithm 1 Snippet of PyTorch-style pseudocode showing pseudo-labeling in BaM-UDA. Algorithm 2 PyTorch-style pseudocode for Bayesian model averaging in UDA or FixMatch. Algorithm 3 PyTorch-style pseudocode for PAWS-SWA and PAWS-EMA. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its source code, a direct link to a code repository, or mention of code in supplementary materials. |
| Open Datasets | Yes | We demonstrate that BaM- mitigates confirmation bias in SOTA SSL methods across standard vision benchmarks of CIFAR-10, CIFAR-100 and ImageNet, giving up to 16% improvement in test accuracy on the CIFAR-100 with 400 labels benchmark. |
| Dataset Splits | Yes | Results are averaged over 3 random dataset splits. Benchmarks: CIFAR-10 with 250 and 2500 labels; CIFAR-100 with 400, 4000, and 10000 labels. We curate long-tailed versions from the CIFAR datasets following Cao et al. (2019), where α indicates the imbalance ratio... We randomly select 10% of samples from each class, under the constraint that at least 1 sample for each class is included in the labeled set, i.e. n_{l,i} = max(1, 0.1 n_{u,i}), where n_{l,i} is the number of labeled examples for class i. The total number of labeled and unlabeled examples for the different benchmarks in CIFAR-10-LT and CIFAR-100-LT are summarized in Table 9. The test set remains unchanged, i.e. we use the original (class-balanced) test set of the CIFAR datasets with 10,000 samples. |
| Hardware Specification | Yes | All CIFAR-10 and CIFAR-100 experiments in this work were computed using a single Nvidia V100 GPU. ImageNet experiments were computed using 64 Nvidia V100 GPUs. |
| Software Dependencies | No | The paper mentions "PyTorch-style pseudocode", indicating the use of PyTorch, but does not provide specific version numbers for PyTorch or any other software dependency. |
| Experiment Setup | Yes | In all our experiments, we begin with and modify upon the original implementations of the baseline SSL methods. The backbone encoder f is a WideResNet-28-2 and WideResNet-28-8 for the CIFAR-10 and CIFAR-100 benchmarks respectively. We use the default hyperparameters and dataset-specific settings (learning rates, batch size, optimizers and schedulers) recommended by the original authors for both the baselines and in BaM-. We set the weight priors in BaM- as unit Gaussians and use a separate Adam optimizer for the BNN layer with learning rate 0.01, no weight decay and impose the same cosine learning rate scheduler as the backbone. We set Q = 0.75 for the CIFAR-100 benchmark and Q = 0.95 for the CIFAR-10 benchmark, both linearly warmed up from 0.1 over the first 10 epochs. As Q is computed across batches, we improve stability by using a moving average of the last 50 quantiles. |
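The quantile-thresholding scheme described in the Experiment Setup row (a quantile level Q linearly warmed up from 0.1 over the first 10 epochs, with per-batch quantiles smoothed by a moving average over the last 50 values) can be sketched as follows. This is a minimal illustration of that mechanism under stated assumptions, not the authors' implementation; the class, function, and parameter names are hypothetical.

```python
from collections import deque

def quantile(values, q):
    # Linear-interpolation quantile of a list (NumPy's default 'linear' method).
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

class QuantileThreshold:
    # Hypothetical helper: tracks a quantile-based confidence cutoff with
    # linear warmup of the quantile level and a moving average over the
    # last `window` per-batch quantiles, as the setup describes.
    def __init__(self, q_final=0.95, q_init=0.1, warmup_epochs=10, window=50):
        self.q_final = q_final
        self.q_init = q_init
        self.warmup_epochs = warmup_epochs
        self.history = deque(maxlen=window)  # last `window` batch quantiles

    def current_q(self, epoch):
        # Linearly warm the quantile level from q_init up to q_final.
        if epoch >= self.warmup_epochs:
            return self.q_final
        return self.q_init + (epoch / self.warmup_epochs) * (self.q_final - self.q_init)

    def update(self, confidences, epoch):
        # confidences: max predicted probabilities for one batch.
        self.history.append(quantile(confidences, self.current_q(epoch)))
        # Averaging the stored batch quantiles stabilizes the cutoff.
        return sum(self.history) / len(self.history)

# Usage: keep pseudo-labels whose confidence exceeds the smoothed cutoff.
thr = QuantileThreshold(q_final=0.75)   # CIFAR-100 setting from the paper
conf = [0.20, 0.60, 0.90, 0.97]         # toy batch of confidences
cutoff = thr.update(conf, epoch=0)      # during warmup the quantile level is 0.1
mask = [c > cutoff for c in conf]       # pseudo-label keep/discard mask
```

The CIFAR-10 setting from the setup row would correspond to `q_final=0.95`; everything else (window size 50, warmup over 10 epochs from 0.1) matches the values quoted above.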