Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Mitigating Confirmation Bias in Semi-supervised Learning via Efficient Bayesian Model Averaging
Authors: Charlotte Loh, Rumen Dangovski, Shivchander Sudalairaj, Seungwook Han, Ligong Han, Leonid Karlinsky, Marin Soljačić, Akash Srivastava
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that BaM- effectively improves model calibration, resulting in better performance on standard benchmarks like CIFAR-10 and CIFAR-100, notably giving up to 16% gains in test accuracy. |
| Researcher Affiliation | Collaboration | Charlotte Loh (MIT EECS, MIT-IBM Watson AI Lab); Rumen Dangovski (MIT EECS); Shivchander Sudalairaj (MIT-IBM Watson AI Lab); Seungwook Han (MIT EECS, MIT-IBM Watson AI Lab); Ligong Han (Rutgers University, MIT-IBM Watson AI Lab); Leonid Karlinsky (MIT-IBM Watson AI Lab); Marin Soljačić (MIT Physics); Akash Srivastava (MIT-IBM Watson AI Lab) |
| Pseudocode | Yes | Algorithm 1 Snippet of PyTorch-style pseudocode showing pseudo-labeling in BaM-UDA. Algorithm 2 PyTorch-style pseudocode for Bayesian model averaging in UDA or FixMatch. Algorithm 3 PyTorch-style pseudocode for PAWS-SWA and PAWS-EMA. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its source code, a direct link to a code repository, or mention of code in supplementary materials. |
| Open Datasets | Yes | We demonstrate that BaM- mitigates confirmation bias in SOTA SSL methods across standard vision benchmarks of CIFAR-10, CIFAR-100 and ImageNet, giving up to 16% improvement in test accuracy on the CIFAR-100 with 400 labels benchmark. |
| Dataset Splits | Yes | Results are averaged over 3 random dataset splits. Benchmarks: CIFAR-10 with 250 and 2500 labels; CIFAR-100 with 400, 4000, and 10000 labels. We curate long-tailed versions from the CIFAR datasets following Cao et al. (2019), where α indicates the imbalance ratio... We randomly select 10% of samples from each class, under the constraint that at least 1 sample for each class is included in the labeled set, i.e. n_{l,i} = max(1, 0.1 n_{u,i}), where n_{l,i} is the number of labeled examples for class i. The total number of labeled and unlabeled examples for the different benchmarks in CIFAR-10-LT and CIFAR-100-LT are summarized in Table 9. The test set remains unchanged, i.e. we use the original (class-balanced) test set of the CIFAR datasets with 10,000 samples. |
| Hardware Specification | Yes | All CIFAR-10 and CIFAR-100 experiments in this work were computed using a single Nvidia V100 GPU. ImageNet experiments were computed using 64 Nvidia V100 GPUs. |
| Software Dependencies | No | The paper mentions "PyTorch-style pseudocode", indicating the use of PyTorch, but does not provide specific version numbers for PyTorch or any other software dependency. |
| Experiment Setup | Yes | In all our experiments, we begin with and modify upon the original implementations of the baseline SSL methods. The backbone encoder f is a WideResNet-28-2 and WideResNet-28-8 for the CIFAR-10 and CIFAR-100 benchmarks respectively. We use the default hyperparameters and dataset-specific settings (learning rates, batch size, optimizers and schedulers) recommended by the original authors for both the baselines and in BaM-. We set the weight priors in BaM- as unit Gaussians and use a separate Adam optimizer for the BNN layer with learning rate 0.01, no weight decay and impose the same cosine learning rate scheduler as the backbone. We set Q = 0.75 for the CIFAR-100 benchmark and Q = 0.95 for the CIFAR-10 benchmark, both linearly warmed up from 0.1 over the first 10 epochs. As Q is computed across batches, we improve stability by using a moving average of the last 50 quantiles. |
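The quantile-thresholding scheme described in the Experiment Setup row (a quantile level Q linearly warmed up from 0.1 over the first 10 epochs, with per-batch quantiles smoothed by a moving average over the last 50 values) can be sketched as follows. This is a minimal illustration of that mechanism under stated assumptions, not the authors' implementation; the class, function, and parameter names are hypothetical.

```python
from collections import deque

def quantile(values, q):
    # Linear-interpolation quantile of a list (NumPy's default 'linear' method).
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

class QuantileThreshold:
    # Hypothetical helper: tracks a quantile-based confidence cutoff with
    # linear warmup of the quantile level and a moving average over the
    # last `window` per-batch quantiles, as the setup describes.
    def __init__(self, q_final=0.95, q_init=0.1, warmup_epochs=10, window=50):
        self.q_final = q_final
        self.q_init = q_init
        self.warmup_epochs = warmup_epochs
        self.history = deque(maxlen=window)  # last `window` batch quantiles

    def current_q(self, epoch):
        # Linearly warm the quantile level from q_init up to q_final.
        if epoch >= self.warmup_epochs:
            return self.q_final
        return self.q_init + (epoch / self.warmup_epochs) * (self.q_final - self.q_init)

    def update(self, confidences, epoch):
        # confidences: max predicted probabilities for one batch.
        self.history.append(quantile(confidences, self.current_q(epoch)))
        # Averaging the stored batch quantiles stabilizes the cutoff.
        return sum(self.history) / len(self.history)

# Usage: keep pseudo-labels whose confidence exceeds the smoothed cutoff.
thr = QuantileThreshold(q_final=0.75)   # CIFAR-100 setting from the paper
conf = [0.20, 0.60, 0.90, 0.97]         # toy batch of confidences
cutoff = thr.update(conf, epoch=0)      # during warmup the quantile level is 0.1
mask = [c > cutoff for c in conf]       # pseudo-label keep/discard mask
```

The CIFAR-10 setting from the setup row would correspond to `q_final=0.95`; everything else (window size 50, warmup over 10 epochs from 0.1) matches the values quoted above.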