Towards Understanding Why FixMatch Generalizes Better Than Supervised Learning
Authors: Jingyang Li, Jiachun Pan, Vincent Y. F. Tan, Kim-Chuan Toh, Pan Zhou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results corroborate our theoretical findings and the enhanced generalization capability of SA-FixMatch. To corroborate our theoretical results, we evaluate SL, FixMatch, and SA-FixMatch on CIFAR-100 (Krizhevsky et al., 2009), STL-10 (Coates et al., 2011), Imagewoof (Howard & Gugger, 2020), and ImageNet (Deng et al., 2009). |
| Researcher Affiliation | Academia | Jingyang Li¹, Jiachun Pan¹, Vincent Y. F. Tan¹, Kim-Chuan Toh¹, Pan Zhou²; ¹National University of Singapore, ²Singapore Management University |
| Pseudocode | Yes | Appendix J: (SA-)FixMatch Algorithm. In this section, we present the detailed algorithm framework for FixMatch (Sohn et al., 2020) and SA-FixMatch. At iteration t, we first sample a batch of B labeled data X^(t) from the labeled dataset Z_l and a batch of μB unlabeled data U^(t) from the unlabeled dataset Z_u. Then, according to Algorithm 1, we calculate the loss for the current iteration and use it to update the neural network model F^(t). The only difference between FixMatch and SA-FixMatch is in line 6: FixMatch adopts CutOut in its strong augmentation A of unlabeled data, while SA-FixMatch adopts SA-CutOut.<br>Algorithm 1: (SA-)FixMatch algorithm.<br>1: Input: labeled batch X^(t) = {(X_i, y_i) : i ∈ (1, ..., B)}, unlabeled batch U^(t) = {U_i : i ∈ (1, ..., μB)}, confidence threshold τ, unlabeled data ratio μ, unlabeled loss weight λ.<br>2: L_s^(t) = −(1/B) Σ_{i=1}^{B} log logit_{y_i}(F^(t), α(X_i)) {cross-entropy loss for labeled data}<br>3: for i = 1 to μB do<br>4: v_i = argmax_j logit_j(F^(t), α(U_i)) {prediction after applying weak data augmentation to U_i}<br>5: end for<br>6: L_u^(t) = −(1/μB) Σ_{i=1}^{μB} 1{logit_{v_i}(F^(t), α(U_i)) ≥ τ} log logit_{v_i}(F^(t), A(U_i)) {cross-entropy loss with pseudo-labels and confidence masking for unlabeled data}<br>7: return L_s^(t) + λL_u^(t) |
| Open Source Code | No | For FixMatch experiments, we base our implementation on Kim (2020), while all other experiments follow Wang et al. (2022a). (The references are to third-party code/benchmarks, not the authors' own implementation for this paper.) |
| Open Datasets | Yes | To corroborate our theoretical results, we evaluate SL, FixMatch, and SA-FixMatch on CIFAR-100 (Krizhevsky et al., 2009), STL-10 (Coates et al., 2011), Imagewoof (Howard & Gugger, 2020), and ImageNet (Deng et al., 2009). |
| Dataset Splits | Yes | For each experiment in Sec. 5, following Sohn et al. (2020); Zhang et al. (2021a); Xu et al. (2021); Wang et al. (2022b); Chen et al. (2023), we randomly select image-label pairs from the entire training dataset according to the labeled data amount, use the images from the whole training dataset, without labels, as the unlabeled dataset, and use the standard test dataset. Table 9 (Summary of Datasets) details data statistics: STL-10: 105,000 training, 5,000 labeled, 8,000 test; CIFAR-100: 50,000 training, 50,000 labeled, 10,000 test; Imagewoof: 9,025 training, 9,025 labeled, 3,929 test; ImageNet: 1,281,167 training, 1,281,167 labeled, 50,000 test. |
| Hardware Specification | Yes | All experiments are conducted on four RTX 3090 GPUs (24GB memory). |
| Software Dependencies | No | No specific versions of software dependencies (e.g., Python, PyTorch, CUDA) are mentioned in the paper, only that the optimizer is standard SGD. The implementation for FixMatch experiments is based on Kim (2020), which is a PyTorch implementation. |
| Experiment Setup | Yes | For hyper-parameters, we use the same setting as FixMatch (Sohn et al., 2020). Concretely, the optimizer for all experiments is standard stochastic gradient descent (SGD) with a momentum of 0.9 (Sutskever et al., 2013). For all datasets, we use an initial learning rate of 0.03 with a cosine learning rate decay schedule (Loshchilov & Hutter, 2016), η = η₀ cos(7πk/(16K)), where η₀ is the initial learning rate, k is the current training step, and K is the total number of training steps, set to 307,200. We also maintain an exponential moving average with a momentum of 0.999. Table 10 (complete hyper-parameter setting): CIFAR-100: WRN-28-8, weight decay 1e-3; STL-10: WRN-37-2, weight decay 5e-4; Imagewoof: WRN-37-2, weight decay 5e-4; ImageNet: ResNet-50, weight decay 3e-4. Batch size 64 (128 for ImageNet); unlabeled data ratio µ = 7 (1 for ImageNet); confidence threshold τ = 0.95 (0.7 for ImageNet); learning rate η = 0.03; SGD momentum 0.9; EMA momentum 0.999; unsupervised loss weight λ = 1. |
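The unsupervised loss in Algorithm 1 (pseudo-label from the weakly augmented view, cross-entropy on the strongly augmented view, masked by a confidence threshold) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the `softmax` helper and the function name are assumptions, and the paper's `logit` denotes the softmax probability.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fixmatch_unsup_loss(weak_logits, strong_logits, tau=0.95):
    """Unsupervised FixMatch loss (line 6 of Algorithm 1):
    pseudo-labels v_i come from the weakly augmented view alpha(U_i);
    cross-entropy is computed on the strongly augmented view A(U_i),
    kept only where the weak-view confidence exceeds tau."""
    probs_weak = softmax(weak_logits)             # predictions on alpha(U_i)
    pseudo = probs_weak.argmax(axis=1)            # v_i = argmax_j logit_j
    conf = probs_weak.max(axis=1)                 # confidence of pseudo-label
    mask = (conf >= tau).astype(float)            # indicator 1{conf >= tau}
    probs_strong = softmax(strong_logits)         # predictions on A(U_i)
    ce = -np.log(probs_strong[np.arange(len(pseudo)), pseudo])
    return (mask * ce).mean()                     # average over the mu*B batch
```

Lowering τ admits more (less confident) pseudo-labels into the loss, which is the knob the paper sets to 0.95 for CIFAR-100/STL-10/Imagewoof and 0.7 for ImageNet.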
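The training schedule described above (cosine decay η = η₀ cos(7πk/(16K)) and an EMA of the model parameters with momentum 0.999) can be sketched in a few lines. Function names are illustrative, not from the paper's code.

```python
import math

def cosine_lr(k, K=307200, eta0=0.03):
    """Cosine learning rate decay used by FixMatch:
    eta = eta0 * cos(7 * pi * k / (16 * K)),
    decaying from eta0 at k=0 to eta0*cos(7*pi/16) at k=K."""
    return eta0 * math.cos(7 * math.pi * k / (16 * K))

def ema_update(ema_params, params, m=0.999):
    """Exponential moving average of model parameters with momentum m."""
    return [m * e + (1 - m) * p for e, p in zip(ema_params, params)]
```

Note the schedule never reaches zero: at k = K the rate is still η₀ cos(7π/16) ≈ 0.195 η₀, a deliberate property of the 7/16 fraction in the FixMatch recipe.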