$k$-Mixup Regularization for Deep Learning via Optimal Transport
Authors: Kristjan Greenewald, Anming Gu, Mikhail Yurochkin, Justin Solomon, Edward Chien
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results show that training with k-mixup further improves generalization and robustness across several network architectures and benchmark datasets of differing modalities. For the wide variety of real datasets considered, the performance gains of k-mixup over standard mixup are similar to or larger than the gains of mixup itself over standard ERM after hyperparameter optimization. In several instances, in fact, k-mixup achieves gains in settings where standard mixup has negligible to zero improvement over ERM. |
| Researcher Affiliation | Collaboration | Kristjan Greenewald EMAIL MIT-IBM Watson AI Lab, IBM Research Anming Gu EMAIL Boston University Mikhail Yurochkin EMAIL MIT-IBM Watson AI Lab, IBM Research Justin Solomon EMAIL Massachusetts Institute of Technology Edward Chien EMAIL Boston University |
| Pseudocode | Yes | Figure 5: k-mixup implementation. # y1, y2 should be one-hot vectors for (x1, y1), (x2, y2) in zip(loader1, loader2): idx = numpy.zeros_like(y1) for i in range(x1.shape[0] // k): cost = scipy.spatial.distance_matrix(x1[i * k:(i+1) * k], x2[i * k:(i+1) * k]) _, ix = scipy.optimize.linear_sum_assignment(cost) idx[i * k:(i+1) * k] = ix + i * k x2 = x2[idx] y2 = y2[idx] lam = numpy.random.beta(alpha, alpha) x = Variable(lam * x1 + (1 - lam) * x2) y = Variable(lam * y1 + (1 - lam) * y2) optimizer.zero_grad() loss(net(x), y).backward() optimizer.step() |
| Open Source Code | Yes | Python code for applying k-mixup to CIFAR10 can be found at https://github.com/AnmingGu/kmixup-cifar10. |
| Open Datasets | Yes | Our most extensive testing was done on image datasets... MNIST (LeCun & Cortes, 2010)... Tiny ImageNet... UCI datasets (Dua & Graff, 2017)... Speech dataset: Google Speech Commands (Warden, 2018). |
| Dataset Splits | Yes | Our most extensive testing was done on image datasets... In Table 1, we show our summarized error rates across various benchmark datasets and network architectures... For Tiny ImageNet, we again see that for each α / ξ, the best generalization performance is for some k > 1 (note that the ξ are larger for Tiny ImageNet than MNIST due to differences in normalization and size of the images)... Figure 6: Training convergence of k = 1 and k = 32-mixup on CIFAR-10, averaged over 20 random trials... Speech dataset. Performance is also tested on a speech dataset: Google Speech Commands (Warden, 2018). We augmented the data in the same way as Zhang et al. (2018), i.e. we sample the spectrograms from the data using a sampling rate of 16 kHz and equalize their sizes at 160 × 101. |
| Hardware Specification | No | We have included a brief PyTorch pseudocode in Section 5.1 below and note that with CIFAR-10 and k = 32, the use of k-mixup added about one second per epoch on GPU. Note that, on our hardware, the overall cost of an epoch is greater than 30 seconds. The provided text mentions "on GPU" and "on our hardware" but lacks specific models or detailed specifications. |
| Software Dependencies | No | We have included a brief PyTorch pseudocode in Section 5.1 below and note that with CIFAR-10 and k = 32, the use of k-mixup added about one second per epoch on GPU. ... idx = numpy.zeros_like(y1) cost = scipy.spatial.distance_matrix(x1[i * k:(i+1) * k], x2[i * k:(i+1) * k]) _, ix = scipy.optimize.linear_sum_assignment(cost) ... The paper mentions PyTorch, numpy, and scipy but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Unless otherwise stated, our training is done over 200 epochs via a standard SGD optimizer, with learning rate 0.1 decreased at epochs 100 and 150, momentum 0.9, and weight decay 10^-4. ... We show here sweeps over k and ξ; choosing k and ξ in practice is discussed in Supplement Section A. ... For Iris, we used a 3-layer network with 120 and 84 hidden units; for Breast Cancer, Abalone, and Phishing, we used a 4-layer network with 120, 120, and 84 hidden units; and lastly, for Arrhythmia we used a 5-layer network with 120, 120, 36, and 84 hidden units. For these datasets we used a learning rate of 0.005 instead of 0.1. |
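The pseudocode excerpt above (Figure 5 of the paper) can be turned into a self-contained, runnable sketch of the k-mixup matching step: each consecutive group of k samples from one batch is optimally paired with a group of k samples from a second batch via `scipy.optimize.linear_sum_assignment`, and the matched pairs are mixed with a Beta(α, α) weight. The function name `k_mixup_batch` and its signature are our own illustrative choices, not from the paper; the training-loop parts (optimizer, loss, network) are omitted.

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.optimize import linear_sum_assignment

def k_mixup_batch(x1, y1, x2, y2, k, alpha, rng=None):
    """Sketch of k-mixup's matching + mixing step (illustrative name).

    Within each consecutive group of k samples, reorder batch 2 by the
    min-cost (optimal transport) pairing against batch 1, then form
    convex combinations with a Beta(alpha, alpha) weight.
    y1, y2 are assumed to be one-hot label matrices.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = x1.shape[0]
    idx = np.arange(n)
    flat1 = x1.reshape(n, -1)  # flatten features for the cost matrix
    flat2 = x2.reshape(n, -1)
    for i in range(n // k):
        sl = slice(i * k, (i + 1) * k)
        cost = distance_matrix(flat1[sl], flat2[sl])  # pairwise L2 costs
        _, col = linear_sum_assignment(cost)          # optimal pairing
        idx[sl] = col + i * k
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2[idx]
    y = lam * y1 + (1 - lam) * y2[idx]  # soft labels from one-hot inputs
    return x, y, lam

# Toy usage: batch of 8 points in R^4 with 2 classes, groups of k = 4.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(8, 4))
x2 = rng.normal(size=(8, 4))
y1 = np.eye(2)[rng.integers(0, 2, 8)]
y2 = np.eye(2)[rng.integers(0, 2, 8)]
x, y, lam = k_mixup_batch(x1, y1, x2, y2, k=4, alpha=1.0, rng=rng)
```

Note that k = 1 recovers standard mixup (each sample is trivially matched to one partner), while larger k lets the matching respect the local geometry of the two batches, which is the mechanism the paper credits for the improved generalization.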