$k$-Mixup Regularization for Deep Learning via Optimal Transport
Authors: Kristjan Greenewald, Anming Gu, Mikhail Yurochkin, Justin Solomon, Edward Chien
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results show that training with k-mixup further improves generalization and robustness across several network architectures and benchmark datasets of differing modalities. For the wide variety of real datasets considered, the performance gains of k-mixup over standard mixup are similar to or larger than the gains of mixup itself over standard ERM after hyperparameter optimization. In several instances, in fact, k-mixup achieves gains in settings where standard mixup has negligible to zero improvement over ERM. |
| Researcher Affiliation | Collaboration | Kristjan Greenewald EMAIL MIT-IBM Watson AI Lab, IBM Research Anming Gu EMAIL Boston University Mikhail Yurochkin EMAIL MIT-IBM Watson AI Lab, IBM Research Justin Solomon EMAIL Massachusetts Institute of Technology Edward Chien EMAIL Boston University |
| Pseudocode | Yes | Figure 5: k-mixup implementation. # y1, y2 should be one-hot vectors for (x1, y1), (x2, y2) in zip(loader1, loader2): idx = numpy.zeros_like(y1) for i in range(x1.shape[0] // k): cost = scipy.spatial.distance_matrix(x1[i * k:(i+1) * k], x2[i * k:(i+1) * k]) _, ix = scipy.optimize.linear_sum_assignment(cost) idx[i * k:(i+1) * k] = ix + i * k x2 = x2[idx] y2 = y2[idx] lam = numpy.random.beta(alpha, alpha) x = Variable(lam * x1 + (1 - lam) * x2) y = Variable(lam * y1 + (1 - lam) * y2) optimizer.zero_grad() loss(net(x), y).backward() optimizer.step() |
| Open Source Code | Yes | Python code for applying k-mixup to CIFAR10 can be found at https://github.com/AnmingGu/kmixup-cifar10. |
| Open Datasets | Yes | Our most extensive testing was done on image datasets... MNIST (LeCun & Cortes, 2010)... Tiny ImageNet... UCI datasets (Dua & Graff, 2017)... Speech dataset: Google Speech Commands (Warden, 2018). |
| Dataset Splits | Yes | Our most extensive testing was done on image datasets... In Table 1, we show our summarized error rates across various benchmark datasets and network architectures... For Tiny ImageNet, we again see that for each α / ξ, the best generalization performance is for some k > 1 (note that the ξ are larger for Tiny ImageNet than MNIST due to differences in normalization and size of the images)... Figure 6: Training convergence of k = 1 and k = 32-mixup on CIFAR-10, averaged over 20 random trials... Speech dataset. Performance is also tested on a speech dataset: Google Speech Commands (Warden, 2018). We augmented the data in the same way as Zhang et al. (2018), i.e. we sample the spectrograms from the data using a sampling rate of 16 kHz and equalize their sizes at 160 × 101. |
| Hardware Specification | No | We have included a brief PyTorch pseudocode in Section 5.1 below and note that with CIFAR-10 and k = 32, the use of k-mixup added about one second per epoch on GPU. Note that, on our hardware, the overall cost of an epoch is greater than 30 seconds. The provided text mentions "on GPU" and "on our hardware" but lacks specific models or detailed specifications. |
| Software Dependencies | No | We have included a brief PyTorch pseudocode in Section 5.1 below and note that with CIFAR-10 and k = 32, the use of k-mixup added about one second per epoch on GPU. ... idx = numpy.zeros_like(y1) cost = scipy.spatial.distance_matrix(x1[i * k:(i+1) * k], x2[i * k:(i+1) * k]) _, ix = scipy.optimize.linear_sum_assignment(cost) ... The paper mentions PyTorch, numpy, and scipy but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Unless otherwise stated, our training is done over 200 epochs via a standard SGD optimizer, with learning rate 0.1 decreased at epochs 100 and 150, momentum 0.9, and weight decay 10^-4. ... We show here sweeps over k and ξ; choosing k and ξ in practice is discussed in Supplement Section A. ... For Iris, we used a 3-layer network with 120 and 84 hidden units; for Breast Cancer, Abalone, and Phishing, we used a 4-layer network with 120, 120, and 84 hidden units; and lastly, for Arrhythmia we used a 5-layer network with 120, 120, 36, and 84 hidden units. For these datasets we used a learning rate of 0.005 instead of 0.1. |
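The pseudocode excerpt above (Figure 5 of the paper) can be turned into a self-contained, runnable sketch of the k-mixup matching step: each consecutive group of k samples from one batch is optimally paired with a group of k samples from a second batch via `scipy.optimize.linear_sum_assignment`, and the matched pairs are mixed with a Beta(α, α) weight. The function name `k_mixup_batch` and its signature are our own illustrative choices, not from the paper; the training-loop parts (optimizer, loss, network) are omitted.

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.optimize import linear_sum_assignment

def k_mixup_batch(x1, y1, x2, y2, k, alpha, rng=None):
    """Sketch of k-mixup's matching + mixing step (illustrative name).

    Within each consecutive group of k samples, reorder batch 2 by the
    min-cost (optimal transport) pairing against batch 1, then form
    convex combinations with a Beta(alpha, alpha) weight.
    y1, y2 are assumed to be one-hot label matrices.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = x1.shape[0]
    idx = np.arange(n)
    flat1 = x1.reshape(n, -1)  # flatten features for the cost matrix
    flat2 = x2.reshape(n, -1)
    for i in range(n // k):
        sl = slice(i * k, (i + 1) * k)
        cost = distance_matrix(flat1[sl], flat2[sl])  # pairwise L2 costs
        _, col = linear_sum_assignment(cost)          # optimal pairing
        idx[sl] = col + i * k
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2[idx]
    y = lam * y1 + (1 - lam) * y2[idx]  # soft labels from one-hot inputs
    return x, y, lam

# Toy usage: batch of 8 points in R^4 with 2 classes, groups of k = 4.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(8, 4))
x2 = rng.normal(size=(8, 4))
y1 = np.eye(2)[rng.integers(0, 2, 8)]
y2 = np.eye(2)[rng.integers(0, 2, 8)]
x, y, lam = k_mixup_batch(x1, y1, x2, y2, k=4, alpha=1.0, rng=rng)
```

Note that k = 1 recovers standard mixup (each sample is trivially matched to one partner), while larger k lets the matching respect the local geometry of the two batches, which is the mechanism the paper credits for the improved generalization.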