Regularization via Mass Transportation
Authors: Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, Peyman Mohajerin Esfahani
JMLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our theoretical out-of-sample guarantees through simulated and empirical experiments. Numerical experiments are reported in Section 6. |
| Researcher Affiliation | Academia | Soroosh Shafieezadeh-Abadeh (EMAIL) and Daniel Kuhn (EMAIL), Risk Analytics and Optimization Chair, EPFL, Switzerland; Peyman Mohajerin Esfahani (EMAIL), Delft Center for Systems and Control, TU Delft, The Netherlands |
| Pseudocode | No | The paper describes algorithms such as stochastic proximal gradient descent (W_m^{k+1} = prox_{η_k ρ}(W_m^k − η_k ∇_{W_m} ℓ(h(x̂_{i_k}; W_{[M]}^k), ŷ_{i_k})), m ∈ [M]) and describes experimental procedures (e.g., a training phase split into epochs with periodic step-size reduction), but does not present these in structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All experiments are run on an Intel XEON CPU (3.40GHz), and the corresponding codes are made publicly available at https://github.com/sorooshafiee/Regularization-via-Transportation. |
| Open Datasets | Yes | We showcase the power of regularization via mass transportation in various applications based on standard datasets from the literature. ... from the MNIST database (LeCun et al., 1998) ... from the UCI repository (Bache and Lichman, 2013). ... PASCAL VOC 2007 dataset (Everingham et al., 2010) ... ImageNet dataset (Krizhevsky et al., 2012) ... synthetic threenorm classification problem (Breiman, 1996). |
| Dataset Splits | Yes | In each trial, we randomly select 500 images to train the DRSVM model (20) and use the remaining images for testing. ... In each trial, we randomly select 75% of the data for training and the remaining 25% for testing. ... PASCAL VOC 2007 dataset ... pre-partitioned into 25% for training, 25% for validation and 50% for testing. ... In each trial we generate N training samples for some N ∈ {10, . . . , 90} ∪ {100, . . . , 1,000} as well as 10^5 test samples. |
| Hardware Specification | Yes | All experiments are run on an Intel XEON CPU (3.40GHz), and the corresponding codes are made publicly available at https://github.com/sorooshafiee/Regularization-via-Transportation. |
| Software Dependencies | Yes | All optimization problems are implemented in Python and solved with Gurobi 7.5.1 |
| Experiment Setup | Yes | In the first experiment we optimize over linear hypotheses and use the separable transportation metric (16) involving the ∞-norm on the input space. All results are averaged over 100 independent trials. In each trial, we randomly select 500 images to train the DRSVM model (20) and use the remaining images for testing. The correct classification rate (CCR) on the test data, averaged across all 100 trials, is visualized in Figure 1 as a function of the Wasserstein radius ρ for each κ ∈ {0.1, 0.25, 0.5, 0.75, ∞}. The best out-of-sample CCR is obtained for κ = 0.25 uniformly across all Wasserstein radii, and performance deteriorates significantly when κ is reduced or increased. ... All free parameters of the resulting DRSVM model are restricted to finite search grids in order to ease the computational burden of cross validation. Specifically, we select the Wasserstein radius ρ from within {b · 10^−e : b ∈ {1, 5}, e ∈ {1, 2, 3, 4}} and the label flipping cost κ from within {0.1, 0.25, 0.5, 0.75, ∞}. Moreover, we select the degree d of the polynomial kernel from within {1, 2, 3, 4, 5} and the peakedness parameter γ of the Laplacian and Gaussian kernels from within { 1 25}. ... The Wasserstein radius ρ and the label flipping cost κ in the DRSVM as well as the regularization weight ρ in the RSVM are estimated via stratified 5-fold cross validation. ... At the beginning we preprocess the entire dataset by resizing each image to 256 × 256 pixels and extracting the central patch of 244 × 244 pixels. ... We tune the Wasserstein radius ρ ∈ {b · 10^−e : b ∈ {1, . . . , 9}, e ∈ {2, 3, 4}} and the label flipping cost κ ∈ {0.1, 0.2, . . . , 1, ∞} via the holdout method using the validation data. ... we replace the original M-th layer of the network with a new fully connected layer characterized by a parameter matrix W_M ∈ R^{20×1000}, and we set σ_M to the Sigmoid activation function. ... We use the stochastic proximal gradient descent algorithm of Section 3.4 to tune W_M, including an additional momentum term with weight 0.9. As in (Krizhevsky et al., 2012), we split the training phase into 100 epochs, each corresponding to a complete pass through the training dataset in a random order. As AlexNet requires input images of size 244 × 244, in each iteration we extract a random patch of 244 × 244 pixels from the current image and flip it horizontally at random. This procedure effectively augments the training dataset. The initial step size is set to 10^−3 and then reduced by a factor of 10 after every 7 epochs. The algorithm terminates after 100 epochs. |
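The training recipe quoted in the Experiment Setup row (stochastic proximal gradient descent with a 0.9 momentum term, initial step size 10^−3 cut tenfold every 7 epochs) can be sketched as follows. This is a minimal illustration, not the authors' released code: it assumes a Frobenius-norm regularizer for the proximal step, whereas the paper's actual proximal operator depends on the chosen transportation metric, and the function names are hypothetical.

```python
import numpy as np

def prox_frobenius(W, t):
    """Proximal operator of t * ||W||_F (block soft-thresholding).
    Assumed regularizer, for illustration only."""
    norm = np.linalg.norm(W)  # Frobenius norm
    if norm <= t:
        return np.zeros_like(W)
    return (1.0 - t / norm) * W

def step_size(epoch, eta0=1e-3, drop_every=7, factor=0.1):
    """Step-size schedule quoted in the paper: start at 1e-3,
    reduce by a factor of 10 after every 7 epochs."""
    return eta0 * factor ** (epoch // drop_every)

def prox_sgd_step(W, grad, velocity, eta, rho, momentum=0.9):
    """One stochastic proximal gradient update with heavy-ball momentum:
    take a momentum-corrected gradient step, then apply the prox."""
    velocity = momentum * velocity - eta * grad
    W = prox_frobenius(W + velocity, eta * rho)
    return W, velocity
```

With ρ = 0 the proximal step is the identity and the update reduces to plain SGD with momentum, which makes the role of the Wasserstein radius as a regularization weight easy to see.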