Domain Generalization by Marginal Transfer Learning
Authors: Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, Clayton Scott
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present two formal models of data generation, corresponding notions of risk, and distribution-free generalization error analysis. By focusing our attention on kernel methods, we also provide more quantitative results and a universally consistent algorithm. An efficient implementation is provided for this algorithm, which is experimentally compared to a pooling strategy on one synthetic and three real-world data sets. |
| Researcher Affiliation | Collaboration | Gilles Blanchard, Université Paris-Saclay, CNRS, Inria, Laboratoire de mathématiques d'Orsay; Aniket Anand Deshmukh, Microsoft AI & Research; Urun Dogan, Microsoft AI & Research; Gyemin Lee, Dept. of Electronic and IT Media Engineering, Seoul National University of Science and Technology; Clayton Scott, Electrical and Computer Engineering, Statistics, University of Michigan |
| Pseudocode | Yes | Algorithm 1: Synthetic Data Generation |
| Open Source Code | Yes | Code is available at https://github.com/aniketde/DomainGeneralizationMarginal (stated three times in the paper). |
| Open Datasets | Yes | To illustrate the flow cytometry gating problem, we use the NDD data set from the FlowCAP-I challenge. We test our method in the regression setting using the Parkinson's disease telemonitoring data set. We thank Srinagesh Sharma and James Cutler for providing us with their simulated data, and refer the reader to their paper for more details on the application (Sharma and Cutler, 2015). We used a data set that is part of the FlowCAP Challenges, where the ground truth labels have been supplied by human experts (Aghaeepour et al., 2013). We used the so-called Normal Donors (NDD) data set. |
| Dataset Splits | Yes | For each synthetic data set, the test set contains 10 tasks and each task contains one million data points. We randomly select 7 test users and then vary the number of training users N from 10 to 35 in steps of 5, and we also vary the number of training examples n per user from 20 to 100. To demonstrate this idea, we analyzed the data from Sharma and Cutler (2015) for T = 50 launches, viewing up to 40 as training data and 10 as testing. We randomly selected 10 tasks to serve as the test tasks. These tasks were removed from the pool of eligible training tasks. We varied the number of training tasks from 5 to 20 with an additive step size of 5, and the number of training examples per task from 1024 to 16384 with a multiplicative step size of 2. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments. It mentions 'speedup' and 'efficient implementation' but does not specify CPU/GPU models, memory, or other hardware specifications. |
| Software Dependencies | No | The paper mentions software like 'Liblinear (Fan et al., 2008)' and 'LIBSVM' (referencing Chang and Lin, 2011), but does not provide specific version numbers for these or any other software components used in their implementation. |
| Experiment Setup | Yes | The bandwidth σ of each Gaussian kernel and the regularization parameter λ of the machines were selected by grid search. For model selection, five-fold cross-validation was used... repeated 5 times over independent random splits into folds... The grid used for kernels was σ ∈ [10⁻², 10⁴] with logarithmic spacing, and the grid used for the regularization parameter was λ ∈ [10⁻¹, 10¹] with logarithmic spacing. ... hinge loss is employed, and one regression problem (Y ⊆ ℝ), where the ε-insensitive loss is employed. ... The random Fourier features speedup is used ... The Nyström approximation was used. |
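The Experiment Setup row describes selecting σ and λ by grid search over logarithmically spaced grids, using five-fold cross-validation repeated over independent random splits. A minimal sketch of that selection loop, using kernel ridge regression as a stand-in for the paper's kernel machines (the function names and the ridge solver are illustrative, not the authors' implementation):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), computed pairwise.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cv_grid_search(X, y, sigmas, lams, n_folds=5, n_repeats=5, seed=0):
    """Pick (sigma, lambda) minimizing mean CV error over repeated splits."""
    rng = np.random.default_rng(seed)
    n = len(X)
    best_sigma, best_lam, best_err = None, None, np.inf
    for sigma in sigmas:
        for lam in lams:
            errs = []
            for _ in range(n_repeats):  # independent random splits into folds
                folds = np.array_split(rng.permutation(n), n_folds)
                for k in range(n_folds):
                    te = folds[k]
                    tr = np.concatenate(
                        [folds[j] for j in range(n_folds) if j != k])
                    # Kernel ridge regression fit on the training folds.
                    K = gaussian_kernel(X[tr], X[tr], sigma)
                    alpha = np.linalg.solve(K + lam * np.eye(len(tr)), y[tr])
                    pred = gaussian_kernel(X[te], X[tr], sigma) @ alpha
                    errs.append(np.mean((pred - y[te]) ** 2))
            if np.mean(errs) < best_err:
                best_sigma, best_lam, best_err = sigma, lam, np.mean(errs)
    return best_sigma, best_lam, best_err

# Grids matching the paper's stated ranges (grid density is an assumption).
sigmas = np.logspace(-2, 4, 7)
lams = np.logspace(-1, 1, 3)
```

The paper's machines use hinge loss (classification) and ε-insensitive loss (regression) rather than the squared loss above; only the grid/CV skeleton carries over.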