Nearest Neighbor Sampling for Covariate Shift Adaptation
Authors: François Portier, Lionel Truquet, Ikko Yamane
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The main purpose of the experiments is to compare our k-NN-CSA approach with several state-of-the-art competitors when facing multiple situations from mean estimation to empirical risk minimization with synthetic and real-world data. We conduct experiments in three setups, detailed below, with different sample sizes n (= m) and data dimensionalities d: (n, d) ∈ {50, 100, 500, 1000, 5000, 10000} × {1, 2, 5, 10}. Each experiment is repeated 50 times with different random seeds. The results are presented in Figure 1. Figure 2 shows the comparison in running times. We present the estimation errors for Experiment E2 in Figure 3. The results are summarized in Figure 5. Table 1 shows the obtained MSEs and classification accuracies. |
| Researcher Affiliation | Academia | François Portier EMAIL Department of Statistics, Univ Rennes, Ensai, CNRS, CREST UMR 9194, F-35000 Rennes, France. Lionel Truquet EMAIL Department of Statistics, Univ Rennes, Ensai, CNRS, CREST UMR 9194, F-35000 Rennes, France. Ikko Yamane EMAIL Department of Computer Science, Univ Rennes, Ensai, CNRS, CREST UMR 9194, F-35000 Rennes, France. |
| Pseudocode | Yes | Algorithm 1 (Conditional Sampling Adaptation). Input: conditional sampler Ŝ and target sample (X_j)_{j=1}^m. For each j ∈ {1, …, m}: Y_{n,j} ← Ŝ(X_j). // Generate a label conditioned on X_j. Return m⁻¹ Σ_{j=1}^m h(X_j, Y_{n,j}). Algorithm 2 (k-Nearest Neighbor Conditional Sampler). Input: source sample (X_i, Y_i)_{i=1}^n and target input X_j. Let (i_1, …, i_k) be the indices of the k nearest neighbors of X_j among (X_i)_{i=1}^n. Pick i* ∈ {i_1, …, i_k} uniformly at random. Return Y_{n,j} := Y_{i*}. |
| Open Source Code | No | The paper mentions using third-party Python modules and toolboxes like 'cKDTree' from SciPy and the 'Awesome Domain Adaptation Python Toolbox (ADAPT)' for implementations of other methods, but it does not provide an explicit statement of code release or a link to a repository for the authors' own described methodology. |
| Open Datasets | Yes | We use regression benchmark datasets, diabetes [4], california (Pace and Barry, 1997) [5], and classification datasets, twonorm (Breiman, 1996) [6] and breast_cancer [4]. 4. Available at https://archive.ics.uci.edu/ml/index.php. 5. Available at https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html. 6. Available at https://www.cs.utoronto.ca/~delve/data/datasets.html. |
| Dataset Splits | Yes | We conduct experiments in three setups, detailed below, with different sample sizes n (= m) and data dimensionalities d: (n, d) ∈ {50, 100, 500, 1000, 5000, 10000} × {1, 2, 5, 10}. Each experiment is repeated 50 times with different random seeds. We split the original data into training and test sets and simulate covariate shift by rejection sampling from the test set with rejection probability determined according to the value of a covariate. Table 3: Basic information of the datasets: source sample size n, target sample size m. |
| Hardware Specification | No | All the computations were performed on the Grid'5000 cluster (Balouek et al., 2013). This names a specific cluster, but it does not provide details about the specific hardware components (CPU models, GPU models, memory) used within that cluster. |
| Software Dependencies | No | We use the Python module cKDTree (Archibald, 2008) from SciPy (Virtanen et al., 2020) for nearest neighbor search in our methods. For KMM-W, KLIEP-W, and RuLSIF-W, we used the implementations from the Awesome Domain Adaptation Python Toolbox (ADAPT) (de Mathelin et al., 2021). While software packages are named, specific version numbers for these tools (e.g., SciPy version, ADAPT version) are not provided in the text. |
| Experiment Setup | Yes | We conduct experiments in three setups, detailed below, with different sample sizes n (= m) and data dimensionalities d: (n, d) ∈ {50, 100, 500, 1000, 5000, 10000} × {1, 2, 5, 10}. Each experiment is repeated 50 times with different random seeds. For the methods using Gaussian basis functions (KLIEP-W, KLIEP100-W, RuLSIF-W, RuLSIF100-W), we use 5-fold cross-validation for choosing the Gaussian bandwidth from {0.001, 0.01, 0.1, 1, 10}. KMM-W does not offer a way to do cross-validation, and we fixed the bandwidth to 1. In Experiment E3, we perform ordinary least squares. In Experiment E4, we apply ridge regression and logistic regression. |
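The two algorithms quoted in the Pseudocode row combine into a simple procedure: for each target input, look up its k nearest source points, pick one uniformly at random, and average h over the resulting input-label pairs. A minimal sketch of that procedure is below, using SciPy's `cKDTree` for neighbor search as the paper does; the function name `knn_csa_estimate` and its parameter defaults are illustrative, not taken from the authors' code (which is not released).

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_csa_estimate(X_source, Y_source, X_target, h, k=5, seed=0):
    """Estimate the target-distribution mean of h(X, Y) by k-NN
    conditional sampling: for each target point, draw the label of
    one of its k nearest source neighbors uniformly at random
    (Algorithm 2), then average h over the target sample (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(X_source)                 # neighbor search structure
    _, idx = tree.query(X_target, k=k)      # neighbor indices, shape (m, k)
    idx = np.asarray(idx).reshape(len(X_target), -1)
    # Uniform pick among the k neighbors of each target point.
    cols = rng.integers(0, idx.shape[1], size=idx.shape[0])
    picks = idx[np.arange(idx.shape[0]), cols]
    # Monte Carlo average over the target sample.
    return float(np.mean([h(x, Y_source[i])
                          for x, i in zip(X_target, picks)]))
```

For instance, with a noiseless source relation Y = X in one dimension and a shifted target input distribution, the estimate of the target mean of Y should track the mean of the target inputs rather than that of the source inputs.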