Nearest Neighbor Sampling for Covariate Shift Adaptation
Authors: François Portier, Lionel Truquet, Ikko Yamane
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The main purpose of the experiments is to compare our k-NN-CSA approach with several state-of-the-art competitors when facing multiple situations from mean estimation to empirical risk minimization with synthetic and real-world data. We conduct experiments in three setups, detailed below, with different sample sizes n (= m) and data dimensionalities d: (n, d) ∈ {50, 100, 500, 1000, 5000, 10000} × {1, 2, 5, 10}. Each experiment is repeated 50 times with different random seeds. The results are presented in Figure 1. Figure 2 shows the comparison in running times. We present the estimation errors for Experiment E2 in Figure 3. The results are summarized in Figure 5. Table 1 shows the obtained MSEs and classification accuracies. |
| Researcher Affiliation | Academia | François Portier EMAIL Department of Statistics, Univ Rennes, Ensai, CNRS, CREST UMR 9194, F-35000 Rennes, France. Lionel Truquet EMAIL Department of Statistics, Univ Rennes, Ensai, CNRS, CREST UMR 9194, F-35000 Rennes, France. Ikko Yamane EMAIL Department of Computer Science, Univ Rennes, Ensai, CNRS, CREST UMR 9194, F-35000 Rennes, France. |
| Pseudocode | Yes | Algorithm 1 (Conditional Sampling Adaptation). Input: conditional sampler Ŝ and target sample (X_j)_{j=1}^m. For each j ∈ {1, …, m}: Y_{n,j} ← Ŝ(X_j). // Generate a label conditioned on X_j. Return m⁻¹ Σ_{j=1}^m h(X_j, Y_{n,j}). Algorithm 2 (k-Nearest Neighbor Conditional Sampler). Input: source sample (X_i, Y_i)_{i=1}^n and target input X_j. Let (i_1, …, i_k) be the indices of the k nearest neighbors of X_j among (X_i)_{i=1}^n. Pick i* ∈ {i_1, …, i_k} uniformly at random. Return Y_{n,j} := Y_{i*}. |
| Open Source Code | No | The paper mentions using third-party Python modules and toolboxes like 'cKDTree' from SciPy and the 'Awesome Domain Adaptation Python Toolbox (ADAPT)' for implementations of other methods, but it does not provide an explicit statement of code release or a link to a repository for the authors' own described methodology. |
| Open Datasets | Yes | We use regression benchmark datasets, diabetes [4], california (Pace and Barry, 1997) [5], and classification datasets, twonorm (Breiman, 1996) [6] and breast_cancer [4]. 4. Available at https://archive.ics.uci.edu/ml/index.php. 5. Available at https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html. 6. Available at https://www.cs.utoronto.ca/~delve/data/datasets.html. |
| Dataset Splits | Yes | We conduct experiments in three setups, detailed below, with different sample sizes n (= m) and data dimensionalities d: (n, d) ∈ {50, 100, 500, 1000, 5000, 10000} × {1, 2, 5, 10}. Each experiment is repeated 50 times with different random seeds. We split the original data into training and test sets and simulate covariate shift by rejection sampling from the test set with rejection probability determined according to the value of a covariate. Table 3: Basic information of the datasets: source sample size n, target sample size m. |
| Hardware Specification | No | All the computations were performed on the Grid'5000 cluster (Balouek et al., 2013). This names a specific cluster, but it does not provide details about the specific hardware components (CPU models, GPU models, memory) used within that cluster. |
| Software Dependencies | No | We use the Python module cKDTree (Archibald, 2008) from SciPy (Virtanen et al., 2020) for nearest neighbor search in our methods. For KMM-W, KLIEP-W, and RuLSIF-W, we used the implementations from the Awesome Domain Adaptation Python Toolbox (ADAPT) (de Mathelin et al., 2021). While software packages are named, specific version numbers for these tools (e.g., SciPy version, ADAPT version) are not provided in the text. |
| Experiment Setup | Yes | We conduct experiments in three setups, detailed below, with different sample sizes n (= m) and data dimensionalities d: (n, d) ∈ {50, 100, 500, 1000, 5000, 10000} × {1, 2, 5, 10}. Each experiment is repeated 50 times with different random seeds. For the methods using Gaussian basis functions (KLIEP-W, KLIEP100-W, RuLSIF-W, RuLSIF100-W), we use 5-fold cross-validation for choosing the Gaussian bandwidth from {0.001, 0.01, 0.1, 1, 10}. KMM-W does not offer a way to do cross-validation, and we fixed the bandwidth to 1. In Experiment E3, we perform ordinary least squares. In Experiment E4, we apply ridge regression and logistic regression. |
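The two algorithms quoted in the Pseudocode row combine into a simple procedure: for each target input, look up its k nearest source points, pick one uniformly at random, and average h over the resulting input-label pairs. A minimal sketch of that procedure is below, using SciPy's `cKDTree` for neighbor search as the paper does; the function name `knn_csa_estimate` and its parameter defaults are illustrative, not taken from the authors' code (which is not released).

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_csa_estimate(X_source, Y_source, X_target, h, k=5, seed=0):
    """Estimate the target-distribution mean of h(X, Y) by k-NN
    conditional sampling: for each target point, draw the label of
    one of its k nearest source neighbors uniformly at random
    (Algorithm 2), then average h over the target sample (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(X_source)                 # neighbor search structure
    _, idx = tree.query(X_target, k=k)      # neighbor indices, shape (m, k)
    idx = np.asarray(idx).reshape(len(X_target), -1)
    # Uniform pick among the k neighbors of each target point.
    cols = rng.integers(0, idx.shape[1], size=idx.shape[0])
    picks = idx[np.arange(idx.shape[0]), cols]
    # Monte Carlo average over the target sample.
    return float(np.mean([h(x, Y_source[i])
                          for x, i in zip(X_target, picks)]))
```

For instance, with a noiseless source relation Y = X in one dimension and a shifted target input distribution, the estimate of the target mean of Y should track the mean of the target inputs rather than that of the source inputs.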