kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions

Authors: Parastoo Pashmchi, Jérôme Benoit, Motonobu Kanagawa

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments illustrate the performance of kNNSampler. The code for kNNSampler is made publicly available. We report experimental results on synthetic data in Section 4 and on real solar-power data in Section 5.
Researcher Affiliation | Collaboration | Parastoo Pashmchi (SAP Labs France E-Mobility Research; EURECOM, Sophia Antipolis, France); Jérôme Benoit (SAP Labs France E-Mobility Research); Motonobu Kanagawa (EURECOM, Sophia Antipolis, France)
Pseudocode | Yes |
    Algorithm 1: kNNSampler
    Input: number of nearest neighbors k; covariates x̃1, . . . , x̃m ∈ X with missing responses; observed covariate-response pairs (x1, y1), . . . , (xn, yn) ∈ X × Y.
    Output: imputed responses ŷ1,imp, . . . , ŷm,imp ∈ Y.
    for i = 1 to m do
        ŷi,imp := yj, where j ∈ {1, . . . , n} is sampled uniformly from NN(x̃i, k, Xn) in equation 4, the indices of the k nearest neighbors of x̃i in Xn = {x1, . . . , xn}.
    end
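The loop above can be sketched in Python. This is an illustrative re-implementation built on scikit-learn's NearestNeighbors, not the authors' code (which is released at https://github.com/SAP/knn-sampler and may differ in detail); the function name `knn_sampler` and the toy data are our own.

```python
# Sketch of kNNSampler (Algorithm 1): impute each missing response by
# sampling uniformly from the responses of the query point's k nearest
# observed covariates. Hypothetical helper; not the official implementation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_sampler(X_obs, y_obs, X_miss, k, rng=None):
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k).fit(X_obs)
    # indices of the k nearest observed covariates for each query point
    _, idx = nn.kneighbors(X_miss)
    # draw one of the k neighbors uniformly at random for each query
    picks = idx[np.arange(len(X_miss)), rng.integers(0, k, size=len(X_miss))]
    return y_obs[picks]

# toy usage: 100 observed pairs on a sine curve, 2 missing responses
X_obs = np.linspace(0.0, 1.0, 100).reshape(-1, 1)
y_obs = np.sin(2 * np.pi * X_obs[:, 0])
X_miss = np.array([[0.25], [0.75]])
y_imp = knn_sampler(X_obs, y_obs, X_miss, k=5, rng=0)
```

Because the neighbor is sampled rather than averaged, repeated calls yield a distribution of imputations, which is the point of the method.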
Open Source Code | Yes | The code for kNNSampler is made publicly available at https://github.com/SAP/knn-sampler.
Open Datasets | Yes | We use a Kaggle dataset (https://www.kaggle.com/datasets/samuelkamau/solar-data/) that contains solar panel DC powers (responses) and the corresponding irradiations (covariates), totaling 67,698 covariate-response pairs.
Dataset Splits | Yes | The number k of nearest neighbors is a hyperparameter of kNNSampler. The theoretical and empirical results below indicate that k should not be fixed to a prespecified value (e.g., k = 5), but should be chosen depending on the available data. One way is to perform cross-validation for kNN regression on the data (x1, y1), . . . , (xn, yn) and select the candidate k that minimizes the mean-square error on held-out observed responses, averaged over different training-validation splits. In particular, the present work uses Leave-One-Out Cross-Validation (LOOCV) via the fast computation method recently proposed by Kanagawa (2024).
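The LOOCV selection of k described above can be sketched as follows. This is a plain LOOCV implementation, not the fast method of Kanagawa (2024) that the paper actually uses; it relies on the simple observation that the leave-one-out k-NN prediction at a training point is obtained from its k+1 nearest neighbors with the point itself (its own zero-distance neighbor) dropped. The function name `loocv_select_k` is our own.

```python
# Select k for kNN regression by leave-one-out cross-validation.
# Assumes continuous covariates, so each point's nearest neighbor in the
# training set is itself; a single neighbor query covers all candidate k's.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def loocv_select_k(X, y, candidates):
    kmax = max(candidates)
    nn = NearestNeighbors(n_neighbors=kmax + 1).fit(X)
    _, idx = nn.kneighbors(X)
    idx = idx[:, 1:]  # drop each point's own index (the self-match)
    errors = {}
    for k in candidates:
        pred = y[idx[:, :k]].mean(axis=1)  # leave-one-out k-NN estimate
        errors[k] = np.mean((pred - y) ** 2)
    return min(errors, key=errors.get)

# toy usage on noisy sine data
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
best_k = loocv_select_k(X, y, candidates=[1, 3, 5, 10, 20])
```

Kanagawa's method avoids the naive cost of refitting per held-out point; the sketch above already amortizes the neighbor search across all candidate k values.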
Hardware Specification | No | The paper does not report hardware details (e.g., GPU models, CPU types, or memory) for running the experiments.
Software Dependencies | No | kNNImputer (Troyanskaya et al., 2001) is one of the most widely used imputation methods, owing to its simplicity and availability in popular software packages such as scikit-learn (Pedregosa et al., 2011). The paper mentions scikit-learn but does not specify its version, nor versions of other software dependencies.
Experiment Setup | Yes | We set the number k of nearest neighbours to k = 5, which is the default setting in scikit-learn and widely used in practice. We use the authors' recommended settings: inverse temperature τ = 50 and kernel bandwidth h = 0.03. The number k of nearest neighbours for kNNSampler is determined by the fast leave-one-out cross-validation method of Kanagawa (2024), using the observed covariate-response pairs. Specifically, we set n ∈ {2800, 4800, 6800, 8800, 10800}.