kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions

Authors: Parastoo Pashmchi, Jérôme Benoit, Motonobu Kanagawa

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments illustrate the performance of kNNSampler. The code for kNNSampler is made publicly available. We report experimental results on synthetic data in Section 4 and on real solar-power data in Section 5.
Researcher Affiliation | Collaboration | Parastoo Pashmchi (SAP Labs France E-Mobility Research; EURECOM, Sophia Antipolis, France); Jérôme Benoit (SAP Labs France E-Mobility Research); Motonobu Kanagawa (EURECOM, Sophia Antipolis, France)
Pseudocode | Yes |
    Algorithm 1: kNNSampler
    Input: number of nearest neighbors k; covariates x̃1, . . . , x̃m ∈ X with missing responses; observed covariate-response pairs (x1, y1), . . . , (xn, yn) ∈ X × Y.
    Output: imputed responses ŷ1,imp, . . . , ŷm,imp ∈ Y.
    for i = 1 to m do
        ŷi,imp := yj, where j ∈ {1, . . . , n} is sampled uniformly from NN(x̃i, k, Xn) in equation 4, the indices of the k nearest neighbors of x̃i in Xn = {x1, . . . , xn}.
    end
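The loop above can be sketched in Python. This is an illustrative re-implementation built on scikit-learn's NearestNeighbors, not the authors' code (which is released at https://github.com/SAP/knn-sampler and may differ in detail); the function name `knn_sampler` and the toy data are our own.

```python
# Sketch of kNNSampler (Algorithm 1): impute each missing response by
# sampling uniformly from the responses of the query point's k nearest
# observed covariates. Hypothetical helper; not the official implementation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_sampler(X_obs, y_obs, X_miss, k, rng=None):
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k).fit(X_obs)
    # indices of the k nearest observed covariates for each query point
    _, idx = nn.kneighbors(X_miss)
    # draw one of the k neighbors uniformly at random for each query
    picks = idx[np.arange(len(X_miss)), rng.integers(0, k, size=len(X_miss))]
    return y_obs[picks]

# toy usage: 100 observed pairs on a sine curve, 2 missing responses
X_obs = np.linspace(0.0, 1.0, 100).reshape(-1, 1)
y_obs = np.sin(2 * np.pi * X_obs[:, 0])
X_miss = np.array([[0.25], [0.75]])
y_imp = knn_sampler(X_obs, y_obs, X_miss, k=5, rng=0)
```

Because the neighbor is sampled rather than averaged, repeated calls yield a distribution of imputations, which is the point of the method.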
Open Source Code | Yes | The code for kNNSampler is made publicly available at https://github.com/SAP/knn-sampler.
Open Datasets | Yes | We use a Kaggle dataset (https://www.kaggle.com/datasets/samuelkamau/solar-data/) that contains solar panel DC powers (responses) and the corresponding irradiations (covariates), totaling 67,698 covariate-response pairs.
Dataset Splits | Yes | The number k of nearest neighbors is a hyperparameter of kNNSampler. The theoretical and empirical results below indicate that k should not be fixed to a prespecified value (e.g., k = 5), but should be chosen depending on the available data. One way is to perform cross-validation for kNN regression on the data (x1, y1), . . . , (xn, yn) and select the candidate k that minimizes the mean-square error on held-out observed responses, averaged over different training-validation splits. In particular, the present work uses Leave-One-Out Cross-Validation (LOOCV) via the fast computation method recently proposed by Kanagawa (2024).
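The LOOCV selection of k described above can be sketched as follows. This is a plain LOOCV implementation, not the fast method of Kanagawa (2024) that the paper actually uses; it relies on the simple observation that the leave-one-out k-NN prediction at a training point is obtained from its k+1 nearest neighbors with the point itself (its own zero-distance neighbor) dropped. The function name `loocv_select_k` is our own.

```python
# Select k for kNN regression by leave-one-out cross-validation.
# Assumes continuous covariates, so each point's nearest neighbor in the
# training set is itself; a single neighbor query covers all candidate k's.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def loocv_select_k(X, y, candidates):
    kmax = max(candidates)
    nn = NearestNeighbors(n_neighbors=kmax + 1).fit(X)
    _, idx = nn.kneighbors(X)
    idx = idx[:, 1:]  # drop each point's own index (the self-match)
    errors = {}
    for k in candidates:
        pred = y[idx[:, :k]].mean(axis=1)  # leave-one-out k-NN estimate
        errors[k] = np.mean((pred - y) ** 2)
    return min(errors, key=errors.get)

# toy usage on noisy sine data
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
best_k = loocv_select_k(X, y, candidates=[1, 3, 5, 10, 20])
```

Kanagawa's method avoids the naive cost of refitting per held-out point; the sketch above already amortizes the neighbor search across all candidate k values.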
Hardware Specification | No | The paper does not report hardware details (e.g., GPU models, CPU types, or memory) for running the experiments.
Software Dependencies | No | kNNImputer (Troyanskaya et al., 2001) is one of the most widely used imputation methods, owing to its simplicity and availability in popular software packages such as scikit-learn (Pedregosa et al., 2011). The paper mentions scikit-learn but does not specify its version, nor versions of other software dependencies.
Experiment Setup | Yes | We set the number k of nearest neighbours to k = 5, which is the default setting in scikit-learn and widely used in practice. We use the authors' recommended settings: inverse temperature τ = 50 and kernel bandwidth h = 0.03. The number k of nearest neighbours for kNNSampler is determined by the fast leave-one-out cross-validation method of Kanagawa (2024), using the observed covariate-response pairs. Specifically, we set n ∈ {2800, 4800, 6800, 8800, 10800}.