kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions
Authors: Parastoo Pashmchi, Jérôme Benoit, Motonobu Kanagawa
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments illustrate the performance of kNNSampler. The code for kNNSampler is made publicly available. We report experimental results on synthetic data in Section 4 and on real solar-power data in Section 5. |
| Researcher Affiliation | Collaboration | Parastoo Pashmchi (SAP Labs France E-Mobility Research; EURECOM, Sophia Antipolis, France); Jérôme Benoit (SAP Labs France E-Mobility Research); Motonobu Kanagawa (EURECOM, Sophia Antipolis, France) |
| Pseudocode | Yes | Algorithm 1 (kNNSampler). Input: number of nearest neighbors k; covariates x̃1, …, x̃m ∈ X with missing responses; observed covariate-response pairs (x1, y1), …, (xn, yn) ∈ X × Y. Output: imputed responses ŷ1,imp, …, ŷm,imp ∈ Y. For i = 1 to m: set ŷi,imp := yj, where j ∈ {1, …, n} is uniformly sampled from NN(x̃i, k, Xn) in equation 4, the indices of the k nearest neighbors of x̃i in Xn = {x1, …, xn}. |
| Open Source Code | Yes | The code for kNNSampler is made publicly available at https://github.com/SAP/knn-sampler |
| Open Datasets | Yes | We use a Kaggle dataset (https://www.kaggle.com/datasets/samuelkamau/solar-data/) that contains solar panel DC powers (responses) and the corresponding irradiations (covariates), totaling 67,698 covariate-response pairs. |
| Dataset Splits | Yes | The number k of nearest neighbors is a hyperparameter of kNNSampler. The theoretical and empirical results below indicate that k should not be fixed to a prespecified value (e.g., k = 5), and should be chosen depending on the available data. One way is to perform cross-validation for kNN regression on the data (x1, y1), …, (xn, yn) and select the k among candidate values that minimizes the mean-square error on held-out observed responses, averaged over different training-validation splits. In particular, the present work uses Leave-One-Out Cross-Validation (LOOCV) via the fast computation method recently proposed by Kanagawa (2024). |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory) are mentioned in the paper for running experiments. |
| Software Dependencies | No | kNNImputer (Troyanskaya et al., 2001) is one of the most widely used imputation methods, owing to its simplicity and availability in popular software packages such as scikit-learn (Pedregosa et al., 2011). The paper mentions scikit-learn but does not specify its version number, or versions for other software dependencies. |
| Experiment Setup | Yes | We set the number k of nearest neighbours to k = 5, which is the default setting in scikit-learn and widely used in practice. We use the authors' recommended settings: inverse temperature τ = 50 and kernel bandwidth h = 0.03. The number k of nearest neighbours for kNNSampler is determined by the fast leave-one-out cross-validation method of Kanagawa (2024) using the observed covariate-response pairs. Specifically, we set n ∈ {2800, 4800, 6800, 8800, 10800}. |
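The pseudocode quoted in the table can be sketched in a few lines. The following is an illustrative NumPy implementation, not the authors' released code: it assumes array inputs, Euclidean distances, and a function name (`knn_sampler`) of our own choosing.

```python
import numpy as np

def knn_sampler(X_query, X_obs, y_obs, k, seed=None):
    """kNNSampler (Algorithm 1): for each query covariate with a missing
    response, impute by drawing uniformly at random from the responses
    of its k nearest neighbors among the observed pairs."""
    rng = np.random.default_rng(seed)
    # pairwise squared Euclidean distances, shape (m, n)
    d2 = ((X_query[:, None, :] - X_obs[None, :, :]) ** 2).sum(-1)
    # indices of the k nearest observed covariates per query point
    nn_idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    # sample one of the k neighbors uniformly for each query
    pick = rng.integers(0, k, size=len(X_query))
    return y_obs[nn_idx[np.arange(len(X_query)), pick]]
```

Because each imputation is a random draw rather than a neighbor average, repeated calls with different seeds yield multiple imputations, which is what lets the method recover the conditional distribution of the missing responses rather than only their conditional mean.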
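The cross-validation procedure for choosing k described in the table can be illustrated with a naive O(n²) leave-one-out loop. Note that this is only a sketch of the selection criterion: the paper uses the fast LOOCV computation of Kanagawa (2024), which this code does not implement, and the function name `loocv_select_k` is hypothetical.

```python
import numpy as np

def loocv_select_k(X, y, k_grid):
    """Naive leave-one-out CV for kNN regression: for each candidate k,
    predict each y_i from its k nearest *other* points and return the k
    with the smallest mean squared error."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # exclude each point from its own neighbors
    order = np.argsort(d2, axis=1)
    best_k, best_mse = None, np.inf
    for k in k_grid:
        preds = y[order[:, :k]].mean(axis=1)  # kNN regression estimate
        mse = ((preds - y) ** 2).mean()
        if mse < best_mse:
            best_k, best_mse = k, mse
    return best_k
```

The selected k would then be passed to the sampler; held-out mean-square error on observed responses is the criterion named in the paper excerpt.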