MMD Two-sample Testing in the Presence of Arbitrarily Missing Data

Authors: Yijin Zeng, Niall M. Adams, Dean A. Bodenham

TMLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Simulation results show that the method has good statistical power, typically for cases where 5% to 10% of the data are missing. We highlight the value of this approach when the data are missing not at random, a context in which either ignoring the missing values or using common imputation methods may not control the Type I error. |
| Researcher Affiliation | Academia | Yijin Zeng EMAIL Imperial College London; Niall Adams EMAIL Imperial College London; Dean Bodenham EMAIL Imperial College London |
| Pseudocode | No | The paper provides mathematical derivations and proofs (Lemmas, Theorems) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We evaluate the performance of MMD-Miss on real-world data using MNIST images LeCun et al. (1998), with examples shown in Figure 4 in Appendix B.4. |
| Dataset Splits | No | The paper describes how samples for X and Y are generated (e.g., from specific labels for MNIST) and how missingness is introduced, but it does not specify the explicit train/test/validation splits typically used for model training. The experiments are two-sample hypothesis tests, not machine learning model training with such splits. |
| Hardware Specification | Yes | The experiments were run on a high-performance computing cluster with 325 compute nodes, each equipped with 2x AMD EPYC 7742 processors (128 cores, 1TB RAM per node). |
| Software Dependencies | No | The paper mentions 'R package version, 1:21, 2013' for the 'MissForest' method, which is a comparison method rather than a software dependency of the proposed method. It does not list specific version numbers for the software used to implement MMD-Miss. |
| Experiment Setup | Yes | For MMD-Miss, the parameter β in the Laplacian kernel is chosen using the median heuristic, which generally works well (Gretton et al., 2012a; Bodenham & Kawahara, 2023) and is described in Appendix B.2. The number of permutations used for MMD-Perm and the imputation methods is set to B = 100, as described in Appendix B.2. |
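
The Experiment Setup entry names two ingredients that are easy to illustrate: a Laplacian kernel whose parameter β is set by the median heuristic, and a permutation test with B = 100 permutations. The sketch below shows a complete-data permutation MMD test in that spirit, roughly what the MMD-Perm baseline does; it is not MMD-Miss, whose handling of missing values is not described on this page. Several details are assumptions made for illustration only: the kernel form exp(-β‖x − y‖₁), the heuristic variant β = 1 / (median pairwise L1 distance), the use of the unbiased MMD² estimator, and all function names, which are hypothetical. The exact definitions are in Appendix B.2 of the paper and may differ.

```python
import math
import random
from statistics import median


def l1_distance(x, y):
    """L1 distance between two points given as equal-length tuples."""
    return sum(abs(a - b) for a, b in zip(x, y))


def median_heuristic_beta(pooled):
    """Assumed variant: beta = 1 / median of all pairwise L1 distances."""
    dists = [
        l1_distance(pooled[i], pooled[j])
        for i in range(len(pooled))
        for j in range(i + 1, len(pooled))
    ]
    med = median(dists)
    return 1.0 / med if med > 0 else 1.0  # fallback if all points coincide


def gram_matrix(pooled, beta):
    """Laplacian kernel matrix, assuming k(x, y) = exp(-beta * ||x - y||_1)."""
    n = len(pooled)
    return [
        [math.exp(-beta * l1_distance(pooled[i], pooled[j])) for j in range(n)]
        for i in range(n)
    ]


def mmd2_unbiased(K, idx_x, idx_y):
    """Unbiased MMD^2 estimate from a precomputed kernel matrix."""
    n, m = len(idx_x), len(idx_y)
    kxx = sum(K[i][j] for i in idx_x for j in idx_x if i != j) / (n * (n - 1))
    kyy = sum(K[i][j] for i in idx_y for j in idx_y if i != j) / (m * (m - 1))
    kxy = sum(K[i][j] for i in idx_x for j in idx_y) / (n * m)
    return kxx + kyy - 2.0 * kxy


def mmd_permutation_test(xs, ys, num_permutations=100, seed=0):
    """Return (observed MMD^2, permutation p-value, beta) for complete data."""
    rng = random.Random(seed)
    pooled = list(xs) + list(ys)
    n = len(xs)
    beta = median_heuristic_beta(pooled)
    K = gram_matrix(pooled, beta)  # computed once, reused for every permutation
    indices = list(range(len(pooled)))
    observed = mmd2_unbiased(K, indices[:n], indices[n:])
    exceed = 0
    for _ in range(num_permutations):
        rng.shuffle(indices)  # random relabeling of the pooled sample
        if mmd2_unbiased(K, indices[:n], indices[n:]) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic as one of the permutations.
    p_value = (1 + exceed) / (1 + num_permutations)
    return observed, p_value, beta
```

Calling `mmd_permutation_test(xs, ys, num_permutations=100)` on two lists of equal-dimension tuples returns the statistic, the p-value, and the chosen β. With B = 100 the smallest attainable p-value is 1/101, so the test can reject at the 5% level but cannot resolve much finer significance levels.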