MMD Two-sample Testing in the Presence of Arbitrarily Missing Data
Authors: Yijin Zeng, Niall M. Adams, Dean A. Bodenham
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Simulation results show that the method has good statistical power, typically for cases where 5% to 10% of the data are missing. We highlight the value of this approach when the data are missing not at random, a context in which either ignoring the missing values or using common imputation methods may not control the Type I error. |
| Researcher Affiliation | Academia | Yijin Zeng (Imperial College London); Niall Adams (Imperial College London); Dean Bodenham (Imperial College London) |
| Pseudocode | No | The paper provides mathematical derivations and proofs (Lemmas, Theorems) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We evaluate the performance of MMD-Miss on real-world data using MNIST images (LeCun et al., 1998), with examples shown in Figure 4 in Appendix B.4. |
| Dataset Splits | No | The paper describes how samples for X and Y are generated (e.g., from specific labels for MNIST) and how missingness is introduced, but it does not specify explicit train/test/validation dataset splits typically used for model training. The experiments are two-sample hypothesis tests, not machine learning model training with such splits. |
| Hardware Specification | Yes | The experiments were run on a high-performance computing cluster with 325 compute nodes, each equipped with 2x AMD EPYC 7742 processors (128 cores, 1TB RAM per node). |
| Software Dependencies | No | The paper cites 'R package version, 1:21, 2013' for MissForest, which is a baseline imputation method used for comparison, not a software dependency of the proposed method. It does not list version numbers for the software used to implement MMD-Miss. |
| Experiment Setup | Yes | For MMD-Miss, the parameter β in the Laplacian kernel is chosen using the median heuristic, which generally works well (Gretton et al., 2012a; Bodenham & Kawahara, 2023) and is described in Appendix B.2. The number of permutations used for MMD-Perm and the imputation methods is set to B = 100, as described in Appendix B.2. |
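
The Experiment Setup row names three ingredients: a Laplacian kernel, a bandwidth β set by the median heuristic, and B = 100 permutations for the permutation-based tests. The sketch below illustrates how these pieces typically fit together for a standard MMD permutation test on complete data. It is not the authors' implementation and does not handle missing values as MMD-Miss does. The function names are hypothetical, and two details are assumptions rather than facts from the paper: the kernel form k(x, y) = exp(-β‖x − y‖₁) and the heuristic β = 1 / median pairwise L1 distance (Appendix B.2 of the paper gives the authors' exact definition).

```python
import numpy as np


def pairwise_l1(Z):
    # Matrix of L1 distances between all rows of Z.
    return np.abs(Z[:, None, :] - Z[None, :, :]).sum(axis=2)


def median_heuristic_beta(D):
    # ASSUMPTION: beta = 1 / median of the off-diagonal pairwise distances.
    upper = np.triu_indices(D.shape[0], k=1)
    return 1.0 / np.median(D[upper])


def mmd2_biased(K, idx_x, idx_y):
    # Biased (V-statistic) estimate of squared MMD from a pooled Gram matrix.
    k_xx = K[np.ix_(idx_x, idx_x)].mean()
    k_yy = K[np.ix_(idx_y, idx_y)].mean()
    k_xy = K[np.ix_(idx_x, idx_y)].mean()
    return k_xx + k_yy - 2.0 * k_xy


def mmd_permutation_test(X, Y, B=100, seed=0):
    # Hypothetical helper: permutation test with B relabelings of the pooled sample.
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    n, total = len(X), len(X) + len(Y)

    D = pairwise_l1(Z)
    beta = median_heuristic_beta(D)
    K = np.exp(-beta * D)  # ASSUMED Laplacian kernel form

    idx = np.arange(total)
    observed = mmd2_biased(K, idx[:n], idx[n:])

    exceed = 0
    for _ in range(B):
        perm = rng.permutation(total)
        if mmd2_biased(K, perm[:n], perm[n:]) >= observed:
            exceed += 1

    # The +1 correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (B + 1)
    return observed, p_value, beta
```

Calling `mmd_permutation_test(X, Y)` on two arrays of shape `(n, d)` and `(m, d)` returns the observed statistic, a permutation p-value, and the bandwidth used. With B = 100 the smallest attainable p-value is 1/101, so this setting suits tests at the 5% level but not much finer thresholds.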