reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MMD Two-sample Testing in the Presence of Arbitrarily Missing Data

Authors: Yijin Zeng, Niall M. Adams, Dean A. Bodenham

TMLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Simulation results show that the method has good statistical power, typically for cases where 5% to 10% of the data are missing. We highlight the value of this approach when the data are missing not at random, a context in which either ignoring the missing values or using common imputation methods may not control the Type I error.
Researcher Affiliation	Academia	Yijin Zeng, Imperial College London; Niall Adams, Imperial College London; Dean Bodenham, Imperial College London
Pseudocode	No	The paper provides mathematical derivations and proofs (Lemmas, Theorems) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	The paper does not contain any explicit statement about releasing source code for the methodology, nor does it provide a link to a code repository.
Open Datasets	Yes	We evaluate the performance of MMD-Miss on real-world data using MNIST images (LeCun et al., 1998), with examples shown in Figure 4 in Appendix B.4.
Dataset Splits	No	The paper describes how samples for X and Y are generated (e.g., from specific labels for MNIST) and how missingness is introduced, but it does not specify explicit train/test/validation dataset splits typically used for model training. The experiments are two-sample hypothesis tests, not machine learning model training with such splits.
Hardware Specification	Yes	The experiments were run on a high-performance computing cluster with 325 compute nodes, each equipped with 2x AMD EPYC 7742 processors (128 cores, 1TB RAM per node).
Software Dependencies	No	The paper mentions 'R package version, 1:21, 2013' for the missForest method, which is a comparison method rather than a software dependency of the proposed method. It does not list version numbers for the software used to implement MMD-Miss.
Experiment Setup	Yes	For MMD-Miss, the parameter β in the Laplacian kernel is chosen using the median heuristic, which generally works well (Gretton et al., 2012a; Bodenham & Kawahara, 2023) and is described in Appendix B.2. The number of permutations used for MMD-Perm and the imputation methods is set to B = 100, as described in Appendix B.2.
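
The Experiment Setup row names two concrete choices: a Laplacian kernel whose parameter β is set by the median heuristic, and B = 100 permutations for the permutation-based tests. The following is a minimal sketch of how those two choices might fit together in a standard MMD permutation test on fully observed data. It is not the paper's MMD-Miss procedure and does not handle missing values. The kernel form k(x, y) = exp(-β‖x − y‖₁), the choice β = 1 / median pairwise L1 distance, the biased MMD² estimator, and all function names are assumptions made for illustration; the paper's exact definitions are in its Appendix B.2.

```python
import numpy as np


def pairwise_l1(a, b):
    """Return the (len(a), len(b)) matrix of L1 distances between rows."""
    return np.abs(a[:, None, :] - b[None, :, :]).sum(axis=2)


def median_heuristic_beta(z):
    """Assumed median heuristic: beta = 1 / median pairwise L1 distance.

    Assumes the pooled sample has a non-zero median distance
    (true for continuous data without heavy duplication).
    """
    d = pairwise_l1(z, z)
    off_diagonal = d[np.triu_indices_from(d, k=1)]
    return 1.0 / np.median(off_diagonal)


def mmd2_biased(k, n):
    """Biased (V-statistic) MMD^2 from a pooled kernel matrix.

    The first n rows/columns of k belong to sample X, the rest to Y.
    """
    return k[:n, :n].mean() + k[n:, n:].mean() - 2.0 * k[:n, n:].mean()


def mmd_permutation_test(x, y, num_permutations=100, seed=0):
    """Permutation test for H0: X and Y share a distribution.

    Returns the observed MMD^2 and a permutation p-value.
    """
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    n = len(x)

    beta = median_heuristic_beta(z)
    k = np.exp(-beta * pairwise_l1(z, z))  # Laplacian kernel matrix
    observed = mmd2_biased(k, n)

    exceed = 0
    for _ in range(num_permutations):
        perm = rng.permutation(len(z))
        # Relabel the pooled sample by permuting rows and columns together.
        if mmd2_biased(k[np.ix_(perm, perm)], n) >= observed:
            exceed += 1

    # The +1 correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (num_permutations + 1)
    return observed, p_value


rng = np.random.default_rng(1)
x = rng.normal(size=(50, 2))
y_shifted = rng.normal(loc=3.0, size=(50, 2))
stat, p = mmd_permutation_test(x, y_shifted, num_permutations=100)
print(f"MMD^2 = {stat:.4f}, p-value = {p:.4f}")
```

Computing the kernel matrix once and permuting its indices avoids recomputing distances for each of the B permutations, which is the usual reason a small B such as 100 stays cheap.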