Two Sample Testing in High Dimension via Maximum Mean Discrepancy

Authors: Hanjia Gao, Xiaofeng Shao

JMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental Numerical simulations demonstrate the effectiveness of our proposed test statistic and normal approximation.
Researcher Affiliation Academia Hanjia Gao (EMAIL), Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL 61820-5711, USA; Xiaofeng Shao (EMAIL), Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL 61820-5711, USA
Pseudocode No The paper describes methods and theoretical derivations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No "The online supplement is available at https://arxiv.org/abs/2109.14913." This link is to an arXiv preprint containing additional technical details and simulation results, but it does not explicitly state that source code for the methodology is provided.
Open Datasets No The paper generates samples from specified distributions (e.g., "Generate independent samples: $X_1, \ldots, X_n \overset{iid}{\sim} N(0, \Sigma)$" or "Generate independent samples: $X_1, \ldots, X_n \overset{iid}{\sim} (V^{1/2} \Sigma V^{1/2})^{1/2} Z_X$"). It does not use or provide access to any publicly available datasets.
Dataset Splits No The paper describes generation of samples for simulations (e.g., "$n \in \{25, 50, 100, 200, 400\}$ and the data dimensionality $p \in \{25, 50, 100, 200\}$"). Since the data is generated for each experiment, there are no predefined training/test/validation splits discussed or provided.
Hardware Specification No "Table 4: Computational cost under multiple settings. All the numerical results are counted in seconds." The paper reports computation time but does not specify any hardware details such as GPU/CPU models or memory used for the experiments.
Software Dependencies No The paper mentions running simulations and using Monte Carlo replications and permutations, but it does not list any specific software libraries, packages, or programming languages with version numbers that were used for the implementation.
Experiment Setup Yes Example 1: Generate independent samples: $X_1, \ldots, X_n \overset{iid}{\sim} N(0, \Sigma)$, $Y_1, \ldots, Y_m \overset{iid}{\sim} N(0, \Sigma)$, where $\Sigma = (\sigma_{ij}) \in \mathbb{R}^{p \times p}$ with $\sigma_{ij} = \rho^{|i-j|}$ and $\rho = 0.5$. We set the sample size ratio $m/n = 1$, and consider the setting that $n \in \{25, 50, 100, 200, 400\}$ and the data dimensionality $p \in \{25, 50, 100, 200\}$. As for the kernel $k$, we consider the $L_2$-norm $k_{L_2}(x, y) = |x - y|$, the Gaussian kernel multiplied by $-1$, that is, $k_G(x, y) = -\exp(-|x - y|^2/(2\gamma^2))$ with $\gamma^2 = \mathrm{Median}\{|X_{i_1} - X_{i_2}|^2, |X_i - Y_j|^2, |Y_{j_1} - Y_{j_2}|^2\}$, and the Laplacian kernel multiplied by $-1$, that is, $k_L(x, y) = -\exp(-|x - y|/\gamma)$ with $\gamma = \mathrm{Median}\{|X_{i_1} - X_{i_2}|, |X_i - Y_j|, |Y_{j_1} - Y_{j_2}|\}$. The median heuristic is a popular way of choosing $\gamma$; see Gretton et al. (2012). Example 2: ...consider the setting that $(n, m) \in \{(25, 25), (50, 50), (50, 100), (100, 100), (200, 200)\}$ and $p \in \{50, 100\}$. Here, $V$ is a diagonal matrix with $V^{1/2}_{ii} = 1$ or uniformly drawn from the interval $(1, 5)$. $Z_X$, $Z_Y$ are iid copies of $Z$ drawn from the following two distributions: (i) $Z = (z_1, \ldots, z_p)$ with $z_1, \ldots, z_p \overset{iid}{\sim} N(0, 1)$. (ii) $Z = (z_1 - 1, \ldots, z_p - 1)$ with $z_1, \ldots, z_p \overset{iid}{\sim} \mathrm{Exponential}(1)$. Example 3: ...consider $(n, m) \in \{(25, 25), (50, 50), (100, 100), (200, 200)\}$, $p \in \{50, 100\}$ and $\beta \in \{0, 0.1, \ldots, 1\}$. Throughout the simulations, our proposed methods are averaged over 5000 Monte Carlo replications, whereas those of the permutation tests are averaged over 1000 Monte Carlo replications... 300 permutations are conducted for each replication. Under the significance level $\alpha = 0.05$, we reject the null hypothesis if $T^{k}_{n,m,p} > \Phi^{-1}(1 - \alpha)$.
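
The Example 1 setup quoted above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code (the page notes that none is released). All function names are hypothetical, |x - y| is assumed to be the Euclidean norm, and the median is assumed to run over all distinct pairs in the pooled sample. The test statistic itself is not defined on this page, so the sketch stops at the kernel matrices and the normal critical value.

```python
import numpy as np
from statistics import NormalDist


def ar1_cov(p, rho=0.5):
    """Covariance matrix with entries sigma_ij = rho ** |i - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def sample_example1(n, m, p, rho=0.5, seed=0):
    """Draw X (n x p) and Y (m x p), both iid N(0, Sigma) as in Example 1."""
    rng = np.random.default_rng(seed)
    sigma = ar1_cov(p, rho)
    mean = np.zeros(p)
    x = rng.multivariate_normal(mean, sigma, size=n)
    y = rng.multivariate_normal(mean, sigma, size=m)
    return x, y


def pairwise_dist(z):
    """Euclidean distances between all rows of z (assumed meaning of |x - y|)."""
    sq = np.sum(z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (z @ z.T)
    d2 = np.maximum(d2, 0.0)   # guard against tiny negative round-off
    np.fill_diagonal(d2, 0.0)  # distance of a point to itself is exactly 0
    return np.sqrt(d2)


def kernel_matrices(x, y):
    """Kernel matrices on the pooled sample with median-heuristic bandwidths."""
    z = np.vstack([x, y])
    d = pairwise_dist(z)
    off = d[np.triu_indices_from(d, k=1)]  # distinct pairs only (assumption)
    gamma_lap = np.median(off)             # gamma for the Laplacian kernel
    gamma2_gauss = np.median(off ** 2)     # gamma^2 for the Gaussian kernel
    return {
        "L2": d,
        "Gaussian": -np.exp(-d ** 2 / (2.0 * gamma2_gauss)),
        "Laplacian": -np.exp(-d / gamma_lap),
    }


# One-sided normal critical value at significance level alpha = 0.05.
ALPHA = 0.05
CRITICAL_VALUE = NormalDist().inv_cdf(1 - ALPHA)

x, y = sample_example1(n=25, m=25, p=50)
kernels = kernel_matrices(x, y)
```

Each entry of `kernels` is an (n + m) x (n + m) matrix over the pooled sample, so the X-X, X-Y, and Y-Y blocks needed by an MMD-type statistic can be sliced out by index. Because `np.median` averages the two middle values for an even number of pairs, the two bandwidths are computed separately instead of squaring one median.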