Two Sample Testing in High Dimension via Maximum Mean Discrepancy
Authors: Hanjia Gao, Xiaofeng Shao
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical simulations demonstrate the effectiveness of our proposed test statistic and normal approximation. |
| Researcher Affiliation | Academia | Hanjia Gao and Xiaofeng Shao: Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL 61820-5711, USA |
| Pseudocode | No | The paper describes methods and theoretical derivations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The online supplement is available at https://arxiv.org/abs/2109.14913. This link points to an arXiv preprint containing additional technical details and simulation results; the paper does not state that source code for the methodology is provided. |
| Open Datasets | No | The paper generates samples from specified distributions (e.g., "Generate independent samples: $X_1, \ldots, X_n \overset{\text{iid}}{\sim} N(0, \Sigma)$" or "Generate independent samples: $X_1, \ldots, X_n \overset{\text{iid}}{\sim} (V^{1/2} \Sigma V^{1/2})^{1/2} Z_X$"). It does not use or provide access to any publicly available datasets. |
| Dataset Splits | No | The paper describes generation of samples for simulations (e.g., "$n \in \{25, 50, 100, 200, 400\}$ and the data dimensionality $p \in \{25, 50, 100, 200\}$"). Since the data are generated for each experiment, there are no predefined training/test/validation splits discussed or provided. |
| Hardware Specification | No | Table 4: Computational cost under multiple settings. All the numerical results are counted in seconds. The paper mentions computation time but does not specify any hardware details like GPU/CPU models or memory used for the experiments. |
| Software Dependencies | No | The paper mentions running simulations and using Monte Carlo replications and permutations, but it does not list any specific software libraries, packages, or programming languages with version numbers that were used for the implementation. |
| Experiment Setup | Yes | Example 1: Generate independent samples: $X_1, \ldots, X_n \overset{\text{iid}}{\sim} N(0, \Sigma)$, $Y_1, \ldots, Y_m \overset{\text{iid}}{\sim} N(0, \Sigma)$, where $\Sigma = (\sigma_{ij}) \in \mathbb{R}^{p \times p}$ with $\sigma_{ij} = \rho^{\lvert i-j \rvert}$ and $\rho = 0.5$. We set the sample size ratio $m/n = 1$, and consider the setting that $n \in \{25, 50, 100, 200, 400\}$ and the data dimensionality $p \in \{25, 50, 100, 200\}$. As for the kernel $k$, we consider the $L_2$-norm $k_{L_2}(x, y) = \lvert x - y \rvert$, the Gaussian kernel multiplied by $-1$, that is, $k_G(x, y) = -\exp(-\lvert x - y \rvert^2/(2\gamma^2))$ with $\gamma^2 = \mathrm{Median}\{\lvert X_{i_1} - X_{i_2} \rvert^2, \lvert X_i - Y_j \rvert^2, \lvert Y_{j_1} - Y_{j_2} \rvert^2\}$, and the Laplacian kernel multiplied by $-1$, that is, $k_L(x, y) = -\exp(-\lvert x - y \rvert/\gamma)$ with $\gamma = \mathrm{Median}\{\lvert X_{i_1} - X_{i_2} \rvert, \lvert X_i - Y_j \rvert, \lvert Y_{j_1} - Y_{j_2} \rvert\}$. The median heuristic is a popular way of choosing $\gamma$; see Gretton et al. (2012). Example 2: ...consider the setting that $(n, m) \in \{(25, 25), (50, 50), (50, 100), (100, 100), (200, 200)\}$ and $p \in \{50, 100\}$. Here, $V$ is a diagonal matrix with $V^{1/2}_{ii} = 1$ or uniformly drawn from the interval $(1, 5)$. $Z_X, Z_Y$ are iid copies of $Z$ drawn from the following two distributions: (i) $Z = (z_1, \ldots, z_p)$ with $z_1, \ldots, z_p$ iid $N(0, 1)$. (ii) $Z = (z_1 - 1, \ldots, z_p - 1)$ with $z_1, \ldots, z_p$ iid Exponential(1). Example 3: ...consider $(n, m) \in \{(25, 25), (50, 50), (100, 100), (200, 200)\}$, $p \in \{50, 100\}$ and $\beta \in \{0, 0.1, \ldots, 1\}$. Throughout the simulations, our proposed methods are averaged over 5000 Monte Carlo replications, whereas those of the permutation tests are averaged over 1000 Monte Carlo replications... 300 permutations are conducted for each replication. Under the significance level $\alpha = 0.05$, we reject the null hypothesis if $T^{k}_{n,m,p} > \Phi(1 - \alpha)$. |
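
The Example 1 setup and the median-heuristic bandwidths quoted above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' code: the function names (`ar1_covariance`, `pairwise_distances`, `median_bandwidths`, `k_l2`, `k_gauss`, `k_laplace`) are hypothetical, the seed and the choice n = m = 25, p = 50 are arbitrary, and the median is assumed to be taken over distinct within-sample pairs plus all cross-sample pairs, which the quoted text does not spell out.

```python
import numpy as np

def ar1_covariance(p, rho=0.5):
    """Sigma with entries rho^|i-j| (the Example 1 covariance)."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def pairwise_distances(A, B):
    """Euclidean distances between every row of A and every row of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from rounding

def median_bandwidths(X, Y):
    """Return (gamma^2 for the Gaussian kernel, gamma for the Laplacian kernel).

    Assumption: pool distances over distinct pairs within X, within Y,
    and all cross pairs (X_i, Y_j).
    """
    n, m = len(X), len(Y)
    dxx = pairwise_distances(X, X)[np.triu_indices(n, k=1)]
    dyy = pairwise_distances(Y, Y)[np.triu_indices(m, k=1)]
    dxy = pairwise_distances(X, Y).ravel()
    d = np.concatenate([dxx, dxy, dyy])
    return np.median(d ** 2), np.median(d)

# The three kernels, written as functions of a distance matrix D.
def k_l2(D):
    return D

def k_gauss(D, gamma_sq):
    return -np.exp(-D ** 2 / (2.0 * gamma_sq))  # Gaussian kernel times -1

def k_laplace(D, gamma):
    return -np.exp(-D / gamma)  # Laplacian kernel times -1

# Draw one Example 1 data set under the null (arbitrary seed and sizes).
rng = np.random.default_rng(0)
n, m, p = 25, 25, 50
Sigma = ar1_covariance(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Y = rng.multivariate_normal(np.zeros(p), Sigma, size=m)

gamma_sq, gamma = median_bandwidths(X, Y)
D_xy = pairwise_distances(X, Y)
K_gauss_xy = k_gauss(D_xy, gamma_sq)
K_laplace_xy = k_laplace(D_xy, gamma)
```

The sketch stops at the kernel matrices; the studentized statistic and the rejection rule are defined in the paper and are not reproduced here.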