reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Two Sample Testing in High Dimension via Maximum Mean Discrepancy

Authors: Hanjia Gao, Xiaofeng Shao

JMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Numerical simulations demonstrate the effectiveness of our proposed test statistic and normal approximation.
Researcher Affiliation	Academia	Hanjia Gao EMAIL Department of Statistics University of Illinois at Urbana-Champaign Champaign, IL 61820-5711, USA Xiaofeng Shao EMAIL Department of Statistics University of Illinois at Urbana-Champaign Champaign, IL 61820-5711, USA
Pseudocode	No	The paper describes methods and theoretical derivations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	The online supplement is available at https://arxiv.org/abs/2109.14913. This link is to an arXiv preprint containing additional technical details and simulation results, but it does not explicitly state that source code for the methodology is provided.
Open Datasets	No	The paper generates samples from specified distributions (e.g., "Generate independent samples: $X_1, \ldots, X_n$ iid $N(0, \Sigma)$" or "Generate independent samples: $X_1, \ldots, X_n$ iid $(V^{1/2} \Sigma V^{1/2})^{1/2} Z_X$"). It does not use or provide access to any publicly available datasets.
Dataset Splits	No	The paper describes generation of samples for simulations (e.g., "$n \in \{25, 50, 100, 200, 400\}$ and the data dimensionality $p \in \{25, 50, 100, 200\}$"). Since the data is generated for each experiment, there are no predefined training/test/validation splits discussed or provided.
Hardware Specification	No	Table 4: Computational cost under multiple settings. All the numerical results are counted in seconds. The paper mentions computation time but does not specify any hardware details like GPU/CPU models or memory used for the experiments.
Software Dependencies	No	The paper mentions running simulations and using Monte Carlo replications and permutations, but it does not list any specific software libraries, packages, or programming languages with version numbers that were used for the implementation.
Experiment Setup	Yes	Example 1: Generate independent samples: $X_1, \ldots, X_n$ iid $N(0, \Sigma)$, $Y_1, \ldots, Y_m$ iid $N(0, \Sigma)$, where $\Sigma = (\sigma_{ij}) \in \mathbb{R}^{p \times p}$ with $\sigma_{ij} = \rho^{\lvert i-j \rvert}$ and $\rho = 0.5$. We set the sample size ratio $m/n = 1$, and consider the setting that $n \in \{25, 50, 100, 200, 400\}$ and the data dimensionality $p \in \{25, 50, 100, 200\}$. As for the kernel $k$, we consider the L2-norm $k_{L_2}(x, y) = \|x - y\|$, the Gaussian kernel multiplied by $-1$, that is, $k_G(x, y) = -\exp(-\|x - y\|^2/(2\gamma^2))$ with $\gamma^2 = \mathrm{Median}\{\|X_{i_1} - X_{i_2}\|^2, \|X_i - Y_j\|^2, \|Y_{j_1} - Y_{j_2}\|^2\}$, and the Laplacian kernel multiplied by $-1$, that is, $k_L(x, y) = -\exp(-\|x - y\|/\gamma)$ with $\gamma = \mathrm{Median}\{\|X_{i_1} - X_{i_2}\|, \|X_i - Y_j\|, \|Y_{j_1} - Y_{j_2}\|\}$. The median heuristic is a popular way of choosing $\gamma$; see Gretton et al. (2012). Example 2: ...consider the setting that $(n, m) \in \{(25, 25), (50, 50), (50, 100), (100, 100), (200, 200)\}$ and $p \in \{50, 100\}$. Here, $V$ is a diagonal matrix with $V^{1/2}_{ii} = 1$ or uniformly drawn from the interval $(1, 5)$. $Z_X$, $Z_Y$ are iid copies of $Z$ drawn from the following two distributions: (i) $Z = (z_1, \ldots, z_p)$ with $z_1, \ldots, z_p$ iid $N(0, 1)$. (ii) $Z = (z_1 - 1, \ldots, z_p - 1)$ with $z_1, \ldots, z_p$ iid Exponential(1). Example 3: ...consider $(n, m) \in \{(25, 25), (50, 50), (100, 100), (200, 200)\}$, $p \in \{50, 100\}$ and $\beta \in \{0, 0.1, \ldots, 1\}$. Throughout the simulations, our proposed methods are averaged over 5000 Monte Carlo replications, whereas those of the permutation tests are averaged over 1000 Monte Carlo replications... 300 permutations are conducted for each replication. Under the significance level $\alpha = 0.05$, we reject the null hypothesis if $T^k_{n,m,p} > \Phi^{-1}(1 - \alpha)$.
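
The Example 1 setup quoted in the Experiment Setup row can be sketched in Python as follows. This is a minimal illustration, not the authors' code, which the page reports as unavailable. It rests on several assumptions: the function names (`ar1_cov`, `sample_example1`, `kernel_matrices`, `critical_value`) are hypothetical, the median is taken over all distinct pairs of the pooled sample, the Gaussian and Laplacian kernels are read as negated, and the rejection threshold is read as the standard normal quantile at 1 - α. The test statistic itself is not defined in the quoted text, so it is not implemented here.

```python
from statistics import NormalDist

import numpy as np


def ar1_cov(p, rho=0.5):
    """Covariance with entries rho^|i-j|, as described for Example 1."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def sample_example1(n, m, p, rho=0.5, seed=0):
    """Draw X (n x p) and Y (m x p), both iid N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(ar1_cov(p, rho))  # Sigma = chol @ chol.T
    x = rng.standard_normal((n, p)) @ chol.T
    y = rng.standard_normal((m, p)) @ chol.T
    return x, y


def kernel_matrices(x, y):
    """Evaluate the three kernels on the pooled sample.

    Assumption: the median heuristic uses every distinct pair of pooled
    observations (X-X, X-Y and Y-Y pairs together).
    """
    z = np.vstack([x, y])
    # Direct pairwise differences; uses O((n+m)^2 * p) memory, fine for a sketch.
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    upper = np.triu_indices(len(z), k=1)
    gamma = np.median(dist[upper])          # bandwidth for the Laplacian kernel
    gamma_sq = np.median(dist[upper] ** 2)  # squared bandwidth for the Gaussian kernel
    return {
        "l2": dist,
        "gaussian": -np.exp(-dist**2 / (2.0 * gamma_sq)),
        "laplacian": -np.exp(-dist / gamma),
    }


def critical_value(alpha=0.05):
    """Standard normal quantile at 1 - alpha (assumed rejection threshold)."""
    return NormalDist().inv_cdf(1.0 - alpha)


# Smallest Example 1 configuration: n = m = p = 25.
x, y = sample_example1(n=25, m=25, p=25)
kernels = kernel_matrices(x, y)
```

Each returned matrix covers the pooled sample, so the first `n` rows and columns correspond to X and the remainder to Y. The squared and unsquared medians are computed separately because, with an even number of pairs, the median of squared distances need not equal the square of the median distance.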