Consistent Distribution-Free $K$-Sample and Independence Tests for Univariate Random Variables
Authors: Ruth Heller, Yair Heller, Shachar Kaufman, Barak Brill, Malka Gorfine
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that the power of the regularized tests is excellent compared to existing tests, and almost as powerful as the tests based on the optimal (yet unknown in practice) partition size, in simulations as well as on a real data example. [...] In simulations, we compared the power of our different test statistics in a wide range of scenarios. [...] We used 20,000 simulated data sets, in each of the configurations of Figure 2. [...] In Section 5 we analyze the yeast gene expression data set from Hughes et al. (2000). |
| Researcher Affiliation | Academia | Ruth Heller, Department of Statistics and Operations Research, Tel Aviv University, Tel Aviv 69978, Israel [...] Shachar Kaufman, Barak Brill, Malka Gorfine, Department of Statistics and Operations Research, Tel Aviv University, Tel Aviv 69978, Israel |
| Pseudocode | No | We present innovative algorithms for the computation of the tests, which are essential for large m since the computational complexity of the naive algorithm is exponential in m. The algorithms are described textually (e.g., 'The algorithm proceeds as follows.') but no structured pseudocode blocks are present. |
| Open Source Code | Yes | Efficient implementations of all statistics and tests described herein are available in the R package HHG, which can be freely downloaded from the Comprehensive R Archive Network, http://cran.r-project.org/. |
| Open Datasets | Yes | In Section 5 we analyze the yeast gene expression data set from Hughes et al. (2000). |
| Dataset Splits | No | The paper mentions using '20,000 simulated data sets' and analyzing 'the yeast gene expression data set from Hughes et al. (2000)' with N=300 expression levels. However, it does not specify any training/test/validation splits for these datasets, nor does it provide details on how the data was partitioned for model evaluation beyond the statistical testing methodology itself. |
| Hardware Specification | No | The paper discusses the computational complexity of the algorithms (e.g., O(N^2), O(N^4)) but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or simulations. |
| Software Dependencies | No | The paper states that 'Efficient implementations of all statistics and tests described herein are available in the R package HHG', but it does not specify a version number for the R package HHG or any other software dependencies. |
| Experiment Setup | Yes | All tests were performed at the 0.05 significance level. Look-up tables of the quantiles of the null distributions of the test statistics for a given N were stored. Power was estimated by the fraction of test statistics that were at least as large as the 95th percentile of the null distribution. The null tables were based on 10^6 permutations. The noise level was chosen separately for each configuration and sample size, so that the power is reasonable for at least some of the variants. We used 20,000 simulated data sets, in each of the configurations of Figure 2. |
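
The power-estimation procedure quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the paper's code: the rank statistic is a simple stand-in (absolute difference in mean ranks) rather than any of the paper's statistics, the function names `rank_stat`, `null_threshold`, and `estimate_power` are hypothetical, the Gaussian shift alternative is an assumed example, and the replication counts are far smaller than the 10^6 permutations and 20,000 data sets reported. Because the stand-in statistic depends only on ranks, its null distribution depends only on the sample sizes, so the threshold can be computed once and reused, which is the idea behind the stored look-up tables.

```python
import math
import random


def rank_stat(x, y):
    """Stand-in rank statistic (NOT the paper's): |mean rank of x - mean rank of y|."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks_x = [i + 1 for i, (_, g) in enumerate(pooled) if g == 0]
    ranks_y = [i + 1 for i, (_, g) in enumerate(pooled) if g == 1]
    return abs(sum(ranks_x) / len(ranks_x) - sum(ranks_y) / len(ranks_y))


def null_threshold(n_x, n_y, n_perm, level, rng):
    """Approximate the (1 - level) quantile of the null distribution by permuting ranks."""
    ranks = list(range(1, n_x + n_y + 1))
    stats = []
    for _ in range(n_perm):
        rng.shuffle(ranks)  # random assignment of ranks to the two groups
        stats.append(rank_stat(ranks[:n_x], ranks[n_x:]))
    stats.sort()
    idx = min(len(stats) - 1, math.ceil((1 - level) * len(stats)) - 1)
    return stats[idx]


def estimate_power(threshold, shift, n_x, n_y, n_sim, rng):
    """Fraction of simulated data sets whose statistic is at least the null threshold."""
    rejections = 0
    for _ in range(n_sim):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_x)]
        y = [rng.gauss(shift, 1.0) for _ in range(n_y)]  # assumed shift alternative
        if rank_stat(x, y) >= threshold:
            rejections += 1
    return rejections / n_sim


if __name__ == "__main__":
    rng = random.Random(0)
    thr = null_threshold(20, 20, n_perm=5000, level=0.05, rng=rng)
    print("null threshold:", thr)
    print("estimated power:", estimate_power(thr, 1.0, 20, 20, n_sim=2000, rng=rng))
```

Setting `shift=0.0` in `estimate_power` gives a quick sanity check: the rejection rate should then sit near the nominal 0.05 level.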