Integral Probability Metrics Meet Neural Networks: The Radon-Kolmogorov-Smirnov Test

Authors: Seunghoon Paik, Michael Celentano, Alden Green, Ryan J. Tibshirani

JMLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We prove that the RKS test has asymptotically full power at distinguishing any distinct pair P ≠ Q of distributions, derive its asymptotic null distribution, and carry out experiments to elucidate the strengths and weaknesses of the RKS test versus the more traditional kernel MMD test. We complement our theory with numerical experiments to explore the operating characteristics of the RKS test compared to other popular nonparametric two-sample tests. (Section 4, Experiments)
Researcher Affiliation | Academia | Seunghoon Paik (1), Michael Celentano (1), Alden Green (2), Ryan J. Tibshirani (1). (1) Department of Statistics, University of California, Berkeley, CA 94720, USA; (2) Department of Statistics, Stanford University, Stanford, CA 94305, USA
Pseudocode | Yes | For concreteness, we summarize our computational approach below in Algorithm 1. (Algorithm 1: RKS test statistic)
Open Source Code | Yes | Python code to replicate all of our experimental results is available at https://github.com/100shpaik/.
Open Datasets | No | For each dimension d, we consider five settings for P, Q, which are described in Table 1. In each setting, the parameter v controls the discrepancy between P and Q, but its precise meaning depends on the setting. The settings were broadly chosen in order to study the operating characteristics of the RKS test when differences between P and Q occur in one direction (settings 1–4), and in all directions (setting 5). Among the settings in which the differences occur in one direction, we also investigate different varieties (settings 1 and 2: mean shift under different geometries; setting 3: tail difference; setting 4: variance difference). Figure 2 visualizes samples drawn from each task in d = 2 dimensions, whereas Figure 3 exaggerates the deviation between P, Q (larger values of v) to better illustrate the geometry. Finally, we note that since the RKS test is rotationally invariant, the fact that the chosen differences in Table 1 are axis-aligned is just a matter of convenience, and the results would not change if these differences instead occurred along arbitrary directions in R^d. Table 1: Experimental settings. Here N_d(µ, Σ) means the d-dimensional normal distribution with mean µ and covariance Σ, and t(v) means the t distribution with v degrees of freedom.
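The settings quoted above can be sketched concretely. A minimal illustration of two of the described varieties (a mean shift for N_d(µ, Σ) and a tail difference via t(v)); the specific values of v and the degrees of freedom here are illustrative assumptions, not the paper's Table 1 parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, v = 2, 512, 0.5  # dimension, sample size, discrepancy parameter (illustrative)

# Mean-shift style setting: P = N_d(0, I), Q = N_d(v * e_1, I).
P_shift = rng.standard_normal((n, d))
Q_shift = rng.standard_normal((n, d))
Q_shift[:, 0] += v

# Tail-difference style setting: Q's first coordinate drawn from a heavy-tailed
# t distribution (3 degrees of freedom here, purely for illustration).
Q_tail = rng.standard_normal((n, d))
Q_tail[:, 0] = rng.standard_t(df=3, size=n)
```

Since both differences are axis-aligned (along e_1), rotational invariance of the RKS test means any rotated version of these settings would behave identically.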
Dataset Splits | No | We fix the sample sizes to m = n = 512 throughout, and study four choices of dimension: d = 2, 4, 8, 16. For each setting, we compute these test statistics under the null, where each xi and yi are sampled i.i.d. from the mixture m/(m+n) P + n/(m+n) Q, and under the alternative, where the xi are i.i.d. from P and the yi from Q. We then repeat this 100 times (draws of samples, and computation of test statistics), and trace out ROC curves (true positive versus false positive rates) as we vary the rejection threshold for each test.
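The ROC construction described in this row can be sketched directly: collect the statistic under the null (pooled mixture) and the alternative over repetitions, then sweep the rejection threshold. The statistic below is a stand-in (absolute difference of sample means), not the RKS statistic itself:

```python
import numpy as np

rng = np.random.default_rng(0)
reps, m = 100, 512  # 100 repetitions, sample size 512, as in the excerpt

# Stand-in test statistic: |mean(x) - mean(y)| for 1-d samples.
null_stats = np.array([abs(rng.normal(0, 1, m).mean() - rng.normal(0, 1, m).mean())
                       for _ in range(reps)])
alt_stats = np.array([abs(rng.normal(0, 1, m).mean() - rng.normal(0.2, 1, m).mean())
                      for _ in range(reps)])

# Sweep the rejection threshold from largest to smallest observed statistic,
# tracing out the ROC curve (true positive rate vs. false positive rate).
thresholds = np.sort(np.r_[null_stats, alt_stats])[::-1]
tpr = [(alt_stats >= t).mean() for t in thresholds]
fpr = [(null_stats >= t).mean() for t in thresholds]
```

As the threshold decreases, both rates climb monotonically from 0 toward 1, yielding the ROC curve.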
Hardware Specification | No | For k ≥ 1, we apply the torch.optim.Adam optimizer (a variation on gradient descent), as implemented in PyTorch, to (10). For k = 0, such a first-order scheme is not applicable, because the gradient of the 0th degree ridge spline (w^T x − b)_+^0 = 1{w^T x ≥ b} (with respect to w and b) is almost everywhere zero. As a surrogate, we directly approximate the optimum (w*, b*) in (2) using logistic regression, where the class labels identify samples from P versus Q, as implemented in sklearn.linear_model.LogisticRegression in Python.
Software Dependencies | No | For k ≥ 1, we apply the torch.optim.Adam optimizer (a variation on gradient descent), as implemented in PyTorch, to (10). For k = 0, such a first-order scheme is not applicable, because the gradient of the 0th degree ridge spline (w^T x − b)_+^0 = 1{w^T x ≥ b} (with respect to w and b) is almost everywhere zero. As a surrogate, we directly approximate the optimum (w*, b*) in (2) using logistic regression, where the class labels identify samples from P versus Q, as implemented in sklearn.linear_model.LogisticRegression in Python.
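The k = 0 surrogate quoted above can be sketched as follows: fit a logistic regression with labels distinguishing samples from P and Q, and read off the hyperplane (w, b). Regularization and all distributional details here are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_p = rng.normal(0.0, 1.0, size=(512, 2))   # stand-in sample from P
X_q = rng.normal(0.5, 1.0, size=(512, 2))   # stand-in sample from Q (mean shift)
X = np.vstack([X_p, X_q])
y = np.r_[np.zeros(512), np.ones(512)]      # class labels identify P vs. Q

clf = LogisticRegression().fit(X, y)
w = clf.coef_[0]
b = -clf.intercept_[0]                      # decision boundary: w^T x = b

# The induced k = 0 statistic compares empirical means of the indicator
# 1{w^T x >= b} across the two samples (a Kolmogorov-Smirnov-style gap).
stat = abs(float((X_q @ w >= b).mean() - (X_p @ w >= b).mean()))
```

This sidesteps the almost-everywhere-zero gradient of the indicator, since logistic regression's smooth loss supplies usable gradients for locating (w*, b*).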
Experiment Setup | Yes | For k ≥ 1, we apply the torch.optim.Adam optimizer (a variation on gradient descent), as implemented in PyTorch, to (10). We use a betas parameter of (0.9, 0.99), learning rate 0.5, number of iterations T = 200, penalty parameter λ = 1, and number of neurons N = 10. To enforce the nonnegativity condition on b, we project b to [0, ∞) after each gradient step. Rather than take the last iterate, we choose the maximal IPM value among the iterates (after rescaling by the RTV_k seminorm of each iterate so that it lies in the unit seminorm ball). Further, we repeat this over three random initializations, and select the best resulting IPM value as the final output. We fix the sample sizes to m = n = 512 throughout, and study four choices of dimension: d = 2, 4, 8, 16. For the RKS tests, we examine smoothness degrees k = 0, 1, 2, 3, and we center the input data to have sample mean zero jointly across both samples.
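The optimization loop described in this row can be sketched in PyTorch. This is a minimal sketch under stated assumptions: we assume objective (10) maximizes the empirical mean difference of a width-N ridge network f(x) = Σ_j a_j (w_j^T x − b_j)_+^k minus λ Σ_j |a_j| (using Σ_j |a_j| as a stand-in for the RTV_k seminorm when each w_j has unit norm), which may differ from the paper's exact formulation; the data here are also stand-ins:

```python
import torch

torch.manual_seed(0)
k, N, d, T, lam = 1, 10, 2, 200, 1.0        # degree, neurons, dim, iterations, penalty
X = torch.randn(512, d)                      # stand-in sample from P
Y = torch.randn(512, d) + 0.3                # stand-in sample from Q

a = torch.randn(N, requires_grad=True)       # output weights
W = torch.randn(N, d, requires_grad=True)    # directions (normalized in the loop)
b = torch.rand(N, requires_grad=True)        # offsets, kept nonnegative by projection

opt = torch.optim.Adam([a, W, b], lr=0.5, betas=(0.9, 0.99))
best = 0.0
for _ in range(T):
    Wn = W / W.norm(dim=1, keepdim=True)     # unit-norm ridge directions
    out_X = (torch.clamp(X @ Wn.T - b, min=0) ** k) @ a
    out_Y = (torch.clamp(Y @ Wn.T - b, min=0) ** k) @ a
    ipm = out_X.mean() - out_Y.mean()
    seminorm = a.abs().sum()                 # stand-in RTV_k seminorm
    with torch.no_grad():
        if seminorm.item() > 0:              # rescale into the unit seminorm ball
            best = max(best, abs((ipm / seminorm).item()))
    loss = -(ipm - lam * seminorm)           # ascend the penalized objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        b.clamp_(min=0)                      # project b onto [0, inf) after each step
```

Per the excerpt, this whole loop would be repeated over three random initializations, keeping the largest rescaled IPM value as the final test statistic.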