E-Valuating Classifier Two-Sample Tests
Authors: Teodora Pandeva, Tim Bakker, Christian A. Naesseth, Patrick Forré
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our empirical analysis in Section 6, we use the theoretical properties of E-C2ST to design sequential tests that optimize data usage by segmenting it into multiple batches. Each batch contributes to the cumulative test statistic. This method contrasts with traditional two-sample classifier tests, which derive a test statistic solely from the test set conditioned on the training data. Our approach not only achieves maximum power faster than standard methods but also consistently keeps type I errors well below the significance level. (Section 6, Experiments:) We compare our method to other classifier two-sample tests on the Blob, MNIST, and KDEF data. We empirically show that E-C2ST's ability to construct test statistics from the entire dataset yields statistical power over tests that compute a p-value based only on the train-test data split, while keeping the type I error strictly below the α significance level. |
| Researcher Affiliation | Academia | Teodora Pandeva EMAIL University of Amsterdam Tim Bakker EMAIL University of Amsterdam Christian A. Naesseth EMAIL University of Amsterdam Patrick Forré EMAIL University of Amsterdam |
| Pseudocode | Yes | Algorithm 1 Algorithmic description of E-C2ST. |
| Open Source Code | No | Code will be provided upon acceptance. |
| Open Datasets | Yes | We compare our method to other classifier two-sample tests on the Blob, MNIST, and KDEF data. KDEF Data. The Karolinska Directed Emotional Faces (KDEF) dataset (Lundqvist et al., 1998) is used by Jitkrittum et al. (2016); Lopez-Paz and Oquab (2017); Kirchler et al. (2020) to distinguish between positive (happy, neutral, surprised) and negative (afraid, angry, disgusted) emotions from faces. Corrupted MNIST Data. The MNIST dataset (LeCun et al., 1998) consists of 70 000 handwritten digits. |
| Dataset Splits | Yes | In the baseline case, we split the dataset into train, validation, and test sets with ratio 5:1:1 and fit a classifier. For L-C2ST, the total sample size shown in the second row of Table 1 was divided according to the ratio used in all our experiments, i.e. 5:1:1 for training, validation, and test data, respectively. |
| Hardware Specification | Yes | All experimental runs were performed on NVIDIA GP102 [GeForce GTX 1080 Ti] GPUs. |
| Software Dependencies | Yes | We used the Adam optimizer (Kingma and Ba, 2015) with learning rate 1e-4 (and 5e-4 for the Blob data). For fitting the parameter λ from (10) we used L-BFGS-B (Byrd et al., 1995) as implemented in SciPy (Virtanen et al., 2020: SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python). |
| Experiment Setup | Yes | We used the Adam optimizer (Kingma and Ba, 2015) with learning rate 1e-4 (and 5e-4 for the Blob data). For fitting the parameter λ from (10) we used L-BFGS-B (Byrd et al., 1995) as implemented in SciPy (Virtanen et al., 2020), and we set the initial value to 0.5 unless specified otherwise. The network architectures used are described in Table 2. We trained the models with early stopping with patience 20 for all methods in all cases. |
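The sequential-testing idea quoted in the Research Type row (each batch contributing to a cumulative test statistic, with type I error controlled at level α) can be sketched with a generic e-value product test. This is a minimal illustration of the standard e-value mechanism (reject once the running product reaches 1/α, which controls type I error by Ville's inequality), not the paper's E-C2ST statistic; the per-batch e-values below are synthetic placeholders.

```python
def sequential_e_test(e_values, alpha=0.05):
    """Multiply per-batch e-values and reject H0 once the running
    product reaches 1/alpha. Ville's inequality guarantees type I
    error at most alpha under any stopping rule, which is why the
    observed type I error stays below the significance level."""
    product = 1.0
    for t, e in enumerate(e_values, start=1):
        product *= e
        if product >= 1.0 / alpha:
            return True, t, product  # rejected at batch t
    return False, len(e_values), product

# Illustrative per-batch e-values, growing as evidence accumulates.
rejected, stop_batch, stat = sequential_e_test([1.2, 1.5, 2.0, 3.0, 4.0])
```

With α = 0.05 the threshold is 20, so in this toy run the test rejects only once the fifth batch pushes the product past it.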
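The 5:1:1 train/validation/test split reported in the Dataset Splits row can be reproduced with a simple index partition. The helper name, sample count, and seed below are illustrative, not taken from the paper's code.

```python
import numpy as np

def split_5_1_1(n, seed=0):
    """Partition n sample indices into train/val/test with ratio 5:1:1."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = (5 * n) // 7  # 5 parts of 7
    n_val = n // 7          # 1 part of 7
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_5_1_1(700)
```

For n = 700 this gives 500/100/100 disjoint indices covering the whole dataset.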
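The Software Dependencies and Experiment Setup rows state that the scalar parameter λ from the paper's equation (10) is fitted with L-BFGS-B via SciPy, starting from 0.5. Equation (10) is not reproduced in this report, so the objective below is a placeholder convex function used purely to show the SciPy call pattern and initial value; only the optimizer, its initial point, and the SciPy entry point match the paper's description.

```python
from scipy.optimize import minimize

def placeholder_objective(lam):
    # Stand-in for the paper's Eq. (10) objective (not reproduced here):
    # any smooth function of the scalar lam[0] would do for this sketch.
    return (lam[0] - 0.3) ** 2

res = minimize(
    placeholder_objective,
    x0=[0.5],                # initial value 0.5, as in the paper
    method="L-BFGS-B",
    bounds=[(0.0, 1.0)],     # illustrative bounds for a scalar weight
)
```

`res.x[0]` then holds the fitted λ for whatever objective is supplied.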