E-Valuating Classifier Two-Sample Tests
Authors: Teodora Pandeva, Tim Bakker, Christian A. Naesseth, Patrick Forré
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our empirical analysis in Section 6, we use the theoretical properties of E-C2ST to design sequential tests that optimize data usage by segmenting it into multiple batches. Each batch contributes to the cumulative test statistic. This method contrasts with traditional two-sample classifier tests, which derive a test statistic solely from the test set conditioned on the training data. Our approach not only achieves maximum power faster than standard methods but also consistently keeps type I errors well below the significance level. (Section 6, Experiments:) We compare our method to other classifier two-sample tests on the Blob, MNIST, and KDEF data. We empirically show that E-C2ST's ability to construct test statistics from the entire dataset yields statistical power over tests that compute a p-value based only on the train-test data split, while keeping the type I error strictly below the α significance level. |
| Researcher Affiliation | Academia | Teodora Pandeva EMAIL University of Amsterdam Tim Bakker EMAIL University of Amsterdam Christian A. Naesseth EMAIL University of Amsterdam Patrick Forré EMAIL University of Amsterdam |
| Pseudocode | Yes | Algorithm 1 Algorithmic description of E-C2ST. |
| Open Source Code | No | Code will be provided upon acceptance. |
| Open Datasets | Yes | We compare our method to other classifier two-sample tests on the Blob, MNIST, and KDEF data. KDEF Data. The Karolinska Directed Emotional Faces (KDEF) dataset (Lundqvist et al., 1998) is used by Jitkrittum et al. (2016); Lopez-Paz and Oquab (2017); Kirchler et al. (2020) to distinguish between positive (happy, neutral, surprised) and negative (afraid, angry, disgusted) emotions from faces. Corrupted MNIST Data. The MNIST dataset (LeCun et al., 1998) consists of 70 000 handwritten digits. |
| Dataset Splits | Yes | In the baseline case, we split the dataset into train, validation, and test sets with ratio 5:1:1 and fit a classifier. For L-C2ST, the total sample size shown in the second row of Table 1 was divided according to the ratio used in all our experiments, i.e. 5:1:1 for training, validation, and test data, respectively. |
| Hardware Specification | Yes | All experimental runs were performed on NVIDIA GP102 [GeForce GTX 1080 Ti] GPUs. |
| Software Dependencies | Yes | We used the Adam optimizer (Kingma and Ba, 2015) with learning rate 1e-4 (and 5e-4 for the Blob data). For fitting the parameter λ from (10) we used L-BFGS-B (Byrd et al., 1995) as implemented in SciPy (Virtanen et al., 2020: SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python). |
| Experiment Setup | Yes | We used the Adam optimizer (Kingma and Ba, 2015) with learning rate 1e-4 (and 5e-4 for the Blob data). For fitting the parameter λ from (10) we used L-BFGS-B (Byrd et al., 1995) as implemented in SciPy (Virtanen et al., 2020), and we set the initial value to 0.5 unless specified otherwise. The network architectures used are described in Table 2. We trained the models with early stopping with patience 20 for all methods in all cases. |
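The sequential-testing idea quoted in the Research Type row (each batch contributing to a cumulative test statistic, with type I error controlled at level α) can be sketched with a generic e-value product test. This is a minimal illustration of the standard e-value mechanism (reject once the running product reaches 1/α, which controls type I error by Ville's inequality), not the paper's E-C2ST statistic; the per-batch e-values below are synthetic placeholders.

```python
def sequential_e_test(e_values, alpha=0.05):
    """Multiply per-batch e-values and reject H0 once the running
    product reaches 1/alpha. Ville's inequality guarantees type I
    error at most alpha under any stopping rule, which is why the
    observed type I error stays below the significance level."""
    product = 1.0
    for t, e in enumerate(e_values, start=1):
        product *= e
        if product >= 1.0 / alpha:
            return True, t, product  # rejected at batch t
    return False, len(e_values), product

# Illustrative per-batch e-values, growing as evidence accumulates.
rejected, stop_batch, stat = sequential_e_test([1.2, 1.5, 2.0, 3.0, 4.0])
```

With α = 0.05 the threshold is 20, so in this toy run the test rejects only once the fifth batch pushes the product past it.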
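The 5:1:1 train/validation/test split reported in the Dataset Splits row can be reproduced with a simple index partition. The helper name, sample count, and seed below are illustrative, not taken from the paper's code.

```python
import numpy as np

def split_5_1_1(n, seed=0):
    """Partition n sample indices into train/val/test with ratio 5:1:1."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = (5 * n) // 7  # 5 parts of 7
    n_val = n // 7          # 1 part of 7
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_5_1_1(700)
```

For n = 700 this gives 500/100/100 disjoint indices covering the whole dataset.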
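The Software Dependencies and Experiment Setup rows state that the scalar parameter λ from the paper's equation (10) is fitted with L-BFGS-B via SciPy, starting from 0.5. Equation (10) is not reproduced in this report, so the objective below is a placeholder convex function used purely to show the SciPy call pattern and initial value; only the optimizer, its initial point, and the SciPy entry point match the paper's description.

```python
from scipy.optimize import minimize

def placeholder_objective(lam):
    # Stand-in for the paper's Eq. (10) objective (not reproduced here):
    # any smooth function of the scalar lam[0] would do for this sketch.
    return (lam[0] - 0.3) ** 2

res = minimize(
    placeholder_objective,
    x0=[0.5],                # initial value 0.5, as in the paper
    method="L-BFGS-B",
    bounds=[(0.0, 1.0)],     # illustrative bounds for a scalar weight
)
```

`res.x[0]` then holds the fitted λ for whatever objective is supplied.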