Generative Classifiers Avoid Shortcut Solutions

Authors: Alexander Li, Ananya Kumar, Deepak Pathak

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run experiments on standard distribution shift benchmarks across image and text domains and find that generative classifiers consistently do better under distribution shift than discriminative approaches.
Researcher Affiliation | Academia | Alexander C. Li, Carnegie Mellon University, EMAIL; Ananya Kumar, Stanford University, EMAIL; Deepak Pathak, Carnegie Mellon University, EMAIL
Pseudocode | Yes | Algorithm 1 gives an overview of the generative classification procedure.
Open Source Code | No | The paper does not provide an explicit statement of code release by the authors, nor does it include a link to a code repository. It mentions using "the official training codebase released by Koh et al. (2021)" for baselines, but this is a third-party tool.
Open Datasets | Yes | We consider classification under two types of distribution shift. In subpopulation shift... on CelebA (Liu et al., 2015)... We also consider domain shift... Camelyon17-WILDS (Koh et al., 2021)... We examine 5 common distribution shift benchmarks in total: besides CelebA and Camelyon, we use Waterbirds (Sagawa et al. (2019); subpopulation shift), FMoW (Koh et al. (2021); both subpopulation and domain shift), and CivilComments (Koh et al. (2021); subpopulation shift). ...We additionally run experiments on two highly-used subpopulation shift benchmarks from BREEDS (Santurkar et al., 2020): Living-17 (with 17 animal classes) and Entity-30 (with 30 classes).
Dataset Splits | Yes | We use five standard benchmarks for distribution shift. Camelyon undergoes domain shift, so we report its OOD accuracy on the test data. Waterbirds, CelebA, and CivilComments undergo subpopulation shift, so we report worst-group accuracy. FMoW has both subpopulation shift over regions and a domain shift across time, so we report OOD worst-group accuracy.
Hardware Specification | Yes | Each diffusion model requires about 3 A6000 days to train.
Software Dependencies | No | The paper mentions using AdamW and a Llama-style architecture, but does not provide specific version numbers for software libraries, programming languages (e.g., Python), or frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | We train it from scratch with AdamW (Loshchilov & Hutter, 2017) with a constant base learning rate of 1e-6 and no weight decay or dropout. We did not tune diffusion model hyperparameters and simply used the default settings for conditional image generation. ... For training, we pad shorter sequences to a length of 512 and only compute loss for non-padded tokens. We train for up to 200k iterations... we sweep over learning rate, weight decay, and dropout based on their effect on the data log-likelihood.
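The generative classification procedure the table references (Algorithm 1 in the paper) boils down to scoring each candidate label y by the class-conditional log-likelihood log p(x | y) plus the log-prior, then predicting the argmax; the paper estimates those likelihoods with conditional diffusion models. As a minimal illustrative sketch of the same decision rule, the toy example below substitutes simple Gaussian class-conditional densities for diffusion likelihoods (the class names, means, and priors are made-up assumptions, not values from the paper):

```python
import math

def log_gaussian(x, mean, std):
    # log N(x; mean, std^2) for a single scalar feature.
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def generative_classify(x, class_params, log_prior):
    # Generic generative-classifier rule: argmax_y [log p(x | y) + log p(y)].
    # In the paper, log p(x | y) would come from a class-conditional diffusion
    # model's likelihood estimate rather than a Gaussian.
    scores = {
        y: log_gaussian(x, mean, std) + log_prior[y]
        for y, (mean, std) in class_params.items()
    }
    return max(scores, key=scores.get)

# Illustrative two-class setup (all numbers are hypothetical).
params = {"landbird": (0.0, 1.0), "waterbird": (3.0, 1.0)}
prior = {"landbird": math.log(0.5), "waterbird": math.log(0.5)}
print(generative_classify(2.5, params, prior))  # input lies closer to the waterbird mean
```

The key contrast with a discriminative classifier is that nothing here models p(y | x) directly; each class gets its own density model of the inputs, which is what the paper argues discourages shortcut features.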