Generative Classifiers Avoid Shortcut Solutions

Authors: Alexander Li, Ananya Kumar, Deepak Pathak

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run experiments on standard distribution shift benchmarks across image and text domains and find that generative classifiers consistently do better under distribution shift than discriminative approaches.
Researcher Affiliation | Academia | Alexander C. Li, Carnegie Mellon University, EMAIL; Ananya Kumar, Stanford University, EMAIL; Deepak Pathak, Carnegie Mellon University, EMAIL
Pseudocode | Yes | Algorithm 1 gives an overview of the generative classification procedure.
Open Source Code | No | The paper does not provide an explicit statement of code release by the authors, nor does it include a link to a code repository. It mentions using "the official training codebase released by Koh et al. (2021)" for baselines, but this is a third-party tool.
Open Datasets | Yes | We consider classification under two types of distribution shift. In subpopulation shift... on CelebA (Liu et al., 2015)... We also consider domain shift... Camelyon17-WILDS (Koh et al., 2021)... We examine 5 common distribution shift benchmarks in total: besides CelebA and Camelyon, we use Waterbirds (Sagawa et al. (2019); subpopulation shift), FMoW (Koh et al. (2021); both subpopulation and domain shift), and CivilComments (Koh et al. (2021); subpopulation shift). ...We additionally run experiments on two highly-used subpopulation shift benchmarks from BREEDS (Santurkar et al., 2020): Living-17 (with 17 animal classes) and Entity-30 (with 30 classes).
Dataset Splits | Yes | We use five standard benchmarks for distribution shift. Camelyon undergoes domain shift, so we report its OOD accuracy on the test data. Waterbirds, CelebA, and CivilComments undergo subpopulation shift, so we report worst-group accuracy. FMoW has both subpopulation shift over regions and a domain shift across time, so we report OOD worst-group accuracy.
Hardware Specification | Yes | Each diffusion model requires about 3 A6000 days to train.
Software Dependencies | No | The paper mentions using AdamW and a Llama-style architecture, but does not provide specific version numbers for software libraries, programming languages (e.g., Python), or frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | We train it from scratch with AdamW (Loshchilov & Hutter, 2017) with a constant base learning rate of 1e-6 and no weight decay or dropout. We did not tune diffusion model hyperparameters and simply used the default settings for conditional image generation. ... For training, we pad shorter sequences to a length of 512 and only compute loss for non-padded tokens. We train for up to 200k iterations... we sweep over learning rate, weight decay, and dropout based on their effect on the data log-likelihood.
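The generative classification procedure the table references (Algorithm 1 in the paper) boils down to scoring each candidate label y by the class-conditional log-likelihood log p(x | y) plus the log-prior, then predicting the argmax; the paper estimates those likelihoods with conditional diffusion models. As a minimal illustrative sketch of the same decision rule, the toy example below substitutes simple Gaussian class-conditional densities for diffusion likelihoods (the class names, means, and priors are made-up assumptions, not values from the paper):

```python
import math

def log_gaussian(x, mean, std):
    # log N(x; mean, std^2) for a single scalar feature.
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def generative_classify(x, class_params, log_prior):
    # Generic generative-classifier rule: argmax_y [log p(x | y) + log p(y)].
    # In the paper, log p(x | y) would come from a class-conditional diffusion
    # model's likelihood estimate rather than a Gaussian.
    scores = {
        y: log_gaussian(x, mean, std) + log_prior[y]
        for y, (mean, std) in class_params.items()
    }
    return max(scores, key=scores.get)

# Illustrative two-class setup (all numbers are hypothetical).
params = {"landbird": (0.0, 1.0), "waterbird": (3.0, 1.0)}
prior = {"landbird": math.log(0.5), "waterbird": math.log(0.5)}
print(generative_classify(2.5, params, prior))  # input lies closer to the waterbird mean
```

The key contrast with a discriminative classifier is that nothing here models p(y | x) directly; each class gets its own density model of the inputs, which is what the paper argues discourages shortcut features.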