Generative Classifiers Avoid Shortcut Solutions
Authors: Alexander Li, Ananya Kumar, Deepak Pathak
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We run experiments on standard distribution shift benchmarks across image and text domains and find that generative classifiers consistently do better under distribution shift than discriminative approaches. |
| Researcher Affiliation | Academia | Alexander C. Li (Carnegie Mellon University, EMAIL), Ananya Kumar (Stanford University, EMAIL), Deepak Pathak (Carnegie Mellon University, EMAIL) |
| Pseudocode | Yes | Algorithm 1 gives an overview of the generative classification procedure. |
| Open Source Code | No | The paper does not provide an explicit statement of code release by the authors for their methodology, nor does it include a link to a code repository. It mentions using "the official training codebase released by Koh et al. (2021)" for baselines, but this is a third-party tool. |
| Open Datasets | Yes | We consider classification under two types of distribution shift. In subpopulation shift... on CelebA (Liu et al., 2015)... We also consider domain shift... Camelyon17-WILDS (Koh et al., 2021)... We examine 5 common distribution shift benchmarks in total: besides CelebA and Camelyon, we use Waterbirds (Sagawa et al. (2019); subpopulation shift), FMoW (Koh et al. (2021); both subpopulation and domain shift), and CivilComments (Koh et al. (2021); subpopulation shift). ...We additionally run experiments on two highly-used subpopulation shift benchmarks from BREEDS (Santurkar et al., 2020): Living-17 (with 17 animal classes) and Entity-30 (with 30 classes). |
| Dataset Splits | Yes | We use five standard benchmarks for distribution shift. Camelyon undergoes domain shift, so we report its OOD accuracy on the test data. Waterbirds, CelebA, and CivilComments undergo subpopulation shift, so we report worst group accuracy. FMoW has both subpopulation shift over regions and a domain shift across time, so we report OOD worst group accuracy. |
| Hardware Specification | Yes | Each diffusion model requires about 3 A6000 days to train. |
| Software Dependencies | No | The paper mentions using AdamW and a Llama-style architecture, but does not provide specific version numbers for software libraries, programming languages (e.g., Python), or frameworks (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | We train it from scratch with AdamW (Loshchilov & Hutter, 2017) with a constant base learning rate of 1e-6 and no weight decay or dropout. We did not tune diffusion model hyperparameters and simply used the default settings for conditional image generation. ... For training, we pad shorter sequences to a length of 512 and only compute loss for non-padded tokens. We train for up to 200k iterations... we sweep over learning rate, weight decay, and dropout based on their effect on the data log-likelihood. |
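The Pseudocode row notes that the paper's Algorithm 1 outlines the generative classification procedure: score each class by the likelihood its class-conditional generative model assigns to the input, then pick the class maximizing log p(x|y) + log p(y). The sketch below is not the paper's diffusion-based implementation; it is a minimal illustration of the same decision rule using diagonal-Gaussian class conditionals on toy data, with all function names our own.

```python
import numpy as np

def fit_class_conditionals(X, y):
    """Fit a diagonal Gaussian p(x|y) and empirical prior p(y) per class."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # (mean, variance, prior); small floor on variance for stability
        models[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-6, len(Xc) / len(X))
    return models

def gaussian_log_likelihood(x, mean, var):
    """log p(x|y) under an independent (diagonal) Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def generative_classify(x, models):
    """Bayes rule: argmax_y log p(x|y) + log p(y), dropping the constant p(x)."""
    scores = {c: gaussian_log_likelihood(x, m, v) + np.log(p)
              for c, (m, v, p) in models.items()}
    return max(scores, key=scores.get)
```

In the paper's setting the Gaussian likelihood is replaced by a learned class-conditional generative model (e.g., a diffusion model's likelihood bound per class), but the argmax over class-conditional log-likelihoods is the same.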