ExpertDiff: Head-less Model Reprogramming with Diffusion Classifiers for Out-of-Distribution Generalization

Authors: Jee Seok Yoon, Junghyo Sohn, Wootaek Jeong, Heung-Il Suk

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of ExpertDiff on various OOD datasets (i.e., medical and satellite imagery). Furthermore, we qualitatively showcase ExpertDiff's ability to faithfully reconstruct input images, highlighting its potential for both downstream discriminative and upstream generative tasks. Our work paves the way for effectively repurposing powerful foundation models for novel OOD applications requiring domain expertise.
Researcher Affiliation | Academia | Jee Seok Yoon¹, Junghyo Sohn², Wootaek Jeong² and Heung-Il Suk², ¹Department of Brain and Cognitive Engineering, Korea University, Seoul, Republic of Korea, ²Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea, EMAIL
Pseudocode | Yes | Algorithm 1: Training ExpertDiff with Optimal Timestep
Open Source Code | No | For our proposed method, we utilize the publicly available pre-trained Stable Diffusion¹ as the backbone architecture. However, our method is not limited to this specific model and can be applied to most text-to-image diffusion models. (Footnote 1: https://hf.co/stabilityai/stable-diffusion-2-1) There is no explicit statement about the authors releasing their own code for ExpertDiff.
Open Datasets | Yes | To test the effectiveness of our proposed method, we conduct experiments across three medical datasets (breast ultrasound [Al-Dhabyani et al., 2020], chest X-ray [Kermany and others, 2018], and Camelyon17-WILDS breast cancer microscopy [Sagawa et al., 2022]) and the EuroSAT satellite dataset [Helber et al., 2018].
Dataset Splits | Yes | We conducted extensive experiments using the Camelyon17-WILDS dataset [Koh and others, 2021], where the task is to classify samples from unseen domains in domain generalization settings (see Table 2). Specifically, this dataset presents a challenging task of classifying breast cancer metastases in whole-slide histological images of lymph node sections, with images sourced from multiple hospitals representing different domains. We compared ExpertDiff against ContriMix [Nguyen et al., 2024], which is the highest-ranking discriminative model in the official leaderboard³ at the time of writing; CLIP-LP [Radford et al., 2021], which is a linear probe fine-tuned on CLIP embeddings; prompt learning methods; and DiffTTA. Notably, ExpertDiff's performance shows consistent superiority across all data regimes, from zero-shot to fully supervised. This is in contrast to other methods like CoOp, CoCoOp, and DiffTTA, which show limited improvement or even decreased performance as more training data becomes available. The discriminative model outperforms the proposed method in fully supervised settings, but tends to overfit or perform at near-chance levels in few-shot scenarios.
Hardware Specification | Yes | We trained our model for 100,000 iterations for fully supervised learning and 20,000 iterations for few-shot learning using a single Nvidia RTX 4090.
Software Dependencies | Yes | We use the ViT-H/14 CLIP model² for all of the methods and use Stable Diffusion v2.1¹ for diffusion classifiers. (Footnote 1: https://hf.co/stabilityai/stable-diffusion-2-1)
Experiment Setup | Yes | We trained our model for 100,000 iterations for fully supervised learning and 20,000 iterations for few-shot learning using a single Nvidia RTX 4090. Following DiffTTA's zero-shot setting [Prabhudesai et al., 2023], we sample random Gaussian noise x_T ∼ N(x_T; 0, I), pair it with each class's prompt embedding c, and partially reverse-diffuse it via Eq. 2, producing synthetic triplets (x_t, c, ĉ). Then, we train the model for 1,000 iterations using these synthetic samples. For fully supervised learning, all prompt learning methods were trained for 20 epochs, and ExpertDiff and DiffTTA were trained for 100,000 iterations. For a fair comparison, we fixed the number of Monte Carlo samples to 1,000 for diffusion classifiers (i.e., 50 uniformly distributed timepoints × 20 noise samples per timepoint) and for ExpertDiff (i.e., 1 optimal timepoint × 1,000 noise samples).
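The quoted setup contrasts two ways of spending the same Monte Carlo budget for diffusion-classifier scoring: 50 uniform timesteps × 20 noise samples (DiffTTA-style) versus 1 optimal timestep × 1,000 noise samples (ExpertDiff-style). The sketch below illustrates only that budget accounting for a standard diffusion-classifier score (class = lowest expected denoising error); `eps_model`, the noise schedule, and `t_star` are toy stand-ins (assumptions), not the paper's Stable Diffusion backbone or its learned optimal timestep.

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_model(x_t, t, class_emb):
    # Toy stand-in for a text-conditioned denoiser (e.g., Stable Diffusion's
    # UNet); a simple linear map so this sketch runs without any weights.
    return 0.1 * x_t + class_emb * (t / 1000.0)

def diffusion_classifier_score(x0, class_emb, timesteps, n_noise, alphas_bar):
    # Monte Carlo estimate of the class-conditional eps-prediction error;
    # a lower score corresponds to a higher class likelihood (via the ELBO).
    errs = []
    for t in timesteps:
        a = alphas_bar[t]
        for _ in range(n_noise):
            eps = rng.standard_normal(x0.shape)
            x_t = np.sqrt(a) * x0 + np.sqrt(1 - a) * eps  # forward diffusion
            eps_hat = eps_model(x_t, t, class_emb)
            errs.append(np.mean((eps - eps_hat) ** 2))
    return float(np.mean(errs))

# Placeholder noise schedule (assumption): alphas_bar decreasing in t.
alphas_bar = np.linspace(0.9999, 0.01, 1000)

x0 = rng.standard_normal(8)                       # toy "image"
class_embs = [rng.standard_normal(8) for _ in range(2)]  # toy class prompts

# DiffTTA-style budget: 50 uniform timesteps x 20 noise samples = 1,000 evals
uniform_t = np.linspace(0, 999, 50, dtype=int)
scores_uniform = [diffusion_classifier_score(x0, c, uniform_t, 20, alphas_bar)
                  for c in class_embs]

# ExpertDiff-style budget: 1 timestep x 1,000 noise samples = 1,000 evals
t_star = 500  # placeholder; the paper learns this optimal timestep
scores_expert = [diffusion_classifier_score(x0, c, [t_star], 1000, alphas_bar)
                 for c in class_embs]

pred = int(np.argmin(scores_expert))  # class with lowest denoising error
```

Both variants call the denoiser 1,000 times per class, matching the fixed budget in the quoted setup; the difference is solely how those calls are distributed over timesteps.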