ExpertDiff: Head-less Model Reprogramming with Diffusion Classifiers for Out-of-Distribution Generalization

Authors: Jee Seok Yoon, Junghyo Sohn, Wootaek Jeong, Heung-Il Suk

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of ExpertDiff on various OOD datasets (i.e., medical and satellite imagery). Furthermore, we qualitatively showcase ExpertDiff's ability to faithfully reconstruct input images, highlighting its potential for both downstream discriminative and upstream generative tasks. Our work paves the way for effectively repurposing powerful foundation models for novel OOD applications requiring domain expertise.
Researcher Affiliation | Academia | Jee Seok Yoon¹, Junghyo Sohn², Wootaek Jeong² and Heung-Il Suk², ¹Department of Brain and Cognitive Engineering, Korea University, Seoul, Republic of Korea, ²Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea, EMAIL
Pseudocode | Yes | Algorithm 1: Training ExpertDiff with Optimal Timestep
Open Source Code | No | For our proposed method, we utilize the publicly available pre-trained Stable Diffusion¹ as the backbone architecture. However, our method is not limited to this specific model and can be applied to most text-to-image diffusion models. (Footnote 1: https://hf.co/stabilityai/stable-diffusion-2-1) There is no explicit statement about the authors releasing their own code for ExpertDiff.
Open Datasets | Yes | To test the effectiveness of our proposed method, we conduct experiments across three medical datasets (breast ultrasound [Al-Dhabyani et al., 2020], chest X-ray [Kermany and others, 2018], and Camelyon17-WILDS breast cancer microscopy [Sagawa et al., 2022]) and the EuroSAT satellite dataset [Helber et al., 2018].
Dataset Splits | Yes | We conducted extensive experiments using the Camelyon17-WILDS dataset [Koh and others, 2021], where the task is to classify samples from unseen domains in domain generalization settings (see Table 2). Specifically, this dataset presents a challenging task of classifying breast cancer metastases in whole-slide histological images of lymph node sections, with images sourced from multiple hospitals representing different domains. We compared ExpertDiff against ContriMix [Nguyen et al., 2024], which is the highest-ranking discriminative model in the official leaderboard³ at the time of writing; CLIP-LP [Radford et al., 2021], which is a linear probe fine-tuned on CLIP embeddings; prompt learning methods; and DiffTTA. Notably, ExpertDiff's performance shows consistent superiority across all data regimes, from zero-shot to fully supervised. This is in contrast to other methods like CoOp, CoCoOp, and DiffTTA, which show limited improvement or even decreased performance as more training data becomes available. The discriminative model outperforms the proposed method in fully supervised settings, but tends to overfit or perform at near-chance levels in few-shot scenarios.
Hardware Specification | Yes | We trained our model for 100,000 iterations for fully supervised learning and 20,000 iterations for few-shot learning using a single Nvidia RTX 4090.
Software Dependencies | Yes | We use the ViT-H/14 CLIP model² for all of the methods and use Stable Diffusion v2.1¹ for diffusion classifiers. (Footnote 1: https://hf.co/stabilityai/stable-diffusion-2-1)
Experiment Setup | Yes | We trained our model for 100,000 iterations for fully supervised learning and 20,000 iterations for few-shot learning using a single Nvidia RTX 4090. Following DiffTTA's zero-shot setting [Prabhudesai et al., 2023], we sample random Gaussian noise x_T ∼ N(x_T; 0, I), pair it with each class's prompt embedding c, and partially reverse-diffuse it via Eq. 2, producing synthetic triplets (x_t, c, ĉ). Then, we train the model for 1,000 iterations using these synthetic samples. For fully supervised learning, all prompt learning methods were trained for 20 epochs, and ExpertDiff and DiffTTA were trained for 100,000 iterations. For a fair comparison, we fixed the number of Monte Carlo samples to 1,000 for diffusion classifiers (i.e., 50 uniformly distributed timepoints × 20 noise samples per timepoint) and for ExpertDiff (i.e., 1 optimal timepoint × 1,000 noise samples).
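The quoted setup contrasts two ways of spending the same Monte Carlo budget for diffusion-classifier scoring: 50 uniform timesteps × 20 noise samples (DiffTTA-style) versus 1 optimal timestep × 1,000 noise samples (ExpertDiff-style). The sketch below illustrates only that budget accounting for a standard diffusion-classifier score (class = lowest expected denoising error); `eps_model`, the noise schedule, and `t_star` are toy stand-ins (assumptions), not the paper's Stable Diffusion backbone or its learned optimal timestep.

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_model(x_t, t, class_emb):
    # Toy stand-in for a text-conditioned denoiser (e.g., Stable Diffusion's
    # UNet); a simple linear map so this sketch runs without any weights.
    return 0.1 * x_t + class_emb * (t / 1000.0)

def diffusion_classifier_score(x0, class_emb, timesteps, n_noise, alphas_bar):
    # Monte Carlo estimate of the class-conditional eps-prediction error;
    # a lower score corresponds to a higher class likelihood (via the ELBO).
    errs = []
    for t in timesteps:
        a = alphas_bar[t]
        for _ in range(n_noise):
            eps = rng.standard_normal(x0.shape)
            x_t = np.sqrt(a) * x0 + np.sqrt(1 - a) * eps  # forward diffusion
            eps_hat = eps_model(x_t, t, class_emb)
            errs.append(np.mean((eps - eps_hat) ** 2))
    return float(np.mean(errs))

# Placeholder noise schedule (assumption): alphas_bar decreasing in t.
alphas_bar = np.linspace(0.9999, 0.01, 1000)

x0 = rng.standard_normal(8)                       # toy "image"
class_embs = [rng.standard_normal(8) for _ in range(2)]  # toy class prompts

# DiffTTA-style budget: 50 uniform timesteps x 20 noise samples = 1,000 evals
uniform_t = np.linspace(0, 999, 50, dtype=int)
scores_uniform = [diffusion_classifier_score(x0, c, uniform_t, 20, alphas_bar)
                  for c in class_embs]

# ExpertDiff-style budget: 1 timestep x 1,000 noise samples = 1,000 evals
t_star = 500  # placeholder; the paper learns this optimal timestep
scores_expert = [diffusion_classifier_score(x0, c, [t_star], 1000, alphas_bar)
                 for c in class_embs]

pred = int(np.argmin(scores_expert))  # class with lowest denoising error
```

Both variants call the denoiser 1,000 times per class, matching the fixed budget in the quoted setup; the difference is solely how those calls are distributed over timesteps.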