Instructing Text-to-Image Diffusion Models via Classifier-Guided Semantic Optimization

Authors: Yuanyuan Chang, Yinghua Yao, Tao Qin, Mengmeng Wang, Ivor Tsang, Guang Dai

IJCAI 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Experiments further demonstrate that our method achieves high levels of disentanglement and strong generalization across different domains of data. ... Table 1 and Table 2 show quantitative results. We calculate the LPIPS metric (lower is better) [Zhang et al., 2018] for editing different attributes on human face data and species conversion between animal data." |
| Researcher Affiliation | Collaboration | (1) MOE Key Laboratory for Intelligent Networks and Network Security, Xi'an Jiaotong University; (2) Center for Frontier AI Research, Agency for Science, Technology and Research, Singapore; ... (4) Zhejiang University of Technology; (5) SGIT AI Lab, State Grid Corporation of China |
| Pseudocode | Yes | "Algorithm 1: Training algorithm" |
| Open Source Code | Yes | "Code is available at https://github.com/Chang-yuanyuan/CASO." |
| Open Datasets | Yes | "The datasets used include: FFHQ [Karras et al., 2019], AFHQ [Choi et al., 2020], CelebA-HQ [Karras, 2017] and Stanford Cars [Krause et al., 2013]." |
| Dataset Splits | No | "With a well-trained classifier, only 100-200 images are needed to train the embedding. ... We calculated the FID metrics [Heusel et al., 2017] for the AFHQ Cat and Dog datasets under unconditional reconstruction and guided reconstruction with cat and dog embeddings, respectively." |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, etc.) used for running the experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions using Stable Diffusion v1.5 and a VGG16 model, but does not provide version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | "During training, L is set to 0.3T for human faces and 0.4T for others. During editing, for subtle features like eyebrows, we start to apply our direction from t ∈ [0.1T, 0.3T], while some coarse-grained changes like species require editing at earlier timesteps (t ∈ [0.8T, 0.9T]). The results we show in the main text are all produced with T = 50 timesteps. ... The final training objective is: $\min_{\{e_a\}_{a=1}^{K}} \mathcal{L}_{\mathrm{edit}} + \gamma \mathcal{L}_{\mathrm{rec}}$ (13)." |
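The timestep schedule quoted in the Experiment Setup row can be sketched as a small helper. This is illustrative only: the function name and the fixed T = 50 are assumptions; the two windows (subtle edits at t ∈ [0.1T, 0.3T], coarse edits at t ∈ [0.8T, 0.9T]) come from the paper excerpt above.

```python
def edit_window(T: int, kind: str) -> range:
    """Return the diffusion timesteps at which to apply the learned direction.

    Following the quoted setup: subtle attributes (e.g. eyebrows) are edited
    in t ∈ [0.1T, 0.3T], while coarse changes (e.g. species) need earlier
    timesteps, t ∈ [0.8T, 0.9T]. (Hypothetical helper, not the authors' code.)
    """
    lo, hi = {"subtle": (0.1, 0.3), "coarse": (0.8, 0.9)}[kind]
    return range(int(lo * T), int(hi * T) + 1)

# With T = 50 as in the paper's main-text results:
print(list(edit_window(50, "subtle")))  # timesteps 5 through 15
print(list(edit_window(50, "coarse")))  # timesteps 40 through 45
```

A real editing loop would check `t in edit_window(T, kind)` at each denoising step before adding the learned semantic direction.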
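The structure of the training objective in Eq. (13), a weighted sum $\mathcal{L}_{\mathrm{edit}} + \gamma \mathcal{L}_{\mathrm{rec}}$ minimized over the class embeddings $\{e_a\}_{a=1}^{K}$, can be sketched with toy quadratic stand-ins for the two loss terms. Everything here is assumed for illustration: the value of `gamma`, the toy losses, and the plain gradient-descent loop; the real terms come from the classifier guidance and the diffusion reconstruction loss, which are not reproduced in this review.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 8     # K class embeddings of dimension D (toy sizes)
gamma = 0.1     # hypothetical weight on the reconstruction term

# Stand-ins for classifier-preferred directions (purely illustrative).
targets = rng.normal(size=(K, D))

# The embeddings {e_a}_{a=1}^K being optimized, initialized at zero.
e = np.zeros((K, D))

def losses(e):
    # Toy quadratics: L_edit pulls e toward the targets, L_rec keeps e small.
    l_edit = ((e - targets) ** 2).sum()
    l_rec = (e ** 2).sum()
    return l_edit, l_rec

lr = 0.05
for _ in range(200):
    # Gradient of L_edit + gamma * L_rec (the Eq. 13 shape) w.r.t. e
    grad = 2 * (e - targets) + gamma * 2 * e
    e -= lr * grad

l_edit, l_rec = losses(e)
print(round(float(l_edit + gamma * l_rec), 4))
```

With these quadratics the minimizer is `e = targets / (1 + gamma)`, so the loop can be checked against that closed form; in the actual method the gradients instead flow through the frozen classifier and the diffusion model.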