Instructing Text-to-Image Diffusion Models via Classifier-Guided Semantic Optimization
Authors: Yuanyuan Chang, Yinghua Yao, Tao Qin, Mengmeng Wang, Ivor Tsang, Guang Dai
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments further demonstrate that our method achieves high levels of disentanglement and strong generalization across different domains of data. ... Table 1 and Table 2 show quantitative results. We calculate the LPIPS metric (lower is better) [Zhang et al., 2018] for editing different attributes on human face data and species conversion between animal data. |
| Researcher Affiliation | Collaboration | 1MOE Key Laboratory for Intelligent Networks and Network Security, Xi'an Jiaotong University 2Center for Frontier AI Research, Agency for Science, Technology and Research, Singapore ... 4Zhejiang University of Technology 5SGIT AI Lab, State Grid Corporation of China |
| Pseudocode | Yes | Algorithm 1 Training algorithm |
| Open Source Code | Yes | Code is available at https://github.com/Chang-yuanyuan/CASO. |
| Open Datasets | Yes | The datasets used include: FFHQ [Karras et al., 2019], AFHQ [Choi et al., 2020], CelebA-HQ [Karras, 2017] and Stanford Cars datasets [Krause et al., 2013]. |
| Dataset Splits | No | With a well-trained classifier, only 100-200 images are needed to train the embedding. ... We calculated the FID metrics [Heusel et al., 2017] for AFHQ Cat and Dog datasets under unconditional reconstruction and guided reconstruction with cat and dog embeddings, respectively. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, etc.) used for running experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions using 'Stable Diffusion-v1.51' and 'VGG16 model' but does not provide specific version numbers for software dependencies like programming languages or libraries (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | During training process, L is set to 0.3T for human face and 0.4T for others. During editing, for subtle features like eyebrows, we start to apply our direction from t ∈ [0.1T, 0.3T], while for some coarse-grained changes like species, editing at earlier timesteps is required (t ∈ [0.8T, 0.9T]). The results we show in the main text are all done with timesteps T = 50. ... The final training objective is as follows: $\min_{\{e_a\}_{a=1}^{K}} \mathcal{L}_{\text{edit}} + \gamma \mathcal{L}_{\text{rec}}$. (13) |
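The setup above can be sketched in code. This is a minimal, hedged illustration only: the function and variable names (`total_loss`, `edit_timestep_range`, `gamma`) are assumptions for exposition and do not come from the paper's released code; the values mirror the quoted settings (objective $\mathcal{L}_{\text{edit}} + \gamma \mathcal{L}_{\text{rec}}$, subtle edits at t ∈ [0.1T, 0.3T], coarse edits at t ∈ [0.8T, 0.9T], T = 50).

```python
def total_loss(edit_loss: float, rec_loss: float, gamma: float) -> float:
    """Combined training objective from Eq. (13): L = L_edit + gamma * L_rec.

    Note: name and signature are illustrative, not from the paper's code.
    """
    return edit_loss + gamma * rec_loss


def edit_timestep_range(coarse: bool, T: int = 50) -> range:
    """Timestep window in which the learned direction is applied.

    Subtle features (e.g. eyebrows) use t in [0.1T, 0.3T]; coarse-grained
    changes (e.g. species conversion) use t in [0.8T, 0.9T], per the paper.
    """
    lo, hi = (0.8, 0.9) if coarse else (0.1, 0.3)
    return range(int(lo * T), int(hi * T) + 1)


# Example with T = 50 as used in the main-text results:
subtle = edit_timestep_range(coarse=False)   # t in {5, ..., 15}
species = edit_timestep_range(coarse=True)   # t in {40, ..., 45}
```

The only design point the sketch encodes is that fine-grained edits are injected late in the denoising trajectory (small t) while coarse semantic changes must start at earlier, noisier timesteps (large t).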