ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance

Authors: Jiannan Huang, Jun Hao Liew, Hanshu Yan, Yuyang Yin, Yao Zhao, Humphrey Shi, Yunchao Wei

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive qualitative and quantitative experiments demonstrate that the use of semantic preservation loss effectively improves the compositional abilities of fine-tuning models. Lastly, we also extend our ClassDiffusion to personalized video generation, demonstrating its flexibility.
Researcher Affiliation | Collaboration | 1 Institute of Information Science, Beijing Jiaotong University; 2 Visual Intelligence + X International Joint Laboratory of the Ministry of Education; 3 ByteDance Inc.; 4 SHI Labs @ Georgia Tech
Pseudocode | Yes | Algorithm 1: Algorithm to Convert Character Set to 2D Point Set
Open Source Code | No | The paper discusses the source code of a third-party tool or platform that the authors used for a baseline method (SVDiff), but does not provide their own implementation code for ClassDiffusion or state that their code is open source.
Open Datasets | Yes | Datasets: Following previous work [29, 66, 75], we conduct quantitative experiments on the DreamBooth Dataset [66]. It contains 30 objects, including both live and non-live objects. In addition, we used images from the Textual Inversion Dataset [20] and CustomConcept101 [38] in qualitative experiments.
Dataset Splits | No | The paper mentions using well-known datasets such as the DreamBooth Dataset, the Textual Inversion Dataset, and CustomConcept101, which likely have standard splits defined in their respective original works. However, this paper does not explicitly state the training, validation, or test splits (e.g., percentages, sample counts, or specific methodology) used for these datasets within its own text.
Hardware Specification | Yes | All experiments are conducted on 2 RTX4090 GPUs.
Software Dependencies | Yes | Our method is built on Stable Diffusion v1.5, with a learning rate of 10⁻⁶ and a batch size of 2 for fine-tuning. We used 500 optimization steps for a single concept and 800 for multiple concepts, respectively. During inference, the guidance scale is set to 6.0 and the number of inference steps to 100. The semantic preservation loss weight is set to 1.0 in all experiments. All experiments are conducted on 2 RTX4090 GPUs. Our method takes 6 min to generate single concepts and 11 min for multiple concepts. To better preserve the semantic space, we compute SPL between text embeddings embedded in the semantic space of the Stable Diffusion model. Therefore, we utilize the CLIP [61] text encoder from Stable Diffusion v1.5 [63], specifically clip-vit-large-patch14 [47], to extract the text embeddings of phrases. Following common practice, we use the End of Sequence (EOS) token to represent the semantics of the embeddings.
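The SPL computation quoted above can be sketched as follows. This is a minimal illustration only: it assumes SPL is a cosine-style distance between the EOS-token embeddings of the personalized phrase and its plain class phrase, and the `eos_embedding` helper is a hypothetical deterministic stand-in for the real clip-vit-large-patch14 text encoder, so the snippet runs without model weights.

```python
import math
import random

EMBED_DIM = 768  # hidden size of clip-vit-large-patch14


def eos_embedding(phrase: str) -> list[float]:
    """Stand-in for the CLIP text encoder's EOS-token embedding.

    In the real pipeline this vector would be the EOS-token hidden state
    from the Stable Diffusion v1.5 text encoder; here we generate a
    deterministic pseudo-random vector per phrase for illustration.
    """
    rng = random.Random(phrase)  # same phrase -> same vector
    return [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]


def semantic_preservation_loss(personal_phrase: str, class_phrase: str) -> float:
    """Cosine distance between EOS embeddings of the two phrases.

    E.g. personal_phrase = "a photo of a sks dog",
         class_phrase    = "a photo of a dog".
    """
    u = eos_embedding(personal_phrase)
    v = eos_embedding(class_phrase)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm
```

Identical phrases give a loss of zero, and during fine-tuning this term (weighted by 1.0, per the reported settings) would pull the personalized phrase's embedding back toward its class phrase.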
Experiment Setup | Yes | Implementation details: Our method is built on Stable Diffusion v1.5, with a learning rate of 10⁻⁶ and a batch size of 2 for fine-tuning. We used 500 optimization steps for a single concept and 800 for multiple concepts, respectively. During inference, the guidance scale is set to 6.0 and the number of inference steps to 100. The semantic preservation loss weight is set to 1.0 in all experiments.
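For quick reference, the reported hyperparameters can be collected into a single configuration sketch. Only the values come from the paper; the dictionary and its key names are hypothetical, not part of any released ClassDiffusion code.

```python
# Hypothetical config collecting the hyperparameters reported in the paper.
CLASSDIFFUSION_CONFIG = {
    "base_model": "stable-diffusion-v1-5",   # backbone used for fine-tuning
    "learning_rate": 1e-6,                   # fine-tuning learning rate
    "batch_size": 2,                         # fine-tuning batch size
    "train_steps": {
        "single_concept": 500,               # optimization steps, one concept
        "multi_concept": 800,                # optimization steps, multiple concepts
    },
    "guidance_scale": 6.0,                   # classifier-free guidance at inference
    "inference_steps": 100,                  # denoising steps at inference
    "spl_weight": 1.0,                       # semantic preservation loss weight
}
```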