ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance
Authors: Jiannan Huang, Jun Hao Liew, Hanshu Yan, Yuyang Yin, Yao Zhao, Humphrey Shi, Yunchao Wei
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive qualitative and quantitative experiments demonstrate that the use of semantic preservation loss effectively improves the compositional abilities of fine-tuning models. Lastly, we also extend our ClassDiffusion to personalized video generation, demonstrating its flexibility. |
| Researcher Affiliation | Collaboration | 1 Institute of Information Science, Beijing Jiaotong University 2 Visual Intelligence + X International Joint Laboratory of the Ministry of Education 3 ByteDance Inc. 4 SHI Labs@Georgia Tech |
| Pseudocode | Yes | Algorithm 1 Algorithm to Convert Character Set to 2D Point Set |
| Open Source Code | No | The paper discusses the source code of a third-party tool or platform that the authors used for a baseline method (SVDiff), but does not provide their own implementation code for ClassDiffusion or state that their code is open source. |
| Open Datasets | Yes | Datasets Following previous work [29, 66, 75], we conduct quantitative experiments on DreamBooth Dataset [66]. It contains 30 objects including both live objects and non-live objects. In addition, we used images from the Textual Inversion Dataset [20] and CustomConcept101 [38] in qualitative experiments. |
| Dataset Splits | No | The paper mentions using well-known datasets like the DreamBooth Dataset, Textual Inversion Dataset, and CustomConcept101, which likely have standard splits defined in their respective original works. However, this paper does not explicitly state the training, validation, or test splits (e.g., percentages, sample counts, or specific methodology) used for these datasets within its own text. |
| Hardware Specification | Yes | All experiments are conducted on 2 RTX4090 GPUs. |
| Software Dependencies | Yes | Our method is built on Stable Diffusion V1.5, with a learning rate of 10⁻⁶, and batch size 2 for fine-tuning. We used 500 optimization steps for a single concept and 800 for multiple concepts, respectively. During inference, the guidance scale is set to 6.0 and the inference steps are set to 100. The semantic preservation loss weight is set to 1.0 during all experiments. All experiments are conducted on 2 RTX4090 GPUs. Our method uses 6 min for the generation of single concepts and 11 min for the generation of multiple concepts. To better preserve the semantic space, we compute SPL between text embeddings embedded in the semantic space of the Stable Diffusion model. Therefore, we utilize the CLIP [61] text encoder from Stable Diffusion v1.5 [63], specifically clip-vit-large-patch14 [47], to extract the text embeddings of phrases. Following common practice, we use the End of Sequence (EOS) token to represent the semantics of embeddings. |
| Experiment Setup | Yes | Implementation details Our method is built on Stable Diffusion V1.5, with a learning rate of 10⁻⁶, and batch size 2 for fine-tuning. We used 500 optimization steps for a single concept and 800 for multiple concepts, respectively. During inference, the guidance scale is set to 6.0 and the inference steps are set to 100. The semantic preservation loss weight is set to 1.0 during all experiments. |
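The reported setup can be summarized in a small sketch: the hyperparameters quoted above collected into a config dict, plus a toy semantic preservation loss that compares the EOS-token embedding of the personalized phrase against that of the class phrase. This is an illustration only, assuming a cosine-distance form for the loss; the paper's exact SPL formulation and any `semantic_preservation_loss` helper name are assumptions, not the authors' code.

```python
import math

# Hyperparameters as reported in the reproducibility table above
# (Stable Diffusion v1.5 fine-tuning; names in this dict are illustrative).
CONFIG = {
    "base_model": "stable-diffusion-v1-5",
    "text_encoder": "clip-vit-large-patch14",
    "learning_rate": 1e-6,
    "batch_size": 2,
    "steps_single_concept": 500,
    "steps_multi_concept": 800,
    "guidance_scale": 6.0,
    "inference_steps": 100,
    "spl_weight": 1.0,
}

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_preservation_loss(eos_personalized, eos_class,
                               weight=CONFIG["spl_weight"]):
    """Toy SPL: penalize drift of the personalized phrase's EOS embedding
    away from the class phrase's EOS embedding (cosine-distance form,
    assumed here for illustration)."""
    return weight * (1.0 - cosine_similarity(eos_personalized, eos_class))
```

In practice the two EOS embeddings would come from the frozen CLIP text encoder named in the table (clip-vit-large-patch14), applied to the personalized prompt (e.g. "a sks dog") and its class prompt (e.g. "a dog").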