Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA

Authors: James Seale Smith, Yen-Chang Hsu, Lingyu Zhang, Ting Hua, Zsolt Kira, Yilin Shen, Hongxia Jin

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we start by analyzing existing customized diffusion methods in the popular Stable Diffusion model (Rombach et al., 2022), showing that these models catastrophically fail for sequentially arriving fine-grained concepts (we specifically use human faces and landmarks). ... We show that C-LoRA not only outperforms several baselines for our proposed setting of text-to-image continual customization, which we refer to as Continual Diffusion, but that we achieve a new state-of-the-art in the well-established rehearsal-free continual learning setting for image classification. ... Qualitative results are shown in Figure 4 showing samples from task 1, 6, and 10 after training all 10 tasks, while quantitative results are given in Table 1.
Researcher Affiliation | Collaboration | James Seale Smith (Samsung Research America; Georgia Institute of Technology), Yen-Chang Hsu (Samsung Research America), Lingyu Zhang (Samsung Research America), Ting Hua (Samsung Research America), Zsolt Kira (Georgia Institute of Technology), Yilin Shen (Samsung Research America), Hongxia Jin (Samsung Research America)
Pseudocode | No | The paper describes the C-LoRA method, self-regularization, and customized token strategy in sections 3.1, 3.2, and 3.3 respectively, using mathematical formulas and descriptive text. However, it does not present these as a clearly labeled pseudocode or algorithm block.
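Since the paper gives the method only as formulas and prose, a minimal NumPy sketch of C-LoRA's self-regularization idea (penalizing new low-rank updates in weight positions that past LoRA deltas have already changed) might look like the following. The function name, the exact penalty form, and the toy shapes are assumptions based on the paper's description, not the authors' implementation.

```python
import numpy as np

def c_lora_self_reg(past_deltas, A_new, B_new, lam=1e8):
    """Hypothetical sketch of C-LoRA-style self-regularization.

    Penalizes the new low-rank update A_new @ B_new elementwise,
    weighted by how much past LoRA deltas already changed each
    weight. `lam` mirrors the paper's reported loss weight of 1e8.
    """
    # Accumulated magnitude of all past LoRA weight changes.
    past_change = np.abs(sum(past_deltas))
    # New updates are cheap where past change was small,
    # expensive where it was large.
    return lam * np.sum((past_change * (A_new @ B_new)) ** 2)

# Toy usage: rank-2 LoRA factors for a 4x4 weight matrix.
rng = np.random.default_rng(0)
past = [rng.normal(size=(4, 4))]
A, B = rng.normal(size=(4, 2)), rng.normal(size=(2, 4))
loss = c_lora_self_reg(past, A, B)
```

With no past tasks (all-zero deltas) the penalty vanishes, so the first task trains as plain LoRA; later tasks are steered toward weight positions the earlier tasks left untouched.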
Open Source Code | No | The paper states: "For additional analysis on the efficacy for LoRA (Hu et al., 2021) for text-to-image diffusion, we suggest the implementation by Ryu, which was a concurrent project to ours." This refers to a third-party implementation and not the authors' own source code for the methodology described in the paper.
Open Datasets | Yes | We first benchmark our method using the 512x512 resolution (self-generated) celebrity faces dataset, CelebFaces Attributes (CelebA) HQ (Karras et al., 2017; Liu et al., 2015). ... As an additional dataset, we demonstrate the generality of our method and introduce an additional dataset with a different domain, benchmarking on a 10 length task sequence using the Google Landmarks dataset v2 (Weyand et al., 2020). ... We benchmark our approach using ImageNet-R (Hendrycks et al., 2021; Wang et al., 2022b).
Dataset Splits | No | For the CelebA-HQ dataset, the paper states: "We sample 10 celebrities at random which have at least 15 individual training images each. Each celebrity customization is considered a task". For ImageNet-R, it mentions "10 tasks (20 classes per task)", "5-task", and "20 task" sequences. While these describe how tasks are defined or sampled, the paper does not provide training/validation/test split percentages, absolute sample counts, or explicit references to predefined splits within each task, which would be needed for direct reproduction.
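The task construction the paper does describe (e.g. ImageNet-R partitioned into 10 tasks of 20 classes each) can be sketched as a simple disjoint partition; the class count of 200 and the shuffling seed below are illustrative assumptions, since the paper does not state how classes are assigned to tasks.

```python
import random

def make_task_sequence(num_classes=200, num_tasks=10, seed=0):
    """Partition class labels into disjoint continual-learning tasks,
    e.g. ImageNet-R's 200 classes into 10 tasks of 20 classes each."""
    classes = list(range(num_classes))
    random.Random(seed).shuffle(classes)  # assumed random class-to-task assignment
    per_task = num_classes // num_tasks
    return [classes[i * per_task:(i + 1) * per_task] for i in range(num_tasks)]

tasks = make_task_sequence()
```

Note this only defines the task sequence; it says nothing about per-task train/val/test splits, which is exactly the missing detail the row above flags.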
Hardware Specification | Yes | We use 2 A100 GPUs to generate all results.
Software Dependencies | No | We implement our method and all baselines in PyTorch (Paszke et al., 2019). While PyTorch is mentioned, no specific version number is provided, nor are any other software dependencies with version numbers.
Experiment Setup | Yes | For the most part, we use the same implementation details as Custom Diffusion (Kumari et al., 2022) with 2000 training iterations... For LoRA, we searched for the rank using a simple exponential sweep and found that a rank of 16 sufficiently learns all concepts. Additional training details are located in Appendix C. ... We found a learning rate of 5e-6 worked best for all non-LoRA methods, and a learning rate of 5e-4 worked best for our LoRA methods. We found a loss weight of 1e6 and 1e8 worked best for EWC (Kirkpatrick et al., 2017) and C-LoRA respectively. ... We found a rank of 16 was sufficient for LoRA for the text-to-image experiments, and 64 for the image classification experiments. ... All images are generated with a 512x512 resolution, and we train for 2000 steps on the face datasets and 500 steps on the waterfall datasets.
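Collected in one place, the reported hyperparameters could be captured in a config like the one below. The dictionary structure and key names are illustrative assumptions; the values themselves come directly from the quoted setup.

```python
# Hyperparameters reported in the paper's experiment setup.
# Structure and key names are illustrative, not from the authors.
CONFIG = {
    "text_to_image": {
        "lora_rank": 16,
        "lr_lora": 5e-4,          # LoRA-based methods
        "lr_non_lora": 5e-6,      # all non-LoRA methods
        "loss_weight_ewc": 1e6,   # EWC regularization weight
        "loss_weight_c_lora": 1e8,
        "train_steps_faces": 2000,
        "train_steps_waterfalls": 500,
        "resolution": 512,
    },
    "image_classification": {
        "lora_rank": 64,
    },
}
```

Such a consolidated config is what a reproduction attempt would need; note it still omits unstated details (optimizer, batch size, exact schedules) deferred to the paper's Appendix C.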