ToddlerDiffusion: Interactive Structured Image Generation with Cascaded Schrödinger Bridge

Authors: Eslam Abdelrahman, Liangbing Zhao, Tao Hu, Matthieu Cord, Patrick Perez, Mohamed Elhoseiny

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on diverse datasets, including LSUN-Churches, ImageNet, CelebA-HQ, and LAION-Art, demonstrate the efficacy of our approach, consistently outperforming state-of-the-art methods. For instance, ToddlerDiffusion achieves notable efficiency, matching LDM performance on LSUN-Churches while operating 2× faster with a 3× smaller architecture.
Researcher Affiliation | Collaboration | Eslam Abdelrahman (1), Liangbing Zhao (1), Vincent Tao Hu (2), Matthieu Cord (3), Patrick Perez (4), Mohamed Elhoseiny (1). Affiliations: (1) KAUST, (2) LMU, (3) Valeo AI, (4) Kyutai.
Pseudocode | Yes | Algorithm 1: Training Pipeline; Algorithm 2: Sampling Pipeline
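The paper's Algorithms 1 and 2 are not reproduced in this report. As a purely illustrative sketch of the cascaded, stage-wise structure being assessed (not the paper's actual Schrödinger-bridge formulation), the two-stage idea can be caricatured with a deterministic toy transport step; `bridge_step`, the targets, and the step counts below are all hypothetical:

```python
import numpy as np

def bridge_step(x, target, t, T):
    # Move x a fraction of the way toward the target: a crude,
    # deterministic stand-in for one bridge/denoising step.
    return x + (target - x) / (T - t)

def run_stage(x, target, T):
    # Iteratively transport x toward this stage's target over T steps.
    for t in range(T):
        x = bridge_step(x, target, t, T)
    return x

rng = np.random.default_rng(0)
noise = rng.normal(size=(8, 8))

# Hypothetical stage targets: a binary "sketch", then a full "image".
sketch_target = (rng.random((8, 8)) > 0.5).astype(float)
image_target = rng.random((8, 8))

# Stage 1: noise -> sketch (in the paper, a very small model with few steps).
sketch = run_stage(noise, sketch_target, T=10)

# Stage 2: sketch -> image, i.e. the second stage starts from the
# structured stage-1 output instead of pure noise.
image = run_stage(sketch, image_target, T=10)

print(np.allclose(image, image_target))  # → True
```

This only mirrors the report's claim that each stage transports the previous stage's output toward a new target; the real method learns stochastic transitions rather than this closed-form interpolation.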
Open Source Code | No | The project website is available at: https://toddlerdiffusion.github.io/website/ The provided URL points to a project website, which is not an explicit code repository link or a statement of code release. The criteria specify that a project demonstration page or high-level project overview page is insufficient unless it directly hosts the source code or explicitly links to a repository for the methodology.
Open Datasets | Yes | Extensive experiments on datasets such as LSUN-Churches (Yu et al., 2015), ImageNet, and CelebA-HQ (Karras et al., 2017) demonstrate the effectiveness of this approach, consistently outperforming existing methods. ToddlerDiffusion achieves notable efficiency, matching LDM performance on LSUN-Churches while operating 2× faster with a 3× smaller architecture. The project website is available at: https://toddlerdiffusion.github.io/website/
Dataset Splits | No | The paper mentions specific datasets (e.g., LSUN-Churches, CelebA-HQ, ImageNet-100) and training durations (e.g., 600, 250, and 350 epochs), but it does not explicitly provide training/validation/test split percentages, absolute sample counts per split, or the methodology used to partition the datasets. While standard splits may be implied for these well-known datasets, the paper neither states them explicitly nor cites resources for the exact splits used in its experiments.
Hardware Specification | Yes | The training time is reported until convergence, i.e., 600 epochs, using 4 NVIDIA A100 GPUs. The training time per epoch is calculated using 4 NVIDIA RTX A6000 GPUs. The sampling time per frame is calculated using one NVIDIA RTX A6000 with a batch size of 32.
Software Dependencies | No | The paper mentions several tools and models, such as VQGAN, UNet, PiDiNet, EDTER, Canny, Laplacian, FastSAM, the Inception model, DDPM-inversion, ControlNet, SDEdit, LoRA, and DDIM. However, it does not specify any programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other software components with their corresponding version numbers, which are necessary for full reproducibility of the software environment.
Experiment Setup | Yes | For instance, LDM is trained using 1k steps; in contrast, ToddlerDiffusion can be trained using only 10 steps with minimal impact on generation fidelity.

Dataset      Method  Epochs  Reported metrics (column labels lost in extraction)
(unlabeled)  LDM     600     8.15   0.013  0.52  0.41
(unlabeled)  Ours    600     7.10   0.009  0.61  0.47
Churches     LDM     250     7.30   0.009  0.59  0.39
Churches     Ours    250     6.19   0.005  0.71  0.44
ImageNet     LDM     350     8.55   0.015  0.51  0.32
ImageNet     Ours    350     7.8    0.010  0.58  0.40

Starting from the converged model trained on the CelebA-HQ dataset for 600 epochs, we train our method and ControlNet for an additional 50 epochs with the sketch as a condition. The 1st stage is trained for 200 steps, so s = 200 means we omit the sketch and feed pure noise. We have trained the 1st stage for 1K epochs, as it is very small (only 5M parameters) and the dataset scale is very small; thus the 1K epochs take less than 12 hours using a single A100 GPU. Then we train the 2nd stage (141 million parameters) for only 200 epochs. We train the model starting from the SDv1.5 weights for only five epochs.
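The staged schedule quoted above can be consolidated into a single structure for reference. The field names below are invented for illustration; only the numeric values come from the quoted text:

```python
# Hypothetical consolidation of the training schedule reported above;
# key names are made up, values are taken verbatim from the quoted text.
schedule = {
    "stage1": {                  # sketch stage: very small model
        "params_millions": 5,
        "epochs": 1000,          # "1K epochs ... less than 12 hours on a single A100"
        "diffusion_steps": 200,  # s = 200 means the sketch is omitted (pure noise)
    },
    "stage2": {                  # image stage
        "params_millions": 141,
        "epochs": 200,
    },
    "controlnet_comparison": {
        "init": "model converged on CelebA-HQ at 600 epochs",
        "extra_epochs_with_sketch_condition": 50,
    },
}

total_epochs = schedule["stage1"]["epochs"] + schedule["stage2"]["epochs"]
print(total_epochs)  # → 1200
```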