Upcycling Text-to-Image Diffusion Models for Multi-Task Capabilities
Authors: Ruchika Chavhan, Abhinav Mehrotra, Malcolm Chadwick, Alberto Gil Couto Pimentel Ramos, Luca Morreale, Mehdi Noroozi, Sourav Bhattacharya
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that the performance of MTU is on par with the single-task fine-tuned diffusion models across several tasks including image editing, super-resolution, and inpainting, while maintaining similar latency and computational load (GFLOPs) as the single-task fine-tuned models. |
| Researcher Affiliation | Industry | 1Samsung AI Center, Cambridge. Correspondence to: Ruchika Chavhan <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using text and equations (e.g., Section 5: Methodology: Multi-task Upcycling for Diffusion Models, and Equation 1, 2) and provides an overview diagram in Figure 3. However, there are no clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | T2I: We use the COCO Captions dataset (Lin et al., 2015), a large collection of image-text pairs. Image Editing: We use the dataset introduced in (Brooks et al., 2023), which provides input and target images along with corresponding editing instructions. Super Resolution: We use the Real-ESRGAN dataset (Wang et al., 2021), which consists of high-resolution images. Image Inpainting: We use the dataset from (Yildirim et al., 2023), which provides a multi-modal inpainting dataset designed for object removal based on text prompts. Built on the GQA dataset (Hudson & Manning, 2019), it leverages scene graphs to generate paired training data using state-of-the-art instance segmentation and inpainting techniques. |
| Dataset Splits | Yes | T2I: COCO Captions (Li et al., 2017), train 118287 / val 5000 / test 5000. Image Editing: InstructPix2Pix (Brooks et al., 2023), train 281709 / val 31301 / test 2000. Super Resolution: Real-ESRGAN (Wang et al., 2021), train 23744 / val 100 / test 100. Inpainting: GQA-Inpaint (Yildirim et al., 2023), train 90089 / val 10009 / test 5553. |
| Hardware Specification | Yes | Both models are trained on 8 A100 GPUs for 100 epochs, with a batch size of 16 per GPU and image resolution of 512×512. |
| Software Dependencies | No | The paper mentions optimizers (AdamW and Adam) but does not provide version numbers for software libraries, programming languages, or other tools used for implementation. |
| Experiment Setup | Yes | Both models are trained on 8 A100 GPUs for 100 epochs, with a batch size of 16 per GPU and image resolution of 512×512. SDXL is optimized using AdamW with a learning rate of 5e-5, while SDv1.5 is trained using Adam with a learning rate of 1e-4. For SDXL, we find that using a weight decay of 0.01 helps stabilize training. During sampling, we perform denoising for 20 iterations in multi-task SDv1.5 and 50 iterations in SDXL. For Text-to-Image (T2I) generation, Image Editing, and Inpainting, we apply Classifier-Free Guidance (CFG) (Ho & Salimans, 2022). However, for Super-Resolution (SR), no CFG is used, as it only processes an empty string as input. For T2I generation, we use a guidance scale of 7.5 for SDv1.5 and 5.0 for SDXL. For Image Editing and Inpainting, we follow the CFG strategy from (Brooks et al., 2023), which employs dual guidance scales, one for the image and another for the text. For SDv1.5, we use an image guidance scale of 1.6 and a text guidance scale of 7.5 for Image Editing, and 1.5 and 4.0 for Inpainting, respectively. For SDXL, we set the image guidance scale to 1.5 and the text guidance scale to 10.0 for Image Editing, while for Inpainting, we use 1.5 for image and 4.0 for text. |
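The dual-guidance CFG mentioned in the experiment setup (image and text guidance scales, following Brooks et al., 2023) combines three denoiser predictions at each sampling step. The sketch below illustrates that combination with NumPy; the function name and array shapes are illustrative assumptions, not code from the paper.

```python
import numpy as np

def dual_cfg(eps_uncond, eps_img, eps_full, s_img, s_txt):
    """Combine three noise predictions using separate image/text guidance scales.

    eps_uncond: prediction with neither image nor text conditioning
    eps_img:    prediction with image conditioning only
    eps_full:   prediction with both image and text conditioning
    """
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_full - eps_img))

# Example with the paper's SDv1.5 Image Editing scales
# (image guidance 1.6, text guidance 7.5) on dummy latents.
rng = np.random.default_rng(0)
e0, e1, e2 = (rng.standard_normal((4, 64, 64)) for _ in range(3))
guided = dual_cfg(e0, e1, e2, s_img=1.6, s_txt=7.5)
```

With both scales set to 1.0 the expression telescopes back to the fully conditioned prediction, which is a quick sanity check that the terms are ordered correctly.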