Upcycling Text-to-Image Diffusion Models for Multi-Task Capabilities
Authors: Ruchika Chavhan, Abhinav Mehrotra, Malcolm Chadwick, Alberto Gil Couto Pimentel Ramos, Luca Morreale, Mehdi Noroozi, Sourav Bhattacharya
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that the performance of MTU is on par with the single-task fine-tuned diffusion models across several tasks including image editing, super-resolution, and inpainting, while maintaining similar latency and computational load (GFLOPs) as the single-task fine-tuned models. |
| Researcher Affiliation | Industry | 1Samsung AI Center, Cambridge. Correspondence to: Ruchika Chavhan <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using text and equations (e.g., Section 5: Methodology: Multi-task Upcycling for Diffusion Models, and Equation 1, 2) and provides an overview diagram in Figure 3. However, there are no clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | T2I: We use the COCO Captions dataset (Lin et al., 2015), a large collection of image-text pairs. Image Editing: We use the dataset introduced in (Brooks et al., 2023), which provides input and target images along with corresponding editing instructions. Super Resolution: We use the Real-ESRGAN dataset (Wang et al., 2021), which consists of high-resolution images. Image Inpainting: We use the dataset from (Yildirim et al., 2023), which provides a multi-modal inpainting dataset designed for object removal based on text prompts. Built on the GQA dataset (Hudson & Manning, 2019), it leverages scene graphs to generate paired training data using state-of-the-art instance segmentation and inpainting techniques. |
| Dataset Splits | Yes | T2I: COCO Captions (Li et al., 2017), train 118287 / val 5000 / test 5000. Image Editing: InstructPix2Pix (Brooks et al., 2023), train 281709 / val 31301 / test 2000. Super Resolution: Real-ESRGAN (Wang et al., 2021), train 23744 / val 100 / test 100. Inpainting: GQA-Inpaint (Yildirim et al., 2023), train 90089 / val 10009 / test 5553. |
| Hardware Specification | Yes | Both models are trained on 8 A100 GPUs for 100 epochs, with a batch size of 16 per GPU and image resolution of 512×512. |
| Software Dependencies | No | The paper mentions optimizers (AdamW and Adam) but does not provide version numbers for software libraries, programming languages, or other tools used for implementation. |
| Experiment Setup | Yes | Both models are trained on 8 A100 GPUs for 100 epochs, with a batch size of 16 per GPU and image resolution of 512×512. SDXL is optimized using AdamW with a learning rate of 5e-5, while SDv1.5 is trained using Adam with a learning rate of 1e-4. For SDXL, we find that using a weight decay of 0.01 helps stabilize training. During sampling, we perform denoising for 20 iterations in multi-task SDv1.5 and 50 iterations in SDXL. For Text-to-Image (T2I) generation, Image Editing, and Inpainting, we apply Classifier-Free Guidance (CFG) (Ho & Salimans, 2022). However, for Super-Resolution (SR), no CFG is used, as it only processes an empty string as input. For T2I generation, we use a guidance scale of 7.5 for SDv1.5 and 5.0 for SDXL. For Image Editing and Inpainting, we follow the CFG strategy from (Brooks et al., 2023), which employs dual guidance scales, one for the image and another for the text. For SDv1.5, we use an image guidance scale of 1.6 and a text guidance scale of 7.5 for Image Editing, and 1.5 and 4.0 for Inpainting, respectively. For SDXL, we set the image guidance scale to 1.5 and the text guidance scale to 10.0 for Image Editing, while for Inpainting, we use 1.5 for image and 4.0 for text. |
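The dual-guidance CFG mentioned in the experiment setup (image and text guidance scales, following Brooks et al., 2023) combines three denoiser predictions at each sampling step. The sketch below illustrates that combination with NumPy; the function name and array shapes are illustrative assumptions, not code from the paper.

```python
import numpy as np

def dual_cfg(eps_uncond, eps_img, eps_full, s_img, s_txt):
    """Combine three noise predictions using separate image/text guidance scales.

    eps_uncond: prediction with neither image nor text conditioning
    eps_img:    prediction with image conditioning only
    eps_full:   prediction with both image and text conditioning
    """
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_full - eps_img))

# Example with the paper's SDv1.5 Image Editing scales
# (image guidance 1.6, text guidance 7.5) on dummy latents.
rng = np.random.default_rng(0)
e0, e1, e2 = (rng.standard_normal((4, 64, 64)) for _ in range(3))
guided = dual_cfg(e0, e1, e2, s_img=1.6, s_txt=7.5)
```

With both scales set to 1.0 the expression telescopes back to the fully conditioned prediction, which is a quick sanity check that the terms are ordered correctly.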