$\text{I}^2\text{AM}$: Interpreting Image-to-Image Latent Diffusion Models via Bi-Attribution Maps

Authors: Junseo Park, Hyeryung Jang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental To validate the effectiveness of the I2AM method, we conducted extensive experiments across various tasks and models, including object detection, inpainting, and super-resolution (Cheng et al., 2022; Yang et al., 2023a;b). Our results demonstrate that I2AM successfully captures critical attribution patterns in each task, offering valuable insights into the underlying generation process. Additionally, we introduce the Inpainting Mask Attention Consistency Score (IMACS) as a novel evaluation metric to assess the alignment between attribution maps and inpainting masks, which correlates strongly with existing performance metrics. Through extensive experiments, we show that I2AM enables model debugging and refinement, providing practical tools for improving I2I models' performance and interpretability.
Researcher Affiliation Academia Junseo Park and Hyeryung Jang Department of Computer Science & Artificial Intelligence, Dongguk University
Pseudocode No The paper includes mathematical formulas and block diagrams to illustrate concepts, but it does not contain any explicit sections or figures labeled as 'Pseudocode' or 'Algorithm', nor does it present structured, code-like steps for a method or procedure.
Open Source Code No The paper mentions 'Pre-trained models were obtained from their respective GitHub repositories' when discussing existing models used in the experiments. However, there is no explicit statement or link indicating that the authors have released the source code for their own proposed methodology (I2AM).
Open Datasets Yes Paint-by-Example (PBE) was trained on Open Images (Kuznetsova et al., 2020), which consists of 16 million bounding boxes for 600 object classes across 1.9 million images. StableVITON and DCI-VTON were trained on VITON-HD (Choi et al., 2021), a dataset for high-resolution (i.e., 1024×768) virtual try-on of clothing items. Specifically, it consists of 13,679 frontal-view woman and top-clothing image pairs, which are further split into 11,647/2,032 training/testing pairs. In this experiment, we evaluate I2AM's capability in object detection using images generated by the Paint-by-Example (PBE) model on the COCO (Lin et al., 2014) dataset.
Dataset Splits Yes StableVITON and DCI-VTON were trained on VITON-HD (Choi et al., 2021), a dataset for high-resolution (i.e., 1024×768) virtual try-on of clothing items. Specifically, it consists of 13,679 frontal-view woman and top-clothing image pairs, which are further split into 11,647/2,032 training/testing pairs.
Hardware Specification No The paper discusses various models and experimental setups but does not provide specific details regarding the hardware (e.g., GPU models, CPU specifications, or memory) used to conduct the experiments.
Software Dependencies No The paper mentions using 'Stable Diffusion v1.5' as the base for a custom model and the DDIM sampler (Song et al., 2022), but does not provide version numbers for other key software dependencies such as the programming language (e.g., Python) or deep learning framework (e.g., PyTorch, TensorFlow).
Experiment Setup Yes The model was trained for 90 epochs with a batch size of 64. DDIM was used for 50 steps with Tgroup = 5 and a CFG scale (s) of 5. The custom model has 8 attention heads (N = 8) and 9 cross-attention layers (L = 9), with SRAM visualizations focusing on layer 2. The loss weights are λDCML = 0.01, λTV = 0.0001, and λCWG = 2; these values represent the strength of each loss term during training. The standard deviation σ is set to 1.
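The reported hyperparameters can be collected into a single configuration sketch for anyone attempting to reproduce the setup. The field names below are hypothetical (the paper does not publish code); only the numeric values come from the experiment-setup description above.

```python
# Hypothetical configuration assembled from the hyperparameters reported in
# the paper's experiment setup. Field names are illustrative, not the
# authors' actual code identifiers.
from dataclasses import dataclass


@dataclass(frozen=True)
class I2AMExperimentConfig:
    epochs: int = 90                 # training epochs
    batch_size: int = 64
    ddim_steps: int = 50             # DDIM sampling steps
    t_group: int = 5                 # Tgroup
    cfg_scale: float = 5.0           # classifier-free guidance scale s
    num_heads: int = 8               # N attention heads
    num_cross_attn_layers: int = 9   # L cross-attention layers
    sram_layer: int = 2              # layer used for SRAM visualizations
    lambda_dcml: float = 0.01        # λDCML loss weight
    lambda_tv: float = 0.0001        # λTV loss weight
    lambda_cwg: float = 2.0          # λCWG loss weight
    sigma: float = 1.0               # standard deviation σ


config = I2AMExperimentConfig()
print(config.epochs, config.ddim_steps, config.cfg_scale)  # 90 50 5.0
```

A frozen dataclass keeps the reported values in one immutable place, which makes it easy to spot any deviation from the paper's stated setup during a reproduction attempt.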