ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features
Authors: Alec Helbling, Tuna Han Salih Meral, Benjamin Hoover, Pinar Yanardag, Duen Horng Chau
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the efficacy of CONCEPTATTENTION in a zero-shot semantic segmentation task on real-world images. We compare our interpretative maps against annotated segmentations to measure the accuracy and relevance of the attributions generated by our method. Our experiments and extensive comparisons demonstrate that CONCEPTATTENTION provides valuable insights... and CONCEPTATTENTION achieves state-of-the-art performance in zero-shot segmentation on benchmarks like ImageNet-Segmentation and PASCAL VOC across multiple DiT architectures. We perform several ablation studies to investigate the impact of various architectural choices and hyperparameters on the performance of CONCEPTATTENTION. |
| Researcher Affiliation | Collaboration | ¹Georgia Tech, ²Virginia Tech, ³IBM Research. Correspondence to: Alec Helbling <EMAIL>. |
| Pseudocode | Yes | A. More In-depth Explanation of ConceptAttention. We show pseudo-code depicting the difference between a vanilla multi-modal attention mechanism and a multi-modal attention mechanism with concept attention added to it. Figure 9. Pseudo-code depicting the (a) multi-modal attention operation used by Flux DiTs and (b) our CONCEPTATTENTION operation. |
| Open Source Code | Yes | Code: alechelbling.com/ConceptAttention/ |
| Open Datasets | Yes | This evaluation protocol centers around the ImageNet-Segmentation dataset (Guillaumin et al., 2014), and we extend this evaluation to the PASCAL VOC dataset (Everingham et al., 2015). |
| Dataset Splits | Yes | We investigate both a single class (930 images) and multi-class split (1,449 images) of this dataset. |
| Hardware Specification | No | No specific hardware details (GPU models, CPU models, etc.) used for running experiments are explicitly mentioned in the paper. |
| Software Dependencies | No | Flux DiT: For most of our experiments we use the Flux DiT architecture implemented in PyTorch (Paszke et al., 2019). |
| Experiment Setup | Yes | In our experiments we leverage the activations from the last 10 of the 18 MMATTN layers. ... Throughout the rest of our experiments we use timestep 500 out of 1000 following this result. |
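The pseudocode row above contrasts a vanilla multi-modal attention operation with the CONCEPTATTENTION variant (Figure 9 of the paper). A minimal sketch of that contrast is given below. The function names, tensor shapes, and the output-space saliency computation are illustrative assumptions, not the authors' implementation; the key property preserved here is that concept tokens read from image tokens while the image/text pathway never attends to concepts, leaving the generative forward pass unchanged.

```python
# Hedged sketch of Figure 9's contrast; names and shapes are assumptions.
import torch
import torch.nn.functional as F


def mm_attention(q_txt, k_txt, v_txt, q_img, k_img, v_img):
    """(a) Vanilla multi-modal attention as used by Flux DiTs:
    text and image tokens jointly attend over one concatenated sequence."""
    q = torch.cat([q_txt, q_img], dim=-2)
    k = torch.cat([k_txt, k_img], dim=-2)
    v = torch.cat([v_txt, v_img], dim=-2)
    out = F.scaled_dot_product_attention(q, k, v)
    n_txt = q_txt.shape[-2]
    return out[..., :n_txt, :], out[..., n_txt:, :]


def concept_attention(q_cpt, k_cpt, v_cpt, k_img, v_img):
    """(b) ConceptAttention-style side branch: concept tokens attend to
    themselves and to the image keys/values. Image and text tokens never
    attend to the concepts, so generation is unaffected."""
    k = torch.cat([k_cpt, k_img], dim=-2)
    v = torch.cat([v_cpt, v_img], dim=-2)
    return F.scaled_dot_product_attention(q_cpt, k, v)


def saliency_maps(o_cpt, o_img):
    """Per-patch concept saliency from the attention *output* space:
    similarity of image-patch outputs to concept outputs, softmaxed
    over concepts (illustrative aggregation)."""
    sim = o_img @ o_cpt.transpose(-1, -2)  # (..., n_patches, n_concepts)
    return sim.softmax(dim=-1)
```

Per the Experiment Setup row, such maps would be aggregated over the last 10 of the 18 multi-modal attention layers at timestep 500 of 1000.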