TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models
Authors: Xin Jin, Yichuan Zhong, Yapeng Tian
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4.1 Implementation Details: "Model Architecture. All experiments employ SD-XL Podell et al. (2023) as the diffusion backbone." 4.2 Comparisons with SOTA models: "Quantitative Evaluation of Object Replacement and Blending. Table 1 presents BOM scores for 800 replacement–blend pairs." 4.3 Ablation Study: "Ablation Study on CAOF. To examine how CAOF controls the fusion strength, we vary the blending coefficient w0 ∈ [0.1, 0.9] (Eq. 9) and record the CLIP similarities for the original (O), replaced (R), and blend (B) prompts." |
| Researcher Affiliation | Collaboration | Xin Jin (Gen Pi Inc.); Yichuan Zhong (Gen Pi Inc.); Yapeng Tian (The University of Texas at Dallas) |
| Pseudocode | No | The paper describes the methods, CAOF and SASF, using prose, mathematical equations (Eqs. 1–19), and flowcharts (Figure 4 and Figure 6). It does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code Availability. Code is available at https://github.com/felixxinjin1/TP-Blend. |
| Open Datasets | Yes | For our evaluation, we assembled a diverse set of high-resolution, publicly available images from Unsplash (https://unsplash.com/), following the same practice as prior work such as SLIDE Jampani et al. (2021) and Text-driven Image Editing via Learnable Regions Lin et al. (2024). The test dataset consists of 4,000 samples, created by pairing 40 base images with 20 distinct replace-blend object combinations and 5 distinct blend styles. |
| Dataset Splits | Yes | The test dataset consists of 4,000 samples, created by pairing 40 base images with 20 distinct replace-blend object combinations and 5 distinct blend styles. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It only mentions using "SD-XL Podell et al. (2023) as the diffusion backbone," which refers to a model, not the hardware it ran on. |
| Software Dependencies | No | The paper mentions using "SD-XL Podell et al. (2023) as the diffusion backbone" and algorithms like "Classifier-Free Guidance (CFG)", "DDIM inversion Song et al. (2020)", "Optimal Transport", and the "Sinkhorn algorithm Cuturi (2013); Peyré et al. (2019); Genevay et al. (2016)". However, it does not provide specific version numbers for any software libraries, programming languages, or other ancillary software dependencies used for implementation. |
| Experiment Setup | Yes | During the forward denoising pass we apply, at every timestep: (i) TIE-CFG for object replacement (positive guidance on the target prompt, negative on the original); (ii) CAOF to transport blend-object features into attention positions selected by the joint percentile thresholds τsource = τdest ∈ {0.6, 0.7}; and (iii) SASF to inject style via DSIN and key-value substitution. The Sinkhorn regulariser is fixed to γ = 0.1, with cost weights λfeature = 0.7 and λspatial = 0.3 (Eq. 10). |
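The setup row fixes the Sinkhorn regulariser to γ = 0.1 with cost weights λfeature = 0.7 and λspatial = 0.3 (Eq. 10). A minimal sketch of how such an entropically regularised transport plan could be computed is below; this is a generic Sinkhorn loop in the style of Cuturi (2013), not the paper's implementation, and the function name, uniform marginals, and toy cost matrices are assumptions.

```python
import numpy as np

def sinkhorn_transport(feat_cost, spat_cost,
                       lam_feature=0.7, lam_spatial=0.3,
                       gamma=0.1, n_iters=200):
    """Entropic-OT transport plan between source and destination positions.

    Hypothetical sketch: combines a feature-distance cost and a spatial-
    distance cost with the weights reported in the paper (Eq. 10), then
    runs plain Sinkhorn iterations with regulariser gamma.
    """
    C = lam_feature * feat_cost + lam_spatial * spat_cost
    K = np.exp(-C / gamma)                            # Gibbs kernel
    n, m = C.shape
    a = np.full(n, 1.0 / n)                           # assumed uniform marginals
    b = np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                          # alternating projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                # transport plan P

# Toy usage: 4 source and 4 destination attention positions.
rng = np.random.default_rng(0)
feat_cost = rng.random((4, 4))
spat_cost = rng.random((4, 4))
P = sinkhorn_transport(feat_cost, spat_cost)
# P's row/column sums approximate the uniform marginals a and b.
```

The transport plan `P` then tells each source position how much of its feature mass to send to each destination position; how CAOF consumes the plan is described in the paper itself.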
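The ablation evidence quotes a sweep of the blending coefficient w0 over [0.1, 0.9]. Assuming Eq. 9 is a convex combination of the replaced-object and blend-object features (the paper's exact form may differ; the function name and stand-in arrays are hypothetical), the sweep could be sketched as:

```python
import numpy as np

def blend_features(replaced_feat, blend_feat, w0):
    """Hypothetical convex combination; the paper's Eq. 9 may differ."""
    return (1.0 - w0) * replaced_feat + w0 * blend_feat

replaced = np.zeros(4)   # stand-in replaced-object features
blended = np.ones(4)     # stand-in blend-object features
outputs = {w0: blend_features(replaced, blended, w0)
           for w0 in [0.1, 0.3, 0.5, 0.7, 0.9]}
# Larger w0 pulls the result toward the blend-object features;
# the ablation scores each setting with CLIP similarity to the
# original (O), replaced (R), and blend (B) prompts.
```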