TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models

Authors: Xin Jin, Yichuan Zhong, Yapeng Tian

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. "4 Experiments. 4.1 Implementation Details. Model Architecture. All experiments employ SD-XL (Podell et al., 2023) as the diffusion backbone. 4.2 Comparisons with SOTA Models. Quantitative Evaluation of Object Replacement and Blending. Table 1 presents BOM scores for 800 replacement-blend pairs. 4.3 Ablation Study. Ablation Study on CAOF. To examine how CAOF controls the fusion strength, we vary the blending coefficient w0 ∈ [0.1, 0.9] (Eq. 9) and record the CLIP similarities for the original (O), replaced (R), and blend (B) prompts."
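The quoted ablation sweeps a blending coefficient w0 that controls fusion strength. A minimal sketch of what such a coefficient does, assuming a simple convex combination of attention feature maps (the function name and tensor layout are hypothetical, not the paper's code):

```python
import numpy as np

def caof_blend(attn_original, attn_blend, w0=0.5):
    """Hypothetical sketch: fuse two attention feature maps.

    w0 is swept over [0.1, 0.9] in the quoted ablation; larger w0
    gives the blend-object features more influence.
    """
    assert 0.0 <= w0 <= 1.0, "blending coefficient must lie in [0, 1]"
    return (1.0 - w0) * attn_original + w0 * attn_blend
```

This is the simplest reading of a scalar fusion coefficient; the paper's Eq. 9 may combine the features differently.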
Researcher Affiliation: Collaboration. Xin Jin (Gen Pi Inc.), Yichuan Zhong (Gen Pi Inc.), Yapeng Tian (The University of Texas at Dallas).
Pseudocode: No. The paper describes its methods, CAOF and SASF, using prose, mathematical equations (Eqs. 1–19), and flowcharts (Figures 4 and 6); it does not contain any structured pseudocode or algorithm blocks.
Open Source Code: Yes. "Code Availability. Code is available at https://github.com/felixxinjin1/TP-Blend."
Open Datasets: Yes. "For our evaluation, we assembled a diverse set of high-resolution, publicly available images from Unsplash (https://unsplash.com/), following the same practice as prior work such as SLIDE (Jampani et al., 2021) and Text-driven Image Editing via Learnable Regions (Lin et al., 2024). The test dataset consists of 4,000 samples, created by pairing 40 base images with 20 distinct replace-blend object combinations and 5 distinct blend styles."
Dataset Splits: Yes. "The test dataset consists of 4,000 samples, created by pairing 40 base images with 20 distinct replace-blend object combinations and 5 distinct blend styles."
Hardware Specification: No. The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for its experiments; it only mentions "SD-XL Podell et al. (2023) as the diffusion backbone," which names a model, not the hardware it ran on.
Software Dependencies: No. The paper mentions "SD-XL Podell et al. (2023)" as the diffusion backbone and algorithms such as Classifier-Free Guidance (CFG), DDIM inversion (Song et al., 2020), Optimal Transport, and the Sinkhorn algorithm (Cuturi, 2013; Peyré et al., 2019; Genevay et al., 2016), but it does not provide version numbers for any software libraries, programming languages, or other ancillary dependencies used for implementation.
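The Sinkhorn algorithm named above is a standard routine for entropic-regularised optimal transport; a minimal NumPy version with uniform marginals, independent of the paper's implementation, looks like this:

```python
import numpy as np

def sinkhorn(cost, gamma=0.1, n_iters=200):
    """Entropic-regularised OT plan between two uniform marginals.

    cost:  (n, m) cost matrix.
    gamma: regularisation strength (the report quotes gamma = 0.1).
    """
    n, m = cost.shape
    K = np.exp(-cost / gamma)           # Gibbs kernel
    a = np.full(n, 1.0 / n)             # uniform source marginal
    b = np.full(m, 1.0 / m)             # uniform target marginal
    u = np.ones(n)
    for _ in range(n_iters):            # alternating scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan P
```

The returned plan satisfies the source marginal exactly (the loop ends on a u-update) and the target marginal up to the tolerance reached after `n_iters` iterations.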
Experiment Setup: Yes. "During the forward denoising pass we apply, at every timestep: (i) TIE-CFG for object replacement (positive guidance on the target prompt, negative on the original); (ii) CAOF to transport blend-object features into attention positions selected by the joint percentile thresholds τsource = τdest ∈ {0.6, 0.7}; and (iii) SASF to inject style via DSIN and key-value substitution. The Sinkhorn regulariser is fixed to γ = 0.1, with cost weights λfeature = 0.7 and λspatial = 0.3 (Eq. 10)."
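The quoted setup fixes cost weights λfeature = 0.7 and λspatial = 0.3 (Eq. 10) and percentile thresholds τ ∈ {0.6, 0.7}. A sketch of how such a weighted cost and percentile selection could be built, where the particular distance choices (cosine for features, Euclidean for positions) are assumptions rather than the paper's stated definitions:

```python
import numpy as np

def combined_cost(feat_src, feat_dst, pos_src, pos_dst,
                  lam_feature=0.7, lam_spatial=0.3):
    """Hypothetical weighted OT cost; the lambda weights match the
    quoted setup, the distance choices are assumed."""
    f = feat_src / np.linalg.norm(feat_src, axis=1, keepdims=True)
    g = feat_dst / np.linalg.norm(feat_dst, axis=1, keepdims=True)
    c_feat = 1.0 - f @ g.T                              # cosine distance
    c_spat = np.linalg.norm(                            # Euclidean distance
        pos_src[:, None, :] - pos_dst[None, :, :], axis=-1)
    return lam_feature * c_feat + lam_spatial * c_spat

def percentile_mask(attn, tau=0.6):
    """Keep attention positions at or above the tau-quantile
    (tau_source = tau_dest in {0.6, 0.7} in the quoted setup)."""
    return attn >= np.quantile(attn, tau)
```

Only positions passing `percentile_mask` on both source and destination sides would participate in the transport, which matches the report's description of "attention positions selected by the joint percentile thresholds."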