Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization

Authors: Yeji Song, Jimyeong Kim, Wonhark Park, Wonsik Shin, Wonjong Rhee, Nojun Kwak

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments Datasets. Since prevailing benchmark datasets (Ruiz et al. 2023; Kumari et al. 2023) primarily use generation prompts related to changing the texture or introducing new objects, they often fall short in effectively evaluating the crucial aspect of modifying subject poses. To address this limitation, we have constructed a new dataset, the Deformable Subject Set (DS set), to effectively assess the model's capability to modify a subject's pose. The DS set comprises 38 live animals from DreamBooth (Ruiz et al. 2023) and Custom Diffusion (Kumari et al. 2023), along with 11 prompts specifically designed to focus on the deformation of the subjects' poses. Furthermore, we also used the DreamBooth dataset (DB set) (Ruiz et al. 2023) to evaluate the model's capacity in typical scenarios. Metrics. Following DreamBooth (Ruiz et al. 2023), we measured subject fidelity using CLIP-I and DINO-I, and measured text alignment using CLIP-T. Qualitative Results. We present a comparative analysis of our method against the baselines in Figure 4. Quantitative Results. Table 1 and Figure 5 show quantitative analyses. User Study. We further evaluate our method through a user study conducted on Amazon Mechanical Turk. Ablations. Our orchestration eliminates the conflicting elements in the visual embedding that interfere with the textual embedding.
Researcher Affiliation Academia Seoul National University
Pseudocode No The paper describes its methodology using mathematical equations (Eq. 3, 4, 5) and textual explanations, but does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code No The paper mentions an 'Extended version https://arxiv.org/abs/2403.14155' which points to an arXiv preprint, but there is no explicit statement about open-sourcing the code for the described methodology or a link to a code repository.
Open Datasets Yes The DS set comprises 38 live animals from DreamBooth (Ruiz et al. 2023) and Custom Diffusion (Kumari et al. 2023), along with 11 prompts specifically designed to focus on the deformation of the subjects' poses. Furthermore, we also used the DreamBooth dataset (DB set) (Ruiz et al. 2023) to evaluate the model's capacity in typical scenarios.
Dataset Splits No The paper describes the composition and sources of the datasets used (DS set and DB set) but does not provide specific details on how these datasets were split into training, validation, or test sets (e.g., percentages, sample counts, or references to predefined splits).
Hardware Specification No The paper does not provide any specific details about the hardware used for running its experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies No The paper does not list specific versions of software dependencies, libraries, or programming languages used in the experiments.
Experiment Setup No The paper describes the proposed methods, qualitative and quantitative results, and ablation studies, but it does not specify concrete experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs), optimizer settings, or training configurations for their models.
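For context on the metrics cited in the Research Type row: CLIP-I, DINO-I, and CLIP-T all reduce to cosine similarities over encoder embeddings. Below is a minimal sketch of how such scores are typically computed, assuming the image and text embeddings have already been produced by the respective encoders (CLIP or DINO); the function names and inputs here are illustrative, not the paper's actual evaluation code.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def image_fidelity(gen_embs, ref_embs):
    # CLIP-I / DINO-I style score: mean pairwise cosine similarity
    # between generated-image and reference-image embeddings.
    return float(np.mean([[cosine_sim(g, r) for r in ref_embs]
                          for g in gen_embs]))

def text_alignment(gen_embs, text_emb):
    # CLIP-T style score: mean cosine similarity between each
    # generated-image embedding and the prompt's text embedding.
    return float(np.mean([cosine_sim(g, text_emb) for g in gen_embs]))
```

In practice the embeddings would come from the CLIP image/text encoders (for CLIP-I and CLIP-T) or a DINO image backbone (for DINO-I), with scores averaged over all generated samples per subject-prompt pair.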