RB-Modulation: Training-Free Stylization using Reference-Based Modulation

Authors: Litu Rout, Yujia Chen, Nataniel Ruiz, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evidence (Sec. 6, Experiments): "Metrics: Evaluating stylized synthesis is challenging due to the subjective nature of style, making simple metrics inadequate. We follow a two-step approach: first using metrics from prior works and then conducting human evaluation. To evaluate prompt-image alignment, we use CLIP-T score (Hertz et al., 2023; Sohn et al., 2023; Wang et al., 2024a) and ImageReward (Xu et al., 2024), which also consider human aesthetics, distortions, and object completeness. When a style description is provided, CLIP-T and ImageReward also capture style alignment. We assess style similarity using DINO (Caron et al., 2021) and content similarity using CLIP-I (Radford et al., 2021) as in prior work (Hertz et al., 2023; Ruiz et al., 2023; Sohn et al., 2023), and highlight their limitations in disentangling style and content performance in evaluation. Given the importance of human evaluation in T2I personalization (Hertz et al., 2023; Sohn et al., 2023; Ruiz et al., 2023; Shah et al., 2023; Jeong et al., 2024), we also conduct a user study through Amazon Mechanical Turk to measure both style and text alignment."
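The CLIP-T and CLIP-I scores cited above reduce to cosine similarity between encoder embeddings (text vs. image, or image vs. image). A minimal illustrative sketch, with hand-made vectors standing in for real CLIP encoder outputs (the embedding values here are hypothetical, not from the paper):

```python
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings, the core of CLIP-T
    (prompt-image alignment) and CLIP-I (content similarity) scores."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical pre-computed embeddings; a real pipeline would obtain
# these from a CLIP text/image encoder.
text_emb = np.array([0.2, 0.9, 0.1])
image_emb = np.array([0.25, 0.85, 0.2])

clip_t = cosine_score(text_emb, image_emb)  # near 1.0 => strong alignment
```

A benchmark run would average such scores over all prompt-image (or reference-output) pairs in the evaluation set.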
Researcher Affiliation | Collaboration | Litu Rout (Google, UT Austin), Yujia Chen (Google), Nataniel Ruiz (Google), Abhishek Kumar (Google DeepMind), Constantine Caramanis (UT Austin), Sanjay Shakkottai (UT Austin), Wen-Sheng Chu (Google)
Pseudocode | Yes | Algorithm 1: RB-Modulation (Exact); Algorithm 2: RB-Modulation (Proximal)
Open Source Code | Yes | The source code is available on the project page: https://rb-modulation.github.io/.
Open Datasets | Yes | "We use style images from the StyleAligned benchmark (Hertz et al., 2023) for stylization and content images from DreamBooth (Ruiz et al., 2023) for content-style composition."
Dataset Splits | No | The paper evaluates on the StyleAligned benchmark and DreamBooth images, and reports a user study with 155 participants over 100 styles from the StyleAligned dataset (7,200 answers in total). Because the method is training-free, there are no train/validation/test splits to report: the paper specifies the evaluation datasets but gives no split percentages or sample counts, as none are needed to reproduce a learning process.
Hardware Specification | Yes | All experiments run on a single NVIDIA A100 GPU.
Software Dependencies | No | The paper names its base model, Stable Cascade (Pernias et al., 2024), and components such as the CLIP text encoder (Radford et al., 2021), the CSD image encoder (Somepalli et al., 2024), and Lang SAM. However, it provides no version numbers for general software dependencies (Python, PyTorch, CUDA, or other libraries) that full reproducibility would require.
Experiment Setup | Yes | "Our method introduces only two hyper-parameters: stepsize η and optimization steps M in Algorithm 1. We use DDIM sampling with η = 0.1 and M = 3 for all the experiments."
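The two hyper-parameters govern an inner gradient loop run at each denoising step: M gradient updates of stepsize η pulling the current latent's style descriptor toward the reference's. A toy sketch, assuming a quadratic stand-in for the style-matching loss (the real method differentiates through a CSD feature extractor, and `modulate` / `style_loss_grad` are hypothetical names, not the paper's API):

```python
import numpy as np

ETA, M = 0.1, 3  # stepsize and optimization steps used for all experiments

def style_loss_grad(x: np.ndarray, style_target: np.ndarray) -> np.ndarray:
    """Gradient of a toy quadratic stand-in for ||f(x) - f(ref)||^2,
    where f would be a style-descriptor network in the real method."""
    return 2.0 * (x - style_target)

def modulate(x: np.ndarray, style_target: np.ndarray,
             eta: float = ETA, steps: int = M) -> np.ndarray:
    """Inner loop sketch: M gradient steps of size eta nudging the
    current latent toward the reference style descriptor."""
    for _ in range(steps):
        x = x - eta * style_loss_grad(x, style_target)
    return x

# One denoising step's controller update on a toy 2-D "latent".
x0 = np.array([1.0, -1.0])
x1 = modulate(x0, np.zeros(2))  # strictly closer to the style target
```

In the full algorithm this update is interleaved with DDIM sampling, so the cost scales with M per denoising step, which is why a small M = 3 keeps inference practical on a single GPU.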