Semantix: An Energy-guided Sampler for Semantic Style Transfer

Authors: Huiang He, Minghui Hu, Chuanxia Zheng, Chaoyue Wang, Tat-Jen Cham

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Experimental results demonstrate that Semantix not only effectively accomplishes the task of semantic style transfer across images and videos, but also surpasses existing state-of-the-art solutions in both fields. ... In this section, we conduct an exhaustive experimental analysis to substantiate the efficacy and superiority of our proposed method through qualitative comparison (Sec. 5.1), quantitative comparison (Sec. 5.2) and ablation study (Sec. 5.3)." |
| Researcher Affiliation | Collaboration | Huiang He (South China University of Technology); Minghui Hu (Spell Brush & Nanyang Technological University); Chuanxia Zheng (VGG, University of Oxford); Chaoyue Wang (The University of Sydney); Tat-Jen Cham (College of Computing and Data Science, Nanyang Technological University) |
| Pseudocode | Yes | "Algorithm 1: Proposed Semantix" |
| Open Source Code | No | The paper contains no explicit statement about releasing source code and provides no link to a code repository. |
| Open Datasets | Yes | "We select the COCO (Lin et al., 2014) dataset as the source of context images and obtain style images from WikiArt (Tan et al., 2018) and appearance images from Cross-Image (Alaluf et al., 2023)." |
| Dataset Splits | No | The paper describes its method as "training-free" and evaluates on "1000 sampled context-style image pairs" and "100 stylized videos." It specifies the number of evaluation samples but provides no traditional train/validation/test splits, since the method involves no training phase. |
| Hardware Specification | Yes | "We use NVIDIA A100 (80G) GPUs for all experiments." |
| Software Dependencies | No | The paper mentions building upon the "pre-trained Stable Diffusion v1.5 model" and using "AnimateDiff (Guo et al., 2023)" as a base model, but it does not give version numbers for programming languages, libraries, or operating systems used for implementation. |
| Experiment Setup | Yes | "We invert the input images or videos into noises through DDPM inversion across 60 timesteps. For classifier-free guidance, we set the scale factor ω = 3.5, aligning it with the sampling procedures. During the sampling process, the features for guidance are extracted from the second and third blocks of the UNet's decoder. In image style transfer tasks, we adjust the weights of style feature guidance, spatial feature guidance and semantic distance regularisation to γ_ref = 3.0, γ_c = 0.9, γ_reg = 1.0, respectively. Additionally, we incorporate a 2D position encoding into the features and assign it a weight of λ_pe = 3.0. For the video task, the corresponding hyper-parameters are set to γ_ref = 6.0, γ_c = 3.0, γ_reg = 5.0, λ_pe = 3.0. We further employ a hard clamp in the range of [−1, 1] for all guidance. After 20 denoising timesteps, we apply AdaIN (Huang and Belongie, 2017) for the style latents x_t^ref and output latents x_t^out." |
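Two pieces of the reported setup are standard operations that can be sketched concretely: the hard clamp on all guidance terms (range [−1, 1]) and the AdaIN step (Huang and Belongie, 2017) applied to the style and output latents after 20 denoising timesteps. The sketch below is a minimal, illustrative implementation of these two operations in numpy; the function names, tensor shapes, and axis conventions are assumptions for illustration, not the authors' code.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization (Huang & Belongie, 2017):
    re-normalize `content` so its per-channel mean/std match `style`.
    Shapes are assumed (C, H, W); statistics are taken over H and W."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

def clamp_guidance(grad, lo=-1.0, hi=1.0):
    """Hard clamp on a guidance term, matching the paper's [-1, 1] range."""
    return np.clip(grad, lo, hi)
```

In the reported configuration, the clamp would be applied to every guidance term at each sampling step, while AdaIN would align the statistics of the output latents with those of the style latents only after the first 20 denoising timesteps.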