Semantix: An Energy-guided Sampler for Semantic Style Transfer
Authors: Huiang He, Minghui Hu, Chuanxia Zheng, Chaoyue Wang, Tat-Jen Cham
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that Semantix not only effectively accomplishes the task of semantic style transfer across images and videos, but also surpasses existing state-of-the-art solutions in both fields. ... In this section, we conduct an exhaustive experimental analysis to substantiate the efficacy and superiority of our proposed method through qualitative comparison (Sec. 5.1), quantitative comparison (Sec. 5.2) and ablation study (Sec. 5.3). |
| Researcher Affiliation | Collaboration | Huiang He, South China University of Technology, EMAIL; Minghui Hu, Spellbrush & Nanyang Technological University, EMAIL; Chuanxia Zheng, VGG, University of Oxford, EMAIL; Chaoyue Wang, The University of Sydney, EMAIL; Tat-Jen Cham, College of Computing and Data Science, Nanyang Technological University, EMAIL |
| Pseudocode | Yes | Algorithm 1: Proposed Semantix |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository. |
| Open Datasets | Yes | We select the COCO (Lin et al., 2014) dataset as the source of context images and obtain style images from Wiki Art (Tan et al., 2018) and appearance images from Cross-Image (Alaluf et al., 2023). |
| Dataset Splits | No | The paper describes its method as "training-free" and evaluates on "1000 sampled context-style image pairs" and "100 stylized videos." It specifies the number of samples used for evaluation but does not provide traditional training/validation/test dataset splits, as the method involves no training phase. |
| Hardware Specification | Yes | We use NVIDIA A100 (80G) GPUs for all experiments. |
| Software Dependencies | No | The paper mentions building upon the "pre-trained Stable Diffusion v1.5 model" and using "AnimateDiff (Guo et al., 2023)" as a base model. However, it does not provide specific version numbers for programming languages, libraries, or operating systems used for implementation. |
| Experiment Setup | Yes | We invert the input images or videos into noises through DDPM inversion across 60 timesteps. For classifier-free guidance, we set the scale factor ω = 3.5, aligning it with the sampling procedures. During the sampling process, the features for guidance are extracted from the second and third blocks of the UNet's decoder. In image style transfer tasks, we adjust the weights of style feature guidance, spatial feature guidance and semantic distance regularisation to γref = 3.0, γc = 0.9, γreg = 1.0, respectively. Additionally, we incorporate a 2D position encoding into the features and assign it a weight of λpe = 3.0. For the video task, the corresponding hyper-parameters are set to γref = 6.0, γc = 3.0, γreg = 5.0, λpe = 3.0. We further employ a hard clamp in the range of [-1, 1] for all guidance. After 20 denoising timesteps, we apply AdaIN (Huang and Belongie, 2017) to the style latents x_t^ref and output latents x_t^out. |
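
The setup row above can be summarized in code. The sketch below is a minimal illustration, not the authors' implementation: the config names (`IMAGE_CFG`, `VIDEO_CFG`) and the `adain` helper are hypothetical, collecting only the hyperparameter values quoted from the paper, and `adain` implements the standard per-channel AdaIN statistic matching (Huang and Belongie, 2017) that the paper applies between the style and output latents after 20 denoising steps.

```python
import numpy as np

# Hyperparameter values quoted from the paper; the dict layout is an assumption.
IMAGE_CFG = dict(
    inversion_steps=60,      # DDPM inversion timesteps
    cfg_scale=3.5,           # classifier-free guidance scale ω
    gamma_ref=3.0,           # style feature guidance weight γref
    gamma_c=0.9,             # spatial feature guidance weight γc
    gamma_reg=1.0,           # semantic distance regularisation weight γreg
    lambda_pe=3.0,           # 2D position-encoding weight λpe
    guidance_clamp=(-1.0, 1.0),  # hard clamp applied to all guidance terms
    adain_after_step=20,     # AdaIN applied after this many denoising steps
)
# Video task reuses the image config with heavier guidance weights.
VIDEO_CFG = dict(IMAGE_CFG, gamma_ref=6.0, gamma_c=3.0, gamma_reg=5.0)

def adain(x_out, x_ref, eps=1e-5):
    """Match per-channel mean/std of x_out (C, H, W) to those of x_ref (AdaIN)."""
    mu_c = x_out.mean(axis=(1, 2), keepdims=True)
    std_c = x_out.std(axis=(1, 2), keepdims=True) + eps
    mu_s = x_ref.mean(axis=(1, 2), keepdims=True)
    std_s = x_ref.std(axis=(1, 2), keepdims=True) + eps
    return std_s * (x_out - mu_c) / std_c + mu_s
```

Matching the latent statistics this way transfers the style latents' channel-wise colour and contrast distribution onto the output latents without any learned parameters, which is consistent with the paper's training-free framing.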