DiffFERV: Diffusion-based Facial Editing of Real Videos

Authors: Xiangyi Chen, Han Xue, Li Song

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that DiffFERV achieves state-of-the-art performance in both reconstruction and editing tasks. ... Extensive evaluations demonstrate that DiffFERV excels in preserving facial identity and ensuring temporal consistency, especially when handling challenging real-world data. DiffFERV sets a new benchmark for robust, generalizable, and high-quality face video editing. ... Qualitative Results: In Fig. 4, we present reconstruction results. ... Quantitative Results: Table 1 shows that DiffFERV achieves the highest scores across all reconstruction metrics. ... Ablation Studies
Researcher Affiliation | Academia | Xiangyi Chen¹, Han Xue², Li Song¹ — ¹Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University; ²School of Computer Science and Technology, Donghua University. EMAIL, EMAIL, song EMAIL
Pseudocode | No | The paper describes its methodology in prose and uses mathematical equations for clarification (e.g., Equations 1–8), but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/MunchkinChen/DiffFERV.
Open Datasets | Yes | For specialization, we initialize with the pretrained weights of Stable Diffusion 1.5. We utilize the FFHQ dataset [Karras et al., 2019] as our training dataset. ... We evaluate DiffFERV on CelebV-HQ [Zhu et al., 2022].
Dataset Splits | No | The paper states: "Within our dataset, we integrate 10% of image-text pairs sampled from the LAION-2B-en [Rombach et al., 2022b] dataset." However, it does not provide train/test/validation splits for the primary datasets (FFHQ for training, CelebV-HQ for evaluation), nor does it specify how CelebV-HQ was partitioned for testing.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU specifications, or memory amounts used for the experiments. It mentions software such as Stable Diffusion 1.5 and GMFlow, but no hardware.
Software Dependencies | Yes | For specialization, we initialize with the pretrained weights of Stable Diffusion 1.5. ... We employ Pixtral 2 for automatic captioning. ... We leverage GMFlow [Xu et al., 2022a] for optical flow prediction in TTA.
Experiment Setup | Yes | We opt for the Adam [Kingma, 2014] optimizer with a batch size of 8 and a learning rate of 2.5e-6. For temporal modeling, we configure the window length to w = 3 for SWCFA. During editing, we use DDIM [Song et al., 2021] sampling and inversion with T = 50 timesteps. A negative-prompt [Ban et al., 2025] scheme is adopted, where the original prompt serves as the negative prompt to enhance editing effectiveness, with the guidance scale set to 5. We use τ_app = 0.9 for texture-level edits and τ_app = 0.7 for shape-altering edits.
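The two sampling mechanisms named in the setup row — deterministic DDIM stepping with exact inversion, and negative-prompt classifier-free guidance — can be sketched in isolation. This is a minimal toy sketch in NumPy, not the authors' implementation; the function names, the scalar alpha schedule values, and the toy tensors are all illustrative assumptions.

```python
import numpy as np

def cfg(eps_edit, eps_neg, scale=5.0):
    # Classifier-free guidance with a negative prompt: the noise
    # prediction conditioned on the original caption (eps_neg) takes the
    # place of the usual empty-prompt unconditional branch.
    return eps_neg + scale * (eps_edit - eps_neg)

def ddim_step(x_t, eps, a_t, a_prev):
    # Deterministic (eta = 0) DDIM update from timestep t to the
    # previous timestep, given cumulative alphas a_t and a_prev.
    x0 = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)   # predicted clean sample
    return np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps

def ddim_invert(x_prev, eps, a_t, a_prev):
    # Exact algebraic inverse of ddim_step for the same noise prediction;
    # iterating this over T steps yields the DDIM inversion used for editing.
    x0 = (x_prev - np.sqrt(1.0 - a_prev) * eps) / np.sqrt(a_prev)
    return np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps

# Toy round trip: inverting a forward step recovers the original latent.
x = np.array([0.3, -1.2])
e = np.array([0.1, 0.4])
x_prev = ddim_step(x, e, a_t=0.5, a_prev=0.8)
print(np.allclose(ddim_invert(x_prev, e, a_t=0.5, a_prev=0.8), x))  # True
```

In a full pipeline, `eps_edit` and `eps_neg` would come from two U-Net passes per timestep, and the chain of 50 inversion steps would map the source frame's latent to noise before guided resampling with the editing prompt.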