UFO: Enhancing Diffusion-Based Video Generation with a Uniform Frame Organizer

Authors: Delong Liu, Zhaohui Hou, Mingjie Zhan, Shihao Han, Zhicheng Zhao, Fei Su

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental The experimental results indicate that UFO effectively enhances video generation quality, as demonstrated on public video generation benchmarks. Practical tests on the public benchmark VBench (Huang et al. 2024) show that UFO notably improves video consistency and quality. In the primary dimension of concern, Temporal Quality (TQ), UFO significantly enhances both the consistency of subject and background and the smoothness of video motion. A higher UFO intensity yields more pronounced improvements, but it may also cause videos with minimal dynamics to become static; in practice, however, users can freely adjust the intensity based on the video outcome and thus avoid this issue. The Frame-Wise Quality (FWQ) dimension, which relates to image quality, shows the same trend, because UFO effectively eliminates blurring and flickering in the video, thereby improving per-frame quality.
Researcher Affiliation Collaboration Delong Liu1, Zhaohui Hou2, Mingjie Zhan2, Shihao Han2, Zhicheng Zhao1,3,4,*, Fei Su1,3,4 — 1School of Artificial Intelligence, Beijing University of Posts and Telecommunications; 2SenseTime; 3Beijing Key Laboratory of Network System and Network Culture, China; 4Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing, China
Pseudocode No The paper describes the methodology using textual explanations and mathematical equations (e.g., Equation 1 and 2), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes https://github.com/Delong-liu-bupt/UFO
Open Datasets Yes For training the consistency UFO, we use a subset of the LAION-Aesthetics V2 (Schuhmann et al. 2022) dataset with aesthetic scores above 6.5, from which we extract 12K image-text pairs to create static video-text pairs for training. For the training of stylization UFOs, we collect 300 videos for each of the four styles (Pixel Art, oil painting, animated style, black and white) from publicly available video resources on the internet.
Dataset Splits No The paper specifies the datasets used for training (12K image-text pairs from LAION-Aesthetics V2 and 300 videos per style), and mentions evaluating on the Vbench benchmark, but does not provide explicit training, validation, or test splits for the UFO's internal model training.
Hardware Specification Yes Training is conducted on 4 NVIDIA A100 GPUs, with inference running on a single GPU.
Software Dependencies No The paper mentions using specific models such as EasyAnimate-V2 and Open-Sora-V1.2 and outlines training strategies, but does not provide version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA).
Experiment Setup Yes Training is conducted on 4 NVIDIA A100 GPUs, with inference running on a single GPU. During training, only the parameters of the UFOs are updated, with each UFO undergoing 3000 training steps. All adapters have a hyperparameter dimension d = 4, and gradient accumulation is not used. For Open-Sora, a linear warm-up strategy is employed in the first 500 steps, where the learning rate gradually increases from nearly zero to 2e-4, and this rate is maintained after the warm-up phase. For EasyAnimate, the learning rate is set at 1e-4 and remains constant. The rest of the training settings follow the original methods. During inference, all settings use the recommended configurations of the original methods, with videos set at 24 Frames Per Second (FPS), and all experiments and visual effects in the paper use the same random seed to compare results with and without UFOs.
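The learning-rate schedule quoted above (linear warm-up from near zero to 2e-4 over 500 steps, then constant) can be sketched as a small step-indexed function. This is a minimal illustrative sketch, not the authors' code; the function name and defaults are assumptions.

```python
def lr_at_step(step, warmup_steps=500, peak_lr=2e-4):
    """Learning rate at a given (0-indexed) training step.

    Ramps linearly from near zero up to peak_lr over the first
    warmup_steps steps, then holds peak_lr constant thereafter,
    matching the warm-up strategy described for the Open-Sora setup.
    """
    if step < warmup_steps:
        # Linear ramp: step 0 gives peak_lr / warmup_steps (near zero),
        # step warmup_steps - 1 gives exactly peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```

In a PyTorch training loop, the same schedule could be attached to an optimizer via `torch.optim.lr_scheduler.LambdaLR` with a multiplicative version of this function.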