Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
Authors: Jingyuan Chen, Fuchen Long, Jie An, Zhaofan Qiu, Ting Yao, Jiebo Luo, Tao Mei
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments of long video generation on the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency. ... Extensive experiments on VBench verify the effectiveness of our proposal in terms of both visual and motion quality. |
| Researcher Affiliation | Collaboration | 1 University of Rochester, Rochester, NY USA 2 HiDream.ai Inc. |
| Pseudocode | No | The paper describes methods like 'Coherent tail latent sampling', 'Subject-Aware Cross-Frame Attention (SACFA)', and 'Self-Recurrent Guidance' using mathematical formulations and descriptive text, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing code or a link to a code repository. |
| Open Datasets | Yes | We empirically verify the merit of our Ouroboros-Diffusion for both single-scene and multi-scene long video generation on the VBench (Huang et al. 2024) benchmark. |
| Dataset Splits | Yes | We sample 93 common prompts from VBench as the testing set for single-scene video generation. ... For each multi-prompt group, we generate 256 video frames for performance comparison. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper states: 'We implement our Ouroboros Diffusion on the text-to-video model VideoCrafter2 (Chen et al. 2024a).' However, it does not provide specific version numbers for underlying software libraries (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | The total number of time steps T in the DDIM sampler is set to 64, matching the queue length. The threshold for the low-pass filter in coherent tail latent sampling is set to 0.25. SACFA is applied only in the down-blocks and mid-block (with down-sampling factors of 2 and 4) of the spatial-temporal UNet empirically. The last 16 frames in the queue are involved in SACFA calculation. The self-recurrent guidance derived from the first 16 frames at the queue head applies to the last 16 frames at the tail. The parameter λ for updating the subject feature bank is set to 0.98. |
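To make the reported setup concrete, the sketch below models two of its ingredients: the FIFO frame-latent queue whose length matches the T = 64 DDIM steps, and the exponential-moving-average update of the subject feature bank with λ = 0.98. This is a toy illustration under stated assumptions, not the authors' implementation; all function and variable names are hypothetical, and feature vectors are plain Python lists rather than real UNet activations.

```python
from collections import deque

# Queue length equals the number of DDIM steps (T = 64 in the paper's setup),
# so the frame popped from the head is fully denoised while the frame pushed
# at the tail is the noisiest.
T = 64
latent_queue = deque(maxlen=T)  # hypothetical FIFO of per-frame latents

def update_subject_bank(bank, new_feat, lam=0.98):
    """EMA update of the subject feature bank (lambda = 0.98 per the setup).

    bank: current bank features, or None before the first update.
    new_feat: subject features extracted from the newest frames.
    Toy sketch: features are lists of floats, not real attention features.
    """
    if bank is None:
        return list(new_feat)
    return [lam * b + (1.0 - lam) * f for b, f in zip(bank, new_feat)]
```

In this reading, λ close to 1 keeps the bank stable across the long video, which is consistent with the paper's emphasis on subject consistency; the guidance and SACFA computations that consume the head/tail 16-frame windows are omitted here.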