MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences
Authors: Canyu Zhao, Mingyu Liu, Wen Wang, Weihua Chen, Fan Wang, Hao Chen, Bo Zhang, Chunhua Shen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present extensive experiments across various movie genres, demonstrating that our approach not only achieves superior visual and narrative quality but also effectively extends the duration of generated content significantly beyond current capabilities. ... In this paper, we introduce a hierarchical approach, dubbed Movie Dreamer, that marries autoregressive modeling with diffusion-based rendering to achieve a novel synthesis of long-term coherence and short-term fidelity in visual storytelling. ... 4 EXPERIMENTS |
| Researcher Affiliation | Collaboration | Canyu Zhao1, Mingyu Liu1, Wen Wang1 Weihua Chen2 Fan Wang2 Hao Chen1 Bo Zhang1 Chunhua Shen1 1Zhejiang University 2Alibaba Group |
| Pseudocode | No | The paper describes its methodology in text and provides architectural diagrams (Figure 2, 9) but does not include any distinct pseudocode or algorithm blocks. |
| Open Source Code | No | Social impact. Our method can generate high-quality long stories and videos, significantly lowering the barrier for individuals to create desired visually appealing content. However, it is essential to address the negative potential social impact of our method. Malicious users may exploit its capabilities to generate inappropriate or harmful content. Consequently, it is imperative to emphasize the responsible and ethical utilization of our method, under the collective supervision and governance of society as a whole. Ensuring appropriate usage practices requires a collaborative effort from various people, including researchers, policymakers, and the wider community. Furthermore, the code, model, as well as the data, will be fully released to improve the development of related fields. |
| Open Datasets | Yes | The majority of our training data is sourced from Condensed Movies (Bain et al., 2020), with additional data collected using the same methodology as this dataset, resulting in 5M keyframes with corresponding script annotations. ... Our dataset comprises images from Open Images (Kuznetsova et al., 2018), Journey DB (Sun et al., 2023a), and Object365 (Shao et al., 2019). |
| Dataset Splits | Yes | The majority of our training data is sourced from Condensed Movies (Bain et al., 2020), with additional data collected using the same methodology as this dataset, resulting in 5M keyframes with corresponding script annotations. To systematically evaluate the effectiveness of our method, we construct a test dataset consisting of 100 long movies that are NOT included in the training set, with 1M keyframes after pre-processing. |
| Hardware Specification | Yes | The entire training process of the compressor and decoder takes 3 weeks with 6 NVIDIA A800 GPUs. ... The global autoregressive model is trained for 3 days using 4 NVIDIA H100 GPUs with a constant learning rate of 2e-5 and 2k steps for warm-up. |
| Software Dependencies | No | The paper mentions specific models and frameworks like LLaMA-7B, SDXL, CLIP, Long CLIP, and FARL, but does not provide specific version numbers for underlying software libraries like PyTorch, TensorFlow, Python, or CUDA, which are necessary for full reproducibility. |
| Experiment Setup | Yes | We use two-layer MLPs with GELU as the activation function to unify different modalities into the input space of the large language model. ... We set the length of the context frames to 128, which results in the max sequence length around 5000. The global autoregressive model is trained for 3 days using 4 NVIDIA H100 GPUs with a constant learning rate of 2e-5 and 2k steps for warm-up. ... The decoder and the MLPs are trained using AdamW with a constant learning rate of 2e-5 for 120k steps and a noise offset of 0.05. ... An unusually high dropout rate of 50% is utilized... Token masking. We incorporate random masking of input tokens with a probability of 0.15... |
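The Experiment Setup row above names three concrete mechanisms: a two-layer GELU MLP that projects modality embeddings into the LLM input space, random masking of input tokens with probability 0.15, and a constant learning rate of 2e-5 reached after 2k warm-up steps. A minimal pure-Python sketch of these pieces follows; all dimensions, names, and the linear warm-up shape are illustrative assumptions, not details taken from the paper:

```python
import math
import random

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    # y = W x + b, with W given as a list of rows.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def modality_projector(x, W1, b1, W2, b2):
    """Two-layer MLP with a GELU activation, mapping one modality
    embedding (e.g. an image feature) into the LLM input space.
    Weight shapes are whatever the chosen dimensions require."""
    hidden = [gelu(v) for v in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

def mask_tokens(tokens, mask_token, p=0.15, rng=random):
    """Randomly replace each input token with a mask token with
    probability p (0.15 in the paper); the mask token itself is a
    hypothetical placeholder here."""
    return [mask_token if rng.random() < p else t for t in tokens]

def lr_at_step(step, base_lr=2e-5, warmup=2000):
    """Warm up over 2k steps to the constant 2e-5 rate; the linear
    warm-up shape is an assumption, as the paper does not specify it."""
    return base_lr * min(1.0, step / warmup)
```

In a real training loop these would be tensor operations in an ML framework, with the projector's output dimension matching the language model's hidden size and the mask token being a learned embedding; the sketch only makes the stated hyperparameters concrete.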