PanoDiT: Panoramic Videos Generation with Diffusion Transformer

Authors: Muyang Zhang, Yuzhi Chen, Rongtao Xu, Changwei Wang, Jinming Yang, Weiliang Meng, Jianwei Guo, Huihuang Zhao, Xiaopeng Zhang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Compared to previous methods, our PanoDiT achieves state-of-the-art performance across various evaluation metrics and a user study, with code available in the supplementary material. The training configuration featured a resolution of 512 × 1024, a frame length of 144, a batch size of 2, a learning rate of 5 × 10⁻⁶, and a total of 100,000 training steps. We conducted comparative experiments using AnimateDiff, 360DVD, and SVD, all of which were trained under identical conditions on WEB360 and our PHQ360 dataset to ensure a fair comparison. Quantitative Results. The quantitative results are given in Table 1. We report not only standard metrics for video evaluation, such as Fréchet Video Distance (Unterthiner et al. 2018) (FVD), but also Fréchet Inception Distance (Heusel et al. 2017) (FID) and Inception Score (IS) for individual frames of ERP videos.
Researcher Affiliation | Academia | 1. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; 2. MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; 3. Qilu University of Technology (Shandong Academy of Sciences), Shandong, China; 4. School of Artificial Intelligence, Beijing Normal University, Beijing, China; 5. College of Computer Science and Technology, Hengyang Normal University, Hunan, China
Pseudocode | No | The paper describes methods in prose and mathematical equations but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Compared to previous methods, our PanoDiT achieves state-of-the-art performance across various evaluation metrics and a user study, with code available in the supplementary material.
Open Datasets | Yes | We construct a novel Panoramic High-Quality 360 (PHQ360) Dataset based on WEB360, which has been meticulously refined using aesthetic and motion scoring, along with Likert scale-based human evaluation. In previous work introducing text-to-panoramic video datasets, datasets like ODV360 (Cao et al. 2023) and WEB360 (Wang et al. 2024) were developed.
Dataset Splits | No | The training configuration featured a resolution of 512 × 1024, a frame length of 144, a batch size of 2, a learning rate of 5 × 10⁻⁶, and a total of 100,000 training steps. This specifies training parameters, but no explicit training/validation/test splits or their percentages/counts are given for PHQ360 or WEB360.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU models or CPU specifications.
Software Dependencies | No | The paper does not list specific software components with their version numbers, such as Python, PyTorch, or CUDA versions.
Experiment Setup | Yes | The training configuration featured a resolution of 512 × 1024, a frame length of 144, a batch size of 2, a learning rate of 5 × 10⁻⁶, and a total of 100,000 training steps. We trained PanoDiT at three different scales: Small (S), Base (B), and Large (L) using our PHQ360 dataset.
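The reported hyperparameters can be collected into a single configuration object. A minimal sketch, assuming the values quoted above; the class and field names are illustrative, not taken from the authors' code:

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Hypothetical container for the training setup reported in the paper."""
    resolution: tuple = (512, 1024)  # height x width of ERP frames
    num_frames: int = 144            # frame length per video clip
    batch_size: int = 2
    learning_rate: float = 5e-6
    total_steps: int = 100_000
    model_scale: str = "B"           # one of "S", "B", "L" per the paper


cfg = TrainConfig()
```

Such a dataclass makes the reported setup easy to log and compare against a reimplementation, which is the main concern of a reproducibility check.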
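Both FID and FVD cited in the quantitative results reduce to the Fréchet distance between two Gaussians fitted to feature statistics (Inception features for FID, video-network features for FVD). A minimal sketch of that core computation, assuming the feature means and covariances have already been extracted; the function name is ours:

```python
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * (sigma1 @ sigma2)^(1/2))."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    # sqrtm can return tiny imaginary parts from numerical error; drop them.
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical statistics give a distance of 0; the metric grows as the generated-feature distribution drifts from the real one, so lower FID/FVD is better.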