PanoDiT: Panoramic Videos Generation with Diffusion Transformer
Authors: Muyang Zhang, Yuzhi Chen, Rongtao Xu, Changwei Wang, Jinming Yang, Weiliang Meng, Jianwei Guo, Huihuang Zhao, Xiaopeng Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Compared to previous methods, our PanoDiT achieves state-of-the-art performance across various evaluation metrics and a user study, and the code is available in the supplementary material. The training configuration featured a resolution of 512 × 1024, a frame length of 144, a batch size of 2, a learning rate of 5 × 10⁻⁶, and a total of 100,000 training steps. We conducted comparative experiments using AnimateDiff, 360DVD, and SVD, all of which were trained under identical conditions on WEB360 and our PHQ360 dataset to ensure a fair comparison. Quantitative Results. The quantitative results are given in Table 1. We report not only standard metrics for video evaluation, such as Fréchet Video Distance (FVD) (Unterthiner et al. 2018), but also Fréchet Inception Distance (FID) (Heusel et al. 2017) and Inception Score (IS) for individual frames of ERP videos. |
| Researcher Affiliation | Academia | 1School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China 2MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China 3Qilu University of Technology (Shandong Academy of Sciences), Shandong, China 4School of Artificial Intelligence, Beijing Normal University, Beijing, China 5College of Computer Science and Technology, Hengyang Normal University, Hunan, China |
| Pseudocode | No | The paper describes methods in prose and mathematical equations but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Compared to previous methods, our PanoDiT achieves state-of-the-art performance across various evaluation metrics and a user study, and the code is available in the supplementary material. |
| Open Datasets | Yes | We construct a novel Panoramic High-Quality 360 (PHQ360) Dataset based on WEB360, which has been meticulously refined using aesthetic and motion scoring, along with Likert scale-based human evaluation. In previous work introducing text-to-panoramic video datasets, datasets like ODV360 (Cao et al. 2023) and WEB360 (Wang et al. 2024) were developed. |
| Dataset Splits | No | The training configuration featured a resolution of 512 × 1024, a frame length of 144, a batch size of 2, a learning rate of 5 × 10⁻⁶, and a total of 100,000 training steps. This specifies training parameters, but no explicit training/validation/test splits (percentages or counts) are mentioned for PHQ360 or WEB360. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU models or CPU specifications. |
| Software Dependencies | No | The paper does not list specific software components with their version numbers, such as Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | The training configuration featured a resolution of 512 × 1024, a frame length of 144, a batch size of 2, a learning rate of 5 × 10⁻⁶, and a total of 100,000 training steps. We trained PanoDiT at three different scales: Small (S), Base (B), and Large (L) using our PHQ360 dataset. |
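The FVD and FID metrics quoted above both reduce to the same computation: the Fréchet distance between two Gaussians fitted to deep features (I3D features over clips for FVD, Inception features over frames for FID). The sketch below is not the paper's code; it is a minimal NumPy implementation of that shared formula, with the function name `frechet_distance` and the toy inputs chosen for illustration. It avoids `scipy.linalg.sqrtm` by using the identity Tr((Σ₁Σ₂)^½) = Tr((Σ₁^½ Σ₂ Σ₁^½)^½), whose inner matrix is symmetric PSD.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2):
        ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2})
    FID and FVD apply this to feature statistics of real vs. generated media.
    """
    # Symmetric PSD square root of sigma1 via eigendecomposition.
    vals, vecs = np.linalg.eigh(sigma1)
    s1_sqrt = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    # Tr((sigma1 sigma2)^{1/2}) == Tr((s1_sqrt sigma2 s1_sqrt)^{1/2});
    # the inner product matrix is symmetric PSD, so eigvalsh is safe.
    inner_vals = np.linalg.eigvalsh(s1_sqrt @ sigma2 @ s1_sqrt)
    tr_sqrt = np.sqrt(np.clip(inner_vals, 0.0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

# Identical Gaussians have distance 0.
d_same = frechet_distance(np.zeros(3), np.eye(3), np.zeros(3), np.eye(3))
# A unit mean shift in 1D with identity covariance gives distance 1.
d_shift = frechet_distance(np.array([1.0]), np.eye(1), np.array([0.0]), np.eye(1))
```

In practice the means and covariances would be estimated from feature vectors of many real and generated samples; the eigendecomposition route shown here behaves like the usual `sqrtm`-based FID code for the PSD covariances that arise in that setting.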