4K4DGen: Panoramic 4D Generation at 4K Resolution
Authors: Renjie Li, Panwang Pan, Bangbang Yang, Dejia Xu, Shijie Zhou, Xuanyang Zhang, Zeming Li, Achuta Kadambi, Zhangyang Wang, Zhengzhong Tu, Zhiwen Fan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS 4.1 EXPERIMENTAL SETTINGS 4.2 RESULTS 4.3 ABLATION STUDIES |
| Researcher Affiliation | Collaboration | 1 Bytedance, 2 University of Texas at Austin, 3 University of California, Los Angeles, 4 Texas A&M University |
| Pseudocode | No | The paper describes its methodology using textual descriptions and mathematical equations (e.g., Eq. 1-6) but does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | Furthermore, we will make our panorama datasets and related code publicly available in the future. |
| Open Datasets | Yes | we evaluate our methodology using a dataset of 16 panoramas generated by text-to-panorama diffusion models (Yang et al., 2024). The static panoramas used in the dataset of the main draft are generated by a text-to-panorama diffusion model, fine-tuned from stable diffusion (Rombach et al., 2022) on SUN360. We present quantitative results on an additional 32 scenes randomly sampled from WEB360 dataset (Wang et al., 2024b). |
| Dataset Splits | No | The paper mentions using 16 panoramas and an additional 32 scenes randomly sampled, but does not specify any training/test/validation splits for these datasets. For evaluation, it mentions: "For the test views, we select random cameras with p = 0 as part of our testing camera set." |
| Hardware Specification | Yes | All experiments are executed on a single NVIDIA A100 GPU with 80 GB RAM. |
| Software Dependencies | No | The paper mentions specific models like "Animate-anything model (Dai et al., 2023)", "SVD model (Blattmann et al., 2023a)", and "MiDaS (Ranftl et al., 2021; Birkl et al., 2023)" but does not provide specific version numbers for these or other ancillary software components like programming languages or libraries. |
| Experiment Setup | Yes | For perspective images, we uniformly select 20 directions u on the sphere S² as the z-axis of 20 cameras. In each experiment, the image plane size s is set at 0.6 × 0.6, with a focal length f = 0.6 and a resolution of 512 × 512. For the Panoramic Animator, we set the video length L = 14, the channel number c = 9, the latent code size (h, w) = (H/8, W/8), and the perspective image size pH = pW = W/4. The sphere is uniformly divided into 20 perspective views, each with an 80° FOV. For the denoiser, the max denoising step is 25. The hyper-parameters for optimization are set as follows: λdepth = 1, λscale = 0.1, λshift = 0.01. We conduct Spatial-Temporal Geometry Alignment optimization over 3000 iterations, with λscale and λshift set to zero during the first 1500 iterations. For the 4D representation training stage, Gaussian parameters are optimized over 10000 iterations for each time stamp t. The hyper-parameters for this stage are defined as λrgb = 1, λtemporal = λsem = λgeo = 0.05, and the disturbance vector range α is varied at 0.05, 0.1, and 0.2 during the 5400, 6600, and 9000 iterations, respectively. |
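The reported 4D-training hyper-parameters can be expressed compactly as a config. This is a minimal sketch, not code from the paper: the names `LOSS_WEIGHTS`, `total_loss`, and `alpha_schedule` are hypothetical, the weighted-sum form of the loss is an assumption, and the paper's milestone phrasing ("varied at 0.05, 0.1, and 0.2 during the 5400, 6600, and 9000 iterations") is read here as α holding each value up to its milestone and keeping 0.2 afterwards.

```python
# Hypothetical encoding of the reported 4D-representation-stage settings
# (lambda_rgb = 1, lambda_temporal = lambda_sem = lambda_geo = 0.05).
LOSS_WEIGHTS = {"rgb": 1.0, "temporal": 0.05, "sem": 0.05, "geo": 0.05}

def total_loss(losses):
    """Weighted sum of per-term losses; the combination form is assumed,
    only the weights are quoted from the paper."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())

def alpha_schedule(iteration):
    """Disturbance-vector range alpha over the 10000-iteration stage.
    Assumes each milestone bounds its value and 0.2 persists past 9000."""
    if iteration < 5400:
        return 0.05
    if iteration < 6600:
        return 0.1
    return 0.2
```

For example, `alpha_schedule(6000)` yields 0.1 under this reading, and a batch whose only nonzero term is an RGB loss of 2.0 gives `total_loss(...) == 2.0` since λrgb = 1.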