GenXD: Generating Any 3D and 4D Scenes

Authors: Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, Jianfeng Wang, Gim Hee Lee, Lijuan Wang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation. (Sec. 5.1, Experimental Setup, Datasets) GenXD is trained with the combination of 3D and 4D datasets.
Researcher Affiliation Collaboration National University of Singapore, Microsoft Corporation
Pseudocode No The paper does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Our curated 4D dataset, CamVid-30K, and GenXD model will be made publicly available.
Open Datasets Yes This large-scale dataset, termed CamVid-30K, will be made available for public use. For 3D datasets, we leverage five datasets with camera pose annotation: Objaverse (Deitke et al., 2023), MVImageNet (Yu et al., 2023), Co3D (Reizenstein et al., 2021), Re10K (Zhou et al., 2018) and ACID (Liu et al., 2021). For 4D datasets, we leverage the synthetic data Objaverse-XL-Animation (Deitke et al., 2024; Liang et al., 2024) and our CamVid-30K.
Dataset Splits No The paper names the datasets used for training and evaluation but does not provide comprehensive training/validation/test splits for the data used to train the main GenXD model. For one specific experiment it states that "3 views in each scene are used for training", but it does not detail the test or validation splits for those datasets, nor the splits for the overall training of GenXD.
Hardware Specification Yes The model is trained on 32 A100 GPUs with batch size 128 and resolution 256×256.
Software Dependencies No The paper mentions the use of Stable Video Diffusion as a pretrained model and the AdamW optimizer, but does not provide specific version numbers for software dependencies such as libraries, frameworks, or programming languages.
Experiment Setup Yes GenXD is trained in three stages. We first train the UNet only with 3D data for 500K iterations and then fine-tune it with both 3D and 4D data for 500K iterations in single-view mode. Finally, GenXD is trained with both single-view and multi-view modes with all the data for 500K iterations. The model is trained on 32 A100 GPUs with batch size 128 and resolution 256×256. The AdamW (Loshchilov & Hutter, 2019) optimizer with learning rate 5×10⁻⁵ is adopted.
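The quoted three-stage schedule and hyperparameters can be summarized in a small sketch. Stage boundaries, data mixes, and hyperparameter values below are taken from the reported setup; the structure itself (the `STAGES` list and the `stage_for_iteration` helper) is a hypothetical illustration, not code from the paper.

```python
# Sketch of the reported GenXD training schedule; all values are quoted
# from the paper's experiment setup, the code structure is hypothetical.

STAGES = [
    # (name, iterations, training data, view mode)
    ("stage1", 500_000, "3D only", "single-view"),
    ("stage2", 500_000, "3D + 4D", "single-view"),
    ("stage3", 500_000, "3D + 4D", "single-view + multi-view"),
]

HYPERPARAMS = {
    "hardware": "32x A100",
    "batch_size": 128,
    "resolution": (256, 256),
    "optimizer": "AdamW",
    "learning_rate": 5e-5,
}

def stage_for_iteration(it: int) -> str:
    """Map a global iteration count to its training stage."""
    total = 0
    for name, iters, _, _ in STAGES:
        total += iters
        if it < total:
            return name
    raise ValueError("iteration beyond the 1.5M-iteration schedule")
```

For example, `stage_for_iteration(750_000)` falls in the second stage, where both 3D and 4D data are used in single-view mode.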