EG4D: Explicit Generation of 4D Object without Score Distillation

Authors: Qi Sun, Zhiyang Guo, Ziyu Wan, Jing Nathan Yan, Shengming Yin, Wengang Zhou, Jing Liao, Houqiang Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The qualitative results, quantitative evaluations and user preferences validate that our EG4D outperforms SDS-based baselines by a large margin, producing 4D content with realistic 3D appearance, high image fidelity, and fine temporal consistency. Extensive ablation studies also showcase our effective solutions to the challenges in reconstructing 4D representation with synthesized videos. (Section 5: EXPERIMENTS; Section 5.1: EXPERIMENTAL SETTINGS; Section 5.2: RESULTS; Section 5.3: ABLATION STUDIES)
Researcher Affiliation | Academia | (1) University of Science and Technology of China; (2) City University of Hong Kong; (3) Cornell University
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. The methods are described using mathematical equations and descriptive text.
Open Source Code | Yes | Code available: github.com/jasongzy/EG4D
Open Datasets | Yes | Panel (a) uses a video rendered from the Objaverse (Deitke et al., 2023) dataset, a large-scale 3D dataset that also contains some animation models. Figure 15 (b) shows 4D generation results from in-the-wild videos from the Consistent4D benchmark. In panel (c), we leverage the pose-conditioned character video generation model, Animate Anyone (Hu, 2024), as the video model in our framework.
Dataset Splits | No | The paper describes using input images and SVD-generated videos, as well as data from the Objaverse and Consistent4D benchmarks. However, it does not specify any training, validation, or test dataset splits for the experiments conducted in this paper.
Hardware Specification | Yes | Our implementation is primarily based on the PyTorch framework and tested on a single NVIDIA RTX 3090 GPU.
Software Dependencies | No | The paper mentions the PyTorch framework and SDXL-Turbo (Sauer et al., 2023b), but does not provide specific version numbers for PyTorch or other key software components.
Experiment Setup | Yes | In Stage I, we use SVD-img2vid-xl (Blattmann et al., 2023a) to generate 25-frame videos. For multi-view generation, we employ SV3Dp conditioned on a camera pose sequence, i.e., 21 azimuth angles (360° evenly divided) and a fixed 0° elevation. All images are set to a resolution of 576 × 576. In Stage III, we use SDXL-Turbo (Sauer et al., 2023b) with small strength (0.167) to provide the diffusion prior. In the semantic refinement stage (Stage III), we fine-tune 4DGS for 5k steps with the Adam optimizer. The initial learning rate is set to 1e-4 with exponential decay. The weight λ in the diffusion refinement loss is set to 0.5.
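The Stage-III optimization recipe quoted above (Adam, initial learning rate 1e-4 with exponential decay over 5k steps, diffusion refinement loss weighted by λ = 0.5) can be sketched in PyTorch as follows. This is an illustrative assumption, not the authors' released code: the 4DGS parameters are stood in for by a single dummy tensor, both loss terms are placeholders, and the decay target (10× reduction over the 5k steps) is assumed rather than stated in the paper.

```python
import torch

# Dummy stand-in for the 4D Gaussian Splatting parameters (illustrative only).
params = torch.nn.Parameter(torch.zeros(10))

# Adam with the reported initial learning rate of 1e-4.
optimizer = torch.optim.Adam([params], lr=1e-4)

# Exponential decay; gamma chosen so the lr falls by ~10x over 5k steps
# (an assumed schedule — the paper only says "exponential decay").
scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer, gamma=0.1 ** (1 / 5000)
)

lam = 0.5  # weight lambda on the diffusion refinement loss, as reported

for step in range(5000):
    optimizer.zero_grad()
    recon_loss = params.pow(2).mean()            # placeholder for the rendering loss
    refine_loss = (params - 1.0).pow(2).mean()   # placeholder for the diffusion prior loss
    loss = recon_loss + lam * refine_loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```

Under this assumed schedule the learning rate ends near 1e-5 after the 5k fine-tuning steps; any gamma could be substituted if the actual decay rate differs.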