EG4D: Explicit Generation of 4D Object without Score Distillation
Authors: Qi Sun, Zhiyang Guo, Ziyu Wan, Jing Nathan Yan, Shengming Yin, Wengang Zhou, Jing Liao, Houqiang Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The qualitative results, quantitative evaluations, and user preferences validate that our EG4D outperforms SDS-based baselines by a large margin, producing 4D content with realistic 3D appearance, high image fidelity, and fine temporal consistency. Extensive ablation studies also showcase our effective solutions to the challenges in reconstructing 4D representations from synthesized videos. Section 5: EXPERIMENTS, Section 5.1: EXPERIMENTAL SETTINGS, Section 5.2: RESULTS, Section 5.3: ABLATION STUDIES. |
| Researcher Affiliation | Academia | 1University of Science and Technology of China 2City University of Hong Kong 3Cornell University |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. The methods are described using mathematical equations and descriptive text. |
| Open Source Code | Yes | Code available: github.com/jasongzy/EG4D |
| Open Datasets | Yes | The panel (a) uses video rendered from Objaverse (Deitke et al., 2023) dataset, a large-scale 3D dataset that also contains some animation models. Figure 15 (b) shows the 4D generation results from in-the-wild videos from the Consistent4D benchmark; In panel (c), we leverage the pose-conditioned character video generation model, Animate Anyone (Hu, 2024), as our video model in our framework. |
| Dataset Splits | No | The paper describes using input images and SVD-generated videos, as well as data from Objaverse and Consistent4D benchmarks. However, it does not specify any training, validation, or test dataset splits for the experiments conducted in this paper. |
| Hardware Specification | Yes | Our implementation is primarily based on the PyTorch framework and tested on a single NVIDIA RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions the 'PyTorch framework' and 'SDXL-Turbo (Sauer et al., 2023b)', but does not provide specific version numbers for PyTorch or other key software components. |
| Experiment Setup | Yes | In Stage I, we use SVD-img2vid-xl (Blattmann et al., 2023a) to generate 25-frame videos. For multi-view generation, we employ SV3D^p conditioned on a camera pose sequence, i.e., 21 azimuth angles (360° evenly divided) and a fixed 0° elevation. All images are set to a resolution of 576 × 576. In Stage III, we use SDXL-Turbo (Sauer et al., 2023b) with small strength (0.167) to provide the diffusion prior. In the semantic refinement stage (Stage III), we fine-tune 4DGS for 5k steps with the Adam optimizer. The initial learning rate is set to 1e-4 with exponential decay. The weight λ in the diffusion refinement loss is set to 0.5. |
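The Stage III training configuration quoted above (Adam, initial learning rate 1e-4 with exponential decay over 5k steps, diffusion-loss weight λ = 0.5) can be sketched as follows. This is a minimal illustration, not the authors' code: the decay target `FINAL_LR` is an assumption, since the paper does not state the final learning rate, and `total_loss` only names the weighting scheme, not the actual reconstruction or diffusion terms.

```python
# Hypothetical sketch of the Stage III fine-tuning schedule reported in the
# paper. FINAL_LR is an assumed decay target; the paper only states that the
# initial LR of 1e-4 decays exponentially over 5k steps.
TOTAL_STEPS = 5_000   # fine-tuning steps reported in the paper
INIT_LR = 1e-4        # initial learning rate (reported)
FINAL_LR = 1e-6       # assumed target learning rate at TOTAL_STEPS
LAMBDA_DIFF = 0.5     # weight of the diffusion refinement loss (reported)


def lr_at(step: int) -> float:
    """Exponentially decayed learning rate at a given optimization step."""
    # Per-step factor chosen so the LR reaches FINAL_LR exactly at TOTAL_STEPS.
    gamma = (FINAL_LR / INIT_LR) ** (1.0 / TOTAL_STEPS)
    return INIT_LR * gamma**step


def total_loss(recon_loss: float, diffusion_loss: float) -> float:
    """Combined objective: reconstruction term plus weighted diffusion prior."""
    return recon_loss + LAMBDA_DIFF * diffusion_loss
```

In a PyTorch implementation the same schedule would typically be realized with `torch.optim.Adam` together with `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)`, stepping the scheduler once per optimization step.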