Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D Generation
Authors: Xuyi Meng, Chen Wang, Jiahui Lei, Kostas Daniilidis, Jiatao Gu, Lingjie Liu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation. Project page: https://mengxuyigit.github.io/projects/zero-1-to-G/ |
| Researcher Affiliation | Academia | Xuyi Meng, Chen Wang, Jiahui Lei, Kostas Daniilidis, Jiatao Gu, Lingjie Liu, all affiliated with the University of Pennsylvania (email addresses redacted in the source). |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulations (e.g., equations for diffusion process and loss functions) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Project page: https://mengxuyigit.github.io/projects/zero-1-to-G/. While a project page is provided, it is not a direct link to a source-code repository, nor does the paper contain an explicit statement of code release in supplementary materials or similar. |
| Open Datasets | Yes | Dataset We train on the G-buffer Objaverse (Qiu et al., 2024) dataset, which consists of approximately 262,000 objects sourced from Objaverse (Deitke et al., 2024). Following prior works (Liu et al., 2023c;a; Wang et al., 2024a), we conduct quantitative comparisons using the Google Scanned Objects (GSO) dataset (Downs et al., 2022). Results on Real-world Datasets... MVImgNet (Yu et al., 2023) |
| Dataset Splits | No | The paper mentions using a subset of 30 objects from the GSO dataset for evaluation and a subset of 26,000 objects from the full dataset for ablation studies. It also describes how viewpoints are generated for training data (e.g., 6 views per object from Objaverse). However, it does not provide specific train/validation/test splits (e.g., percentages or exact counts) for the main datasets of objects used in training and evaluation, nor does it refer to standard predefined splits for these datasets within the context of their experiments. |
| Hardware Specification | Yes | For the first stage, we use a batch size of 64 on 4 NVIDIA L40 GPUs for 13k iterations, which takes about 1 day. For the second stage, we use a batch size of 64 on 8 NVIDIA L40 GPUs for 30k iterations, which takes about 2 days. For decoder fine-tuning, we use a total batch size of 64 on 8 NVIDIA L40 GPUs for 20k iterations. During inference, we use cfg = 3.5 and our method can generate Gaussian splats per object in 8.7 seconds on a single NVIDIA L40 GPU. |
| Software Dependencies | No | The paper mentions using 'Stable Diffusion Image Variations' for initialization, 'Adam optimizer', and extending from 'DDPM', but does not provide specific version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or other ancillary software components. |
| Experiment Setup | Yes | For the first stage, we use a batch size of 64 on 4 NVIDIA L40 GPUs for 13k iterations, which takes about 1 day. For the second stage, we use a batch size of 64 on 8 NVIDIA L40 GPUs for 30k iterations, which takes about 2 days. For decoder fine-tuning, we use a total batch size of 64 on 8 NVIDIA L40 GPUs for 20k iterations. During inference, we use cfg = 3.5. We use a constant learning rate of 1e-4 with a warmup over the first 100 steps. We use the Adam optimizer for both stages and the betas are set to (0.9, 0.999). For classifier-free guidance, we drop the condition image with a probability of 0.1. |
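The hyperparameters quoted in the Experiment Setup row can be sketched as a small configuration module. This is a hedged illustration, not the authors' code: the linear shape of the warmup, the null-condition convention, and all function names are assumptions; only the numeric values (lr 1e-4, 100 warmup steps, Adam betas (0.9, 0.999), cfg = 3.5, condition-drop probability 0.1) come from the paper.

```python
import random

# Values quoted from the paper's experiment setup.
BASE_LR = 1e-4
WARMUP_STEPS = 100
ADAM_BETAS = (0.9, 0.999)
CFG_SCALE = 3.5
COND_DROP_PROB = 0.1


def warmup_lr(step: int) -> float:
    """Ramp up to the constant base rate over the first 100 steps.

    The paper states a constant lr of 1e-4 with a 100-step warmup;
    the linear ramp shape here is an assumption.
    """
    return BASE_LR * min(1.0, step / WARMUP_STEPS)


def maybe_drop_condition(cond, rng: random.Random):
    """During training, drop the condition image with probability 0.1
    so the model also learns the unconditional score (needed for CFG).
    Returning None as the 'null condition' is an illustrative choice."""
    return None if rng.random() < COND_DROP_PROB else cond


def cfg_noise_estimate(eps_uncond: float, eps_cond: float,
                       scale: float = CFG_SCALE) -> float:
    """Standard classifier-free guidance combination at sampling time:
    push the conditional estimate away from the unconditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

With scale 3.5 the guided estimate overshoots the conditional prediction, which is the usual trade of sample diversity for condition fidelity.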