GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement

Authors: Peiye Zhuang, Songfang Han, Chaoyang Wang, Aliaksandr Siarohin, Jiaxu Zou, Michael Vasilkovsky, Vladislav Shakhrai, Sergei Korolev, Sergey Tulyakov, Hsin-Ying Lee

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a comprehensive comparison of our method with multiple concurrent works (Hong et al., 2024; Xu et al., 2024a; Tang et al., 2024) using the Google Scanned Object (GSO) (Downs et al., 2022) and the OmniObject3D (Wu et al., 2023) datasets, employing various evaluation metrics. For instance, in the 4-view reconstruction task using the GSO dataset (Downs et al., 2022), our approach achieves a Peak Signal-to-Noise Ratio (PSNR) of 29.79, a Structural Similarity Index (SSIM) of 0.94 and a Learned Perceptual Image Patch Similarity (LPIPS) score of 0.059. Extensive experiments show that our feed-forward model achieves superior results compared to the baseline approaches, while our per-instance refinement approach enables further texture improvement on text and complex patterns.
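As a quick illustration of the headline metric quoted above, the sketch below computes PSNR with NumPy. The `psnr` helper is our own minimal reference implementation, not code from the GTR paper; SSIM and LPIPS involve windowed statistics and a learned network, respectively, and are omitted here.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio (dB) between two images scaled to [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images: no noise
    return float(10.0 * np.log10(max_val ** 2 / mse))

# A uniform per-pixel error of 0.1 gives MSE = 0.01, hence PSNR = 20 dB.
target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)
print(round(psnr(pred, target), 2))  # 20.0
```

Higher is better: the paper's 29.79 dB corresponds to an MSE roughly an order of magnitude smaller than this toy example.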
Researcher Affiliation | Industry | The authors are listed as: Peiye Zhuang, Songfang Han, Chaoyang Wang, Aliaksandr Siarohin, Jiaxu Zou, Michael Vasilkovsky, Vladislav Shakhrai, Sergei Korolev, Sergey Tulyakov, Hsin-Ying Lee. The paper also provides a URL: https://snap-research.github.io/GTR/. This URL, associated with 'snap-research.github.io', strongly indicates an affiliation with Snap Research, which is an industry entity. No other affiliations (academic or otherwise) are explicitly mentioned for any author.
Pseudocode | No | The paper includes figures (e.g., Figure 2, Figure 3) that visually illustrate the proposed approach and texture refinement procedure. However, it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code within the main text.
Open Source Code | No | The paper provides a URL: https://snap-research.github.io/GTR/. This is identified as a project demonstration page, not a direct link to a source-code repository (e.g., GitHub, GitLab). The paper does not contain an unambiguous statement that the authors are releasing the code for the methodology described in the paper, nor does it provide a direct link to a code repository.
Open Datasets | Yes | We conduct a comprehensive comparison of our method with multiple concurrent works (Hong et al., 2024; Xu et al., 2024a; Tang et al., 2024) using the Google Scanned Object (GSO) (Downs et al., 2022) and the OmniObject3D (Wu et al., 2023) datasets, employing various evaluation metrics. Our model is trained on a 140k asset dataset, which merges the filtered Objaverse dataset (Deitke et al., 2023) with an internal 3D asset dataset.
Dataset Splits | No | The paper states: 'For each asset, we randomly choose 4 views as input and another 4 views for supervision.' and 'For each asset, we randomly choose 4 views as input and another 8 views for supervision.' It also mentions: 'Specifically, 300 GSO assets and 130 OmniObject3D assets (from 30 classes) are used for evaluation.' While these details describe how views are selected for training/supervision and which assets are used for evaluation, they do not provide specific training/test/validation dataset splits (e.g., exact percentages or counts of assets) for the entire dataset used in the main experiments, which is necessary for fully reproducible data partitioning.
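The per-asset view selection quoted above can be sketched as a disjoint random split: draw the input and supervision views together so they never overlap. The function name, default counts, and use of `random.sample` are illustrative assumptions, not the authors' code.

```python
import random

def split_views(num_views, n_input=4, n_supervision=4, rng=None):
    """Randomly pick n_input views as input and a disjoint set for supervision.

    Sampling both sets in one draw guarantees they share no view indices.
    """
    rng = rng or random.Random()
    chosen = rng.sample(range(num_views), n_input + n_supervision)
    return chosen[:n_input], chosen[n_input:]

# e.g. 4 input + 8 supervision views, as in the paper's second setting
inputs, targets = split_views(32, n_supervision=8, rng=random.Random(0))
assert not set(inputs) & set(targets)  # disjoint by construction
```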
Hardware Specification | Yes | In the NeRF training stage, we use a batch size of 512 on 32 A100 GPUs. In the geometry refinement stage, the batch size is 192 on 32 A100 GPUs. Remarkably, our method achieves faithful texture reconstruction with just 20 steps of fine-tuning on 4-view images, requiring a mere 4 seconds on an A100 GPU. Appendix A further states: 'We train both models on 8 80G A100 GPUs. Experiments are run on 32 80G A100 GPUs.'
Software Dependencies | No | The paper mentions specific components like the 'AdamW optimizer' and refers to 'Zero123++ (Shi et al., 2023)' and a 'pretrained VAE encoder from an SD model (Rombach et al., 2022)', with a footnote pointing to a Hugging Face URL for the VAE model. However, it does not provide specific software library names with version numbers, such as Python, PyTorch, TensorFlow, or CUDA versions, which are required for a reproducible description of ancillary software.
Experiment Setup | Yes | In practice, the loss weights are set to λp = 0.5, λd = 0.5, λm = 1 and λn = 1. Input multi-view images are of 512 resolution. The triplane transformer contains 24 attention blocks with a hidden dimension of 1024. Each attention layer has 16 attention heads and each head has a dimension of 64. During the NeRF training stage, images are rendered at 512 resolution, and the NeRF model is trained using a patch size of 128². We uniformly sample 256 points along each camera ray. The density and color MLPs consist of 3 and 4 layers, respectively, with a hidden size of 512. In the NeRF training stage, we use an AdamW optimizer with a learning rate of 1e-4 and a weight decay of 0.05. Cosine scheduling is employed to gradually reduce the learning rate to 0 after 150k training iterations. We use a batch size of 512 on 32 A100 GPUs. In the geometry refinement stage, we choose a grid size of 256 during mesh extraction using DiffMC. We use another AdamW optimizer with a learning rate of 5e-5. The batch size is 192 on 32 A100 GPUs. In the per-instance texture refinement stage, the learning rates for the triplane feature and the color MLP, fc, are 0.15 and 1e-4, respectively.
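The learning-rate schedule quoted above (AdamW starting at 1e-4, cosine-annealed to 0 over 150k iterations) matches the standard cosine decay, sketched here in plain Python. The helper name and the clamping of steps beyond the horizon are our assumptions; the base rate and step count come from the row above.

```python
import math

BASE_LR = 1e-4        # initial learning rate from the paper
TOTAL_STEPS = 150_000  # iterations until the rate reaches 0

def cosine_lr(step, base_lr=BASE_LR, total=TOTAL_STEPS):
    """Standard cosine decay from base_lr to 0 over `total` steps."""
    step = min(step, total)  # hold at 0 past the horizon (our assumption)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total))

print(cosine_lr(0))            # base rate at the start
print(cosine_lr(TOTAL_STEPS))  # annealed to ~0 at 150k iterations
```

In a PyTorch setup this would typically be handed to the optimizer via a scheduler such as `torch.optim.lr_scheduler.CosineAnnealingLR`.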