RelitLRM: Generative Relightable Radiance for Large Reconstruction Models

Authors: Tianyuan Zhang, Zhengfei Kuang, Haian Jin, Zexiang Xu, Sai Bi, Hao Tan, He Zhang, Yiwei Hu, Miloš Hašan, William Freeman, Kai Zhang, Fujun Luan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on synthetic and real-world datasets to evaluate RelitLRM. The results demonstrate that our method matches state-of-the-art inverse rendering approaches while using significantly fewer input images and requiring much less processing time (seconds vs. hours).
Researcher Affiliation | Collaboration | 1. Massachusetts Institute of Technology, 2. Stanford University, 3. Cornell University, 4. Adobe Research
Pseudocode | Yes | A.1 PSEUDO CODE, Algorithm 1: RelitLRM pseudo code.
Open Source Code | No | Our project page is available at: https://relit-lrm.github.io/. This URL points to a project demonstration page, not explicitly a source code repository, and the paper does not contain an explicit code release statement.
Open Datasets | Yes | Our training dataset is constructed from a combination of 800K objects sourced from Objaverse (Deitke et al., 2023) and 210K synthetic objects from Zeroverse (Xie et al., 2024). [...] For lighting diversity, we gathered over 8,000 HDR environment maps from multiple sources, including Polyhaven, Laval Indoor (Gardner et al., 2017), Laval Outdoor (Hold-Geoffroy et al., 2019), internal datasets, and a selection of randomly generated Gaussian blobs.
Dataset Splits | Yes | The initial training phase employs four input views, four target denoising views (under target lighting, used for computing the diffusion loss), and two additional supervision views (under target lighting), all at a resolution of 256×256, with the environment map set to 128×256. The model is trained with a batch size of 512 for 80K iterations [...] Following this pretraining at the 256 resolution, we fine-tune the model for a larger context by increasing to six input views and six denoising target views at a higher resolution of 512×512. [...] We evaluate our method against these approaches on three publicly available datasets: STANFORD-ORB (Kuang et al., 2024), OBJECTS-WITH-LIGHTING (Ummenhofer et al., 2024), and TENSOIR-SYNTHETIC (Jin et al., 2023). The STANFORD-ORB dataset comprises 14 objects captured under three lighting conditions, with around 60 training views and 10 test views per lighting setup per object. The OBJECTS-WITH-LIGHTING dataset contains 7 objects with dense views captured under one training lighting condition and 3 views for two additional lighting conditions for testing. The TENSOIR-SYNTHETIC dataset consists of 4 objects with 100 training views under one lighting condition and 200 test views for each of five lighting conditions.
Hardware Specification | Yes | RelitLRM decodes 3D Gaussian (Kerbl et al., 2023) primitive parameters within approximately one second on a single A100 GPU. [...] Our transformer model [...] requires four days on 32 NVIDIA A100 GPUs (40GB VRAM each).
Software Dependencies | No | The paper mentions using the 'AdamW optimizer', 'GELU activations', 'DDIM sampler', and the 'classifier-free guidance technique' but does not specify software names with version numbers for the libraries or frameworks used (e.g., PyTorch version, TensorFlow version).
Experiment Setup | Yes | The model is trained with a batch size of 512 for 80K iterations, introducing the perceptual loss after the first 5K iterations to enhance training stability. Following this pretraining at the 256 resolution, we fine-tune the model for a larger context by increasing to six input views and six denoising target views at a higher resolution of 512×512. This fine-tuning expands the context window to up to 31K tokens. For diffusion training, we discretize the noise into 1,000 timesteps, adhering to the method described in Ho et al. (2020), with a variance schedule that linearly increases from 0.00085 to 0.0120. To enable classifier-free guidance, environment map tokens are randomly masked to zero with a probability of 0.1 during training. [...] We use the AdamW optimizer with a peak learning rate of 4e-4 and a weight decay of 0.05. The β1, β2 are set to 0.9 and 0.95 respectively. We use 2000 iterations of warmup and start to introduce perceptual loss after 5000 iterations for training stability. We then finetune the model [...] with a reduced peak learning rate of 4e-5 and 1000 warmup steps. Throughout training, we apply gradient clipping at 1.0 and skip steps where the gradient norm exceeds 20.0.
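The schedule details quoted above (linear variance schedule, CFG token masking, learning-rate warmup, and the gradient-norm skip rule) can be sketched in plain Python. This is a minimal illustration of the stated hyperparameters, not RelitLRM's implementation (which is not released); all function names are ours.

```python
import random

def linear_beta_schedule(n_steps=1000, beta_start=0.00085, beta_end=0.0120):
    """Diffusion variance schedule increasing linearly over 1,000 timesteps,
    from 0.00085 to 0.0120 as stated in the paper."""
    step = (beta_end - beta_start) / (n_steps - 1)
    return [beta_start + i * step for i in range(n_steps)]

def mask_env_tokens(env_tokens, p_drop=0.1, rng=random):
    """Classifier-free guidance training: with probability 0.1, replace the
    environment-map tokens with zeros (illustrative token representation)."""
    if rng.random() < p_drop:
        return [0.0 for _ in env_tokens]
    return env_tokens

def warmup_lr(step, peak_lr=4e-4, warmup_steps=2000):
    """Linear warmup to the peak learning rate (4e-4 over 2,000 iterations
    for pretraining; 4e-5 over 1,000 for fine-tuning)."""
    return peak_lr * min(1.0, step / warmup_steps)

def should_skip_step(grad_norm, max_norm=20.0):
    """Skip the optimizer update entirely when the gradient norm exceeds 20.0;
    otherwise gradients are clipped at 1.0 before the update."""
    return grad_norm > max_norm
```

For example, `warmup_lr(1000)` is halfway through warmup and returns 2e-4, and `should_skip_step(25.0)` returns True, matching the skip rule in the quoted setup.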