Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction

Authors: Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Zhang, Bingbing Liu, Yingcong Chen

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To validate Lotus, we conduct extensive experiments on two primary geometric dense prediction tasks: zero-shot monocular depth and normal estimation. The results demonstrate that Lotus achieves promising, and even superior, performance on these tasks across a wide range of evaluation datasets. Compared to traditional discriminative methods, Lotus delivers remarkable results with only 59K training samples. Among generative approaches, Lotus also outperforms previous methods in both accuracy and efficiency, being significantly faster than methods like Marigold (Ke et al., 2024) (Fig. 3). Beyond these improvements, Lotus seamlessly supports various applications, e.g., joint estimation, single/multi-view 3D reconstruction, etc.
Researcher Affiliation Collaboration Jing He1 Haodong Li1 Wei Yin2 Yixun Liang1 Leheng Li1 Kaiqiang Zhou3 Hongbo Zhang3 Bingbing Liu3 Yingcong Chen1,4 1HKUST(GZ) 2University of Adelaide 3Noah's Ark Lab 4HKUST EMAIL; EMAIL
Pseudocode No The paper does not contain any structured pseudocode or algorithm blocks. It uses diagrams (e.g., Figure 2, Figure 4, Figure 10) to illustrate the architecture and inference pipeline, but no step-by-step algorithms are provided in a code-like format.
Open Source Code No The paper does not contain an explicit statement about the release of source code for the methodology described, nor does it provide a direct link to a code repository. It mentions supplementary materials for more details on implementation but not for code access.
Open Datasets Yes Training Datasets. Both depth and normal estimation are trained on two synthetic datasets covering indoor and outdoor scenes: ①Hypersim (Roberts et al., 2021) is a photorealistic synthetic dataset featuring 461 indoor scenes. We use the official training split, which contains approximately 54K samples. After filtering out incomplete samples, around 39K samples remain, all resized to 576 × 768 for training. ②Virtual KITTI (Cabon et al., 2020) is a synthetic street-scene dataset with five urban scenes under various imaging and weather conditions. We utilize four of these scenes for training, comprising about 20K samples. All samples are cropped to 352 × 1216, with the far plane at 80m. Evaluation Datasets and Metrics. ①For zero-shot affine-invariant depth estimation, we evaluate Lotus on NYUv2 (Silberman et al., 2012), ScanNet (Dai et al., 2017), KITTI (Geiger et al., 2013), ETH3D (Schops et al., 2017), and DIODE (Vasiljevic et al., 2019) using absolute mean relative error (AbsRel), and also report δ1 and δ2 values. ②For surface normal prediction, we employ NYUv2, ScanNet, iBims-1 (Koch et al., 2018), Sintel (Butler et al., 2012) and OASIS (Chen et al., 2020) datasets, reporting mean angular error (m.) as well as the percentage of pixels with an angular error below 11.25° and 30°.
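The evaluation metrics named above have standard definitions; a minimal sketch of how they are typically computed (function names here are illustrative, not from the paper) is:

```python
import numpy as np

def abs_rel(pred, gt):
    """Absolute mean relative error: mean(|pred - gt| / gt)."""
    return np.mean(np.abs(pred - gt) / gt)

def delta_acc(pred, gt, thresh=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < thresh.
    thresh=1.25 gives delta_1; thresh=1.25**2 gives delta_2."""
    ratio = np.maximum(pred / gt, gt / pred)
    return np.mean(ratio < thresh)

def mean_angular_error(pred_n, gt_n):
    """Mean angular error in degrees between unit normal maps of shape (H, W, 3).
    The percentage-below-threshold metrics (11.25°, 30°) follow the same per-pixel angles."""
    cos = np.clip(np.sum(pred_n * gt_n, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()
```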
Dataset Splits Yes Training Datasets. Both depth and normal estimation are trained on two synthetic datasets covering indoor and outdoor scenes: ①Hypersim (Roberts et al., 2021) is a photorealistic synthetic dataset featuring 461 indoor scenes. We use the official training split, which contains approximately 54K samples. After filtering out incomplete samples, around 39K samples remain, all resized to 576 × 768 for training. ②Virtual KITTI (Cabon et al., 2020) is a synthetic street-scene dataset with five urban scenes under various imaging and weather conditions. We utilize four of these scenes for training, comprising about 20K samples. All samples are cropped to 352 × 1216, with the far plane at 80m. Following Marigold (Ke et al., 2024), we probabilistically choose one of the two datasets and then draw samples from it for each batch (Hypersim 90% and Virtual KITTI 10%).
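The per-batch dataset mixing described above (choose one dataset per batch, Hypersim with probability 0.9, Virtual KITTI with 0.1) can be sketched as follows; the function name and signature are illustrative assumptions, not from the paper's code:

```python
import random

def sample_batch(hypersim, vkitti, batch_size, p_hypersim=0.9, rng=random):
    """Pick one dataset for the whole batch (hypothetical sketch of the
    Marigold-style mixing: Hypersim 90%, Virtual KITTI 10%), then draw
    batch_size samples with replacement from the chosen dataset."""
    dataset = hypersim if rng.random() < p_hypersim else vkitti
    return [rng.choice(dataset) for _ in range(batch_size)]
```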
Hardware Specification No The paper does not provide specific details about the hardware used for running its experiments, such as GPU models, CPU types, or memory amounts. It only mentions 'graphic memory' in Figure 3 in the context of another model's requirements, not for their own experimental setup.
Software Dependencies No The paper states: 'We implement Lotus based on Stable Diffusion V2 (Rombach et al., 2022), without text conditioning.' While it mentions Stable Diffusion V2, it does not specify versions for other crucial software dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes Implementation details. We implement Lotus based on Stable Diffusion V2 (Rombach et al., 2022), without text conditioning. During training, we fix the time-step t = 1000. For depth estimation, we predict in disparity space, i.e., d′ = 1/d, where d′ represents the values in disparity space and d denotes the true depth. For more details, please see the supplementary materials. Following Marigold (Ke et al., 2024), we probabilistically choose one of the two datasets and then draw samples from it for each batch (Hypersim 90% and Virtual KITTI 10%).
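The disparity conversion d′ = 1/d quoted above is a simple elementwise inversion; a minimal sketch, assuming clipping to the 80 m far plane mentioned for Virtual KITTI (the function name and epsilon guard are illustrative assumptions):

```python
import numpy as np

def depth_to_disparity(depth, far_plane=80.0, eps=1e-6):
    """Convert true depth d to disparity d' = 1/d.
    Clipping to [eps, far_plane] is an illustrative assumption to avoid
    division by zero and to respect the 80 m far plane."""
    depth = np.clip(depth, eps, far_plane)
    return 1.0 / depth
```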