Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction

Authors: Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Zhang, Bingbing Liu, Yingcong Chen

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To validate Lotus, we conduct extensive experiments on two primary geometric dense prediction tasks: zero-shot monocular depth and normal estimation. The results demonstrate that Lotus achieves promising, and even superior, performance on these tasks across a wide range of evaluation datasets. Compared to traditional discriminative methods, Lotus delivers remarkable results with only 59K training samples. Among generative approaches, Lotus also outperforms previous methods in both accuracy and efficiency, being significantly faster than methods like Marigold (Ke et al., 2024) (Fig. 3). Beyond these improvements, Lotus seamlessly supports various applications, e.g., joint estimation, single/multi-view 3D reconstruction, etc.
Researcher Affiliation Collaboration Jing He1 Haodong Li1 Wei Yin2 Yixun Liang1 Leheng Li1 Kaiqiang Zhou3 Hongbo Zhang3 Bingbing Liu3 Yingcong Chen1,4 1HKUST(GZ) 2University of Adelaide 3Noah's Ark Lab 4HKUST EMAIL; EMAIL
Pseudocode No The paper does not contain any structured pseudocode or algorithm blocks. It uses diagrams (e.g., Figure 2, Figure 4, Figure 10) to illustrate the architecture and inference pipeline, but no step-by-step algorithms are provided in a code-like format.
Open Source Code No The paper does not contain an explicit statement about the release of source code for the methodology described, nor does it provide a direct link to a code repository. It mentions supplementary materials for more details on implementation but not for code access.
Open Datasets Yes Training Datasets. Both depth and normal estimation are trained on two synthetic datasets covering indoor and outdoor scenes: ①Hypersim (Roberts et al., 2021) is a photorealistic synthetic dataset featuring 461 indoor scenes. We use the official training split, which contains approximately 54K samples. After filtering out incomplete samples, around 39K samples remain, all resized to 576 × 768 for training. ②Virtual KITTI (Cabon et al., 2020) is a synthetic street-scene dataset with five urban scenes under various imaging and weather conditions. We utilize four of these scenes for training, comprising about 20K samples. All samples are cropped to 352 × 1216, with the far plane at 80m. Evaluation Datasets and Metrics. ①For zero-shot affine-invariant depth estimation, we evaluate Lotus on NYUv2 (Silberman et al., 2012), ScanNet (Dai et al., 2017), KITTI (Geiger et al., 2013), ETH3D (Schops et al., 2017), and DIODE (Vasiljevic et al., 2019) using absolute mean relative error (AbsRel), and also report δ1 and δ2 values. ②For surface normal prediction, we employ NYUv2, ScanNet, iBims-1 (Koch et al., 2018), Sintel (Butler et al., 2012) and OASIS (Chen et al., 2020) datasets, reporting mean angular error (m.) as well as the percentage of pixels with an angular error below 11.25° and 30°.
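The evaluation metrics named above have standard definitions; a minimal sketch of how they are typically computed (function names here are illustrative, not from the paper) is:

```python
import numpy as np

def abs_rel(pred, gt):
    """Absolute mean relative error: mean(|pred - gt| / gt)."""
    return np.mean(np.abs(pred - gt) / gt)

def delta_acc(pred, gt, thresh=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < thresh.
    thresh=1.25 gives delta_1; thresh=1.25**2 gives delta_2."""
    ratio = np.maximum(pred / gt, gt / pred)
    return np.mean(ratio < thresh)

def mean_angular_error(pred_n, gt_n):
    """Mean angular error in degrees between unit normal maps of shape (H, W, 3).
    The percentage-below-threshold metrics (11.25°, 30°) follow the same per-pixel angles."""
    cos = np.clip(np.sum(pred_n * gt_n, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()
```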
Dataset Splits Yes Training Datasets. Both depth and normal estimation are trained on two synthetic datasets covering indoor and outdoor scenes: ①Hypersim (Roberts et al., 2021) is a photorealistic synthetic dataset featuring 461 indoor scenes. We use the official training split, which contains approximately 54K samples. After filtering out incomplete samples, around 39K samples remain, all resized to 576 × 768 for training. ②Virtual KITTI (Cabon et al., 2020) is a synthetic street-scene dataset with five urban scenes under various imaging and weather conditions. We utilize four of these scenes for training, comprising about 20K samples. All samples are cropped to 352 × 1216, with the far plane at 80m. Following Marigold (Ke et al., 2024), we probabilistically choose one of the two datasets and then draw samples from it for each batch (Hypersim 90% and Virtual KITTI 10%).
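The per-batch dataset mixing described above (choose one dataset per batch, Hypersim with probability 0.9, Virtual KITTI with 0.1) can be sketched as follows; the function name and signature are illustrative assumptions, not from the paper's code:

```python
import random

def sample_batch(hypersim, vkitti, batch_size, p_hypersim=0.9, rng=random):
    """Pick one dataset for the whole batch (hypothetical sketch of the
    Marigold-style mixing: Hypersim 90%, Virtual KITTI 10%), then draw
    batch_size samples with replacement from the chosen dataset."""
    dataset = hypersim if rng.random() < p_hypersim else vkitti
    return [rng.choice(dataset) for _ in range(batch_size)]
```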
Hardware Specification No The paper does not provide specific details about the hardware used for running its experiments, such as GPU models, CPU types, or memory amounts. It only mentions 'graphic memory' in Figure 3 in the context of another model's requirements, not for their own experimental setup.
Software Dependencies No The paper states: 'We implement Lotus based on Stable Diffusion V2 (Rombach et al., 2022), without text conditioning.' While it mentions Stable Diffusion V2, it does not specify versions for other crucial software dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes Implementation details. We implement Lotus based on Stable Diffusion V2 (Rombach et al., 2022), without text conditioning. During training, we fix the time-step t = 1000. For depth estimation, we predict in disparity space, i.e., d′ = 1/d, where d′ represents the values in disparity space and d denotes the true depth. For more details, please see the supplementary materials. Following Marigold (Ke et al., 2024), we probabilistically choose one of the two datasets and then draw samples from it for each batch (Hypersim 90% and Virtual KITTI 10%).
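The disparity conversion d′ = 1/d quoted above is a simple elementwise inversion; a minimal sketch, assuming clipping to the 80 m far plane mentioned for Virtual KITTI (the function name and epsilon guard are illustrative assumptions):

```python
import numpy as np

def depth_to_disparity(depth, far_plane=80.0, eps=1e-6):
    """Convert true depth d to disparity d' = 1/d.
    Clipping to [eps, far_plane] is an illustrative assumption to avoid
    division by zero and to respect the 80 m far plane."""
    depth = np.clip(depth, eps, far_plane)
    return 1.0 / depth
```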