DepthFM: Fast Generative Monocular Depth Estimation with Flow Matching
Authors: Ming Gui, Johannes Schusterbauer, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Vincent Tao Hu, Björn Ommer
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train our depth estimation model on two synthetic datasets, Hypersim (Roberts et al. 2021) and Virtual KITTI (Cabon, Murray, and Humenberger 2020) to cover both indoor and outdoor scenes. We perform zero-shot evaluations on established real-world depth estimation benchmarks NYUv2 (Nathan Silberman and Fergus 2012), KITTI (Behley et al. 2019), ETH3D (Schops et al. 2017), ScanNet (Dai et al. 2017), and DIODE (Vasiljevic et al. 2019). Table 2 compares our model quantitatively with state-of-the-art depth estimation methods. |
| Researcher Affiliation | Academia | CompVis @ LMU Munich, Munich Center for Machine Learning |
| Pseudocode | No | The paper describes methods using mathematical equations and textual explanations, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/CompVis/depth-fm |
| Open Datasets | Yes | We train our depth estimation model on two synthetic datasets, Hypersim (Roberts et al. 2021) and Virtual KITTI (Cabon, Murray, and Humenberger 2020) to cover both indoor and outdoor scenes. We leverage Metric3D v2 (Hu et al. 2024a) as our teacher model. We perform zero-shot evaluations on established real-world depth estimation benchmarks NYUv2 (Nathan Silberman and Fergus 2012), KITTI (Behley et al. 2019), ETH3D (Schops et al. 2017), ScanNet (Dai et al. 2017), and DIODE (Vasiljevic et al. 2019). On the high-resolution Middlebury-2014 dataset (Scharstein et al. 2014)... |
| Dataset Splits | Yes | Following (Ke et al. 2024) we take 54K training samples from Hypersim and 20K training samples from Virtual KITTI. By training only on 74K synthetic samples and an additional 7.4K samples from a discriminative depth estimation method... we fine-tune our DepthFM to complete depth maps where only 2% of the ground truth pixels are available |
| Hardware Specification | No | The authors gratefully acknowledge the Gauss Center for Supercomputing for providing compute through the NIC on JUWELS at JSC and the HPC resources supplied by the Erlangen National High Performance Computing Center (NHR@FAU funded by DFG). While these are specific supercomputing centers, the paper does not specify exact GPU/CPU models, processor types, or memory amounts used for the experiments. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers for libraries or frameworks used in the implementation. |
| Experiment Setup | Yes | Unless otherwise specified, we evaluate our model using an ensemble size of 10 and 4 Euler steps, and scale and shift our predictions to match the ground truth depth in log space. For LoRA, we use rank 8 and keep the rest of the training details the same. Through empirical analysis in Table 9, we determine that a noise augmentation level of t_s = 0.4 is optimal. |
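The scale-and-shift alignment quoted in the Experiment Setup row is a standard affine-invariant evaluation step: before computing metrics, a least-squares scale and shift are fitted between predicted and ground-truth depth in log space. The sketch below is a minimal illustration of that protocol, not the authors' released code; the helper names (`align_log_depth`, `delta1`) and the `eps` guard are our own assumptions.

```python
import numpy as np

def align_log_depth(pred, gt, eps=1e-6):
    """Fit scale s and shift t minimizing ||s*log(pred) + t - log(gt)||^2,
    then return the prediction mapped back to linear depth.
    Hypothetical helper; the paper only states that predictions are
    scale/shift-aligned to ground truth in log space before evaluation."""
    lp = np.log(pred + eps).ravel()
    lg = np.log(gt + eps).ravel()
    A = np.stack([lp, np.ones_like(lp)], axis=1)       # design matrix [log pred, 1]
    (s, t), *_ = np.linalg.lstsq(A, lg, rcond=None)    # closed-form least squares
    return np.exp(s * np.log(pred + eps) + t)          # aligned depth map

def delta1(pred, gt):
    """Common depth-accuracy metric: fraction of pixels whose
    max(pred/gt, gt/pred) ratio is below 1.25."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float((ratio < 1.25).mean())
```

With this alignment, a prediction that is correct only up to a global scale (e.g. `pred = 3 * gt`) is mapped exactly onto the ground truth, which is why affine-invariant models can be compared fairly on metric benchmarks.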