Deconstructing Self-Supervised Monocular Reconstruction: The Design Decisions that Matter

Authors: Jaime Spencer, Chris Russell, Simon Hadfield, Richard Bowden

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper presents an open and comprehensive framework to systematically evaluate state-of-the-art contributions to self-supervised monocular depth estimation. This includes pretraining, backbone, architectural design choices and loss functions. [...] A systematic evaluation of papers in this field was not straightforward. [...] We re-implement, validate and re-evaluate 16 state-of-the-art contributions and introduce a new dataset (SYNS-Patches) containing dense outdoor depth maps in a variety of both natural and urban scenes.
Researcher Affiliation | Collaboration | Jaime Spencer (EMAIL; CVSSP, University of Surrey); Chris Russell (EMAIL; Amazon); Simon Hadfield (EMAIL; CVSSP, University of Surrey); Richard Bowden (EMAIL; CVSSP, University of Surrey)
Pseudocode | No | The paper provides mathematical equations describing the model and loss functions (e.g., equations 1-21 in Section A), but it does not include any structured pseudocode or algorithm blocks with numbered steps.
Open Source Code | Yes | To aid future research in this area, we release a modular codebase (https://github.com/jspenmar/monodepth_benchmark), allowing for easy evaluation of alternate design decisions against corrected data and evaluation criteria. [...] 1Code is publicly available at https://github.com/jspenmar/monodepth_benchmark.
Open Datasets | Yes | Our benchmark consists of two datasets: the Kitti Eigen-Benchmark (KEB) split (Uhrig et al., 2018) and the newly introduced SYNS-Patches dataset. [...] The second dataset in our benchmark is the novel SYNS-Patches, based on SYNS (Adams et al., 2016). The original SYNS is composed of 92 scenes from 9 different categories. [...] We make these resources available to the wider research community, contributing to the further advancement of self-supervised monocular depth estimation.
Dataset Splits | Yes | We train these models on the common Kitti Eigen-Zhou (KEZ) training split (Zhou et al., 2017), containing 39,810 frames from the KE split where static frames are discarded. Most previous works perform their ablation studies on the KE test set, where the final models are also evaluated. [...] We instead use a random set of 700 images from the KEZ validation split with updated ground-truth depth maps (Uhrig et al., 2018). [...] The final test set contains 1,175 of the possible 1,656 images. [...] Note that SYNS-Patches is purely a testing dataset never used for training.
Hardware Specification | Yes | Frames per second were measured on an NVIDIA GeForce RTX 3090 with an image of size 192 × 640.
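The reported frames-per-second figure can be reproduced with a straightforward timing loop. The sketch below is a minimal pure-Python illustration with a dummy callable standing in for the Depth Net; the function name and warm-up/iteration counts are illustrative assumptions, and on a real GPU setup one would load the released PyTorch models and call `torch.cuda.synchronize()` before each clock reading.

```python
import time

def measure_fps(model, image, warmup=10, iters=100):
    """Time repeated forward passes and report frames per second.

    `model` is any callable taking one image. NOTE: on a real GPU
    benchmark, synchronize the device (torch.cuda.synchronize())
    before each time reading, or the measurement is meaningless.
    """
    for _ in range(warmup):        # warm-up passes exclude one-off setup costs
        model(image)
    start = time.perf_counter()
    for _ in range(iters):
        model(image)
    elapsed = time.perf_counter() - start
    return iters / elapsed

# Dummy stand-in: a 192 x 640 "image" and a trivial per-pixel model.
dummy_image = [[0.0] * 640 for _ in range(192)]
dummy_model = lambda img: [[v * 0.5 for v in row] for row in img]
fps = measure_fps(dummy_model, dummy_image, warmup=2, iters=20)
print(f"{fps:.1f} FPS")
```

The warm-up loop matters in practice: the first few forward passes typically pay for memory allocation and kernel compilation, which would otherwise deflate the FPS estimate.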
Software Dependencies | No | The paper mentions software components such as the timm library (Wightman, 2019) and the Adam optimizer, but it does not specify exact version numbers for any of them (e.g., 'PyTorch 1.9' or 'timm 0.5.4').
Experiment Setup | Yes | Models were trained for 30 epochs using Adam with a base learning rate of 1e-4, reduced to 1e-5 halfway through training. The default Depth Net backbone is a pretrained ConvNeXt-T (Liu et al., 2022), while Pose Net uses a pretrained ResNet-18 (He et al., 2016). We use an image resolution of 192 × 640 with a batch size of 8. Horizontal flips and color jittering are randomly applied with a probability of 0.5. [...] We use edge-aware smoothness regularization (Godard et al., 2017) with a scaling factor of 0.001. [...] each model variant is trained using three random seeds and mean performance is reported.
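The edge-aware smoothness regularizer cited above (Godard et al., 2017) penalizes disparity gradients, down-weighted where the image itself has strong edges, so predicted depth is encouraged to be smooth except at object boundaries. The NumPy sketch below is a minimal illustration under stated assumptions: the function name, the mean-normalization of disparity, and the grayscale input are choices made here for clarity, and the paper's exact implementation may differ; only the 0.001 scaling factor is taken from the quoted setup.

```python
import numpy as np

def edge_aware_smoothness(disp, img, weight=0.001):
    """Edge-aware smoothness term in the style of Godard et al. (2017).

    disp: (H, W) disparity map; img: (H, W) grayscale image.
    Disparity gradients are attenuated by exp(-|image gradient|),
    so the penalty is relaxed at image edges. The scalar `weight`
    (0.001 in the quoted setup) scales the final term.
    """
    disp = disp / (disp.mean() + 1e-7)       # scale-normalize disparity
    d_dx = np.abs(np.diff(disp, axis=1))     # horizontal disparity gradient
    d_dy = np.abs(np.diff(disp, axis=0))     # vertical disparity gradient
    i_dx = np.abs(np.diff(img, axis=1))      # horizontal image gradient
    i_dy = np.abs(np.diff(img, axis=0))      # vertical image gradient
    loss = (d_dx * np.exp(-i_dx)).mean() + (d_dy * np.exp(-i_dy)).mean()
    return weight * loss

# A constant disparity map has zero gradients, so the penalty vanishes.
flat = edge_aware_smoothness(np.ones((4, 4)), np.zeros((4, 4)))

# A disparity ramp over a textureless image is penalized.
ramp = edge_aware_smoothness(np.tile(np.arange(4.0), (4, 1)), np.zeros((4, 4)))
```

The exponential weighting is the key design choice: without it, the regularizer would blur depth across object boundaries, where large disparity discontinuities are in fact correct.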