DepthART: Monocular Depth Estimation as Autoregressive Refinement Task

Authors: Bulat Gabdullin, Nina Konovalova, Nikolay Patakin, Dmitry Senushkin, Anton Konushin

IJCAI 2025

Reproducibility assessment (each row gives the variable, the assessed result, and the LLM response):
Research Type: Experimental. "Our experimental results demonstrate that the proposed training approach significantly enhances the performance of VAR in depth estimation tasks. When trained on the Hypersim dataset using our approach, the model achieves superior results across multiple unseen benchmarks compared to existing generative and discriminative baselines."
Researcher Affiliation: Academia. "Bulat Gabdullin 1,2, Nina Konovalova 1, Nikolay Patakin 1, Dmitry Senushkin 1, and Anton Konushin 1; 1 AIRI, Moscow, Russia; 2 HSE University."
Pseudocode: No. The paper describes its methods in prose and mathematical equations but contains no clearly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper provides a footnote ('https://bulatko.github.io/depthart-pp/') alongside the introduction of the work. This URL is a project demonstration page on GitHub Pages, not a direct link to a source-code repository.
Open Datasets: Yes. "Due to the requirement of dense ground-truth depth maps for variational autoencoders, we utilize the highly realistic synthetic Hypersim dataset [Roberts et al., 2021], which includes 461 diverse indoor scenes. Evaluation is performed on four datasets unseen during training: NYUv2 [Silberman et al., 2012] and iBims [Koch et al., 2019], capturing indoor environments; TUM [Li et al., 2019], capturing dynamic humans in an indoor environment; and ETH3D [Schops et al., 2017], providing high-quality depth maps for outdoor environments."
Dataset Splits: No. The paper mentions training on the Hypersim dataset and evaluating on several others, but it does not specify any train/validation/test splits (e.g., percentages, sample counts, or split methodology) needed for reproduction.
Hardware Specification: Yes. "Training of our model takes 17 hours using 4 NVIDIA H100 GPUs."
Software Dependencies: No. The paper mentions using an AdamW optimizer and a StepLR scheduler, but does not provide version numbers for any software libraries, programming languages, or frameworks (e.g., Python, PyTorch, CUDA).
Experiment Setup: Yes. "The Visual Autoregressive Transformer is trained with DepthART using the AdamW [Loshchilov and Hutter, 2019] optimizer with a learning rate of 1e-4, a weight decay of 1e-2, and a batch size of 4. Additionally, we decrease the learning rate during training with a StepLR scheduler with a step size of 10,000 and a gamma of 0.8. Training of our model takes 17 hours using 4 NVIDIA H100 GPUs. ... we train all models at this resolution [256×256]."
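As a reading aid, the learning-rate schedule quoted above (StepLR: multiply the rate by gamma every step_size steps, matching the semantics of PyTorch's torch.optim.lr_scheduler.StepLR) can be sketched in plain Python; the function name is illustrative, and only the numeric values (1e-4, 10,000, 0.8) come from the paper:

```python
def steplr_rate(base_lr: float, step_size: int, gamma: float, step: int) -> float:
    """Learning rate after `step` optimizer steps under a StepLR schedule:
    the base rate is multiplied by `gamma` once every `step_size` steps."""
    return base_lr * gamma ** (step // step_size)

# Values quoted from the paper: lr 1e-4, step size 10,000, gamma 0.8.
for step in (0, 10_000, 50_000):
    print(step, steplr_rate(1e-4, 10_000, 0.8, step))
```

Under this schedule the rate stays at 1e-4 for the first 10,000 steps, then decays geometrically by a factor of 0.8 per 10,000 steps.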