DepthART: Monocular Depth Estimation as Autoregressive Refinement Task
Authors: Bulat Gabdullin, Nina Konovalova, Nikolay Patakin, Dmitry Senushkin, Anton Konushin
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate that the proposed training approach significantly enhances the performance of VAR in depth estimation tasks. When trained on Hypersim dataset using our approach, the model achieves superior results across multiple unseen benchmarks compared to existing generative and discriminative baselines. |
| Researcher Affiliation | Academia | Bulat Gabdullin1,2 , Nina Konovalova1 , Nikolay Patakin1 , Dmitry Senushkin1 and Anton Konushin1 1AIRI, Moscow, Russia 2HSE University EMAIL |
| Pseudocode | No | The paper describes methods using prose and mathematical equations but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a footnote '1https://bulatko.github.io/depthart-pp/' next to the introduction of the work. This URL is a project demonstration page on GitHub Pages, not a direct link to a source-code repository. |
| Open Datasets | Yes | Due to the requirement of dense ground-truth depth maps for variational autoencoders, we utilize the highly realistic synthetic Hypersim dataset [Roberts et al., 2021], which includes 461 diverse indoor scenes. Evaluation is performed on four datasets unseen during training: NYUv2 [Silberman et al., 2012] and IBIMS [Koch et al., 2019] capturing indoor environments, TUM [Li et al., 2019] capturing dynamic humans in indoor environments, and ETH3D [Schops et al., 2017] providing high-quality depth maps for outdoor environments. |
| Dataset Splits | No | The paper mentions using the Hypersim dataset for training and several other datasets for evaluation, but it does not specify any particular train/validation/test splits (e.g., percentages, sample counts, or methodology) for these datasets needed for reproduction. |
| Hardware Specification | Yes | Training of our model takes 17 hours using 4 NVIDIA H100 GPUs. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and a StepLR scheduler, but does not provide specific version numbers for any software libraries, programming languages, or frameworks used (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Visual Autoregressive Transformer is trained with DepthART using the AdamW [Loshchilov and Hutter, 2019] optimizer with a learning rate of 10⁻⁴, weight decay of 10⁻², and batch size of 4. Additionally, we decrease the learning rate during training with a StepLR scheduler with a step size of 10,000 and a gamma of 0.8. Training of our model takes 17 hours using 4 NVIDIA H100 GPUs. ... we train all models at this resolution [256×256]. |
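The reported StepLR schedule (base learning rate 10⁻⁴, decayed by a factor of 0.8 every 10,000 steps) can be sketched as a small closed-form helper. This is a minimal illustration of the schedule the paper describes, not the authors' code; the function name `lr_at` and the framework-free formulation are our own.

```python
def lr_at(step: int,
          base_lr: float = 1e-4,    # reported learning rate
          step_size: int = 10_000,  # StepLR step size from the paper
          gamma: float = 0.8) -> float:
    """Learning rate after `step` optimizer steps under a StepLR schedule.

    StepLR multiplies the learning rate by `gamma` once every `step_size`
    steps, so the rate at any step is base_lr * gamma ** (step // step_size).
    """
    return base_lr * gamma ** (step // step_size)
```

For example, the learning rate stays at 10⁻⁴ for the first 10,000 steps, drops to 8×10⁻⁵ for the next 10,000, and so on; the same behavior is obtained in PyTorch via `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.8)`.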