Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Depth Any Video with Scalable Synthetic Data

Authors: Honghui Yang, Di Huang, Wei Yin, Chunhua Shen, Haifeng Liu, Xiaofei He, Binbin Lin, Wanli Ouyang, Tong He

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency. The code and model weights are open-sourced. (Evidence from section headings: 4 Experiments; 4.1 Datasets and Evaluation Metrics; 4.2 Implementation Details; 4.3 Zero-Shot Depth Estimation, including Quantitative and Qualitative Comparisons; 4.4 Ablation Studies.)
Researcher Affiliation | Collaboration | Honghui Yang¹,², Di Huang⁴, Wei Yin², Chunhua Shen¹, Haifeng Liu¹, Xiaofei He¹, Binbin Lin³, Wanli Ouyang², Tong He² (¹State Key Lab of CAD&CG, Zhejiang University; ²Shanghai AI Laboratory; ³School of Software Technology, Zhejiang University; ⁴The University of Sydney)
Pseudocode | No | The paper describes methods and processes using descriptive text and mathematical equations, without including any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | Yes | The code and model weights are open-sourced.
Open Datasets | Yes | In addition to the collected DA-V dataset, we follow Ke et al. (2024) by incorporating two single-frame synthetic datasets, Hypersim (Roberts et al., 2021) and Virtual KITTI 2 (Cabon et al., 2020). [...] For monocular depth estimation, we conduct a series of experiments to evaluate our model's performance on four widely used benchmarks. NYUv2 (Silberman et al., 2012) and ScanNet (Yeshwanth et al., 2023) provide RGB-D data from indoor environments captured using Kinect cameras. ETH3D (Schops et al., 2017) features both indoor and outdoor scenes, with depth data collected by a laser scanner. KITTI (Geiger et al., 2012) comprises outdoor driving scenes captured by cameras and LiDAR sensors. For video depth estimation, we sample 98 video clips from ScanNet++ (Yeshwanth et al., 2023)...
Dataset Splits | Yes | Hypersim is a photorealistic synthetic dataset featuring 461 indoor scenes, from which we use the official train and val split, totaling approximately 68K samples.
Hardware Specification | Yes | Experiments are conducted on 32 NVIDIA A100 GPUs for 20 epochs, with a total training time of approximately 1 day. [...] The runtime evaluation is performed on a single NVIDIA A100 GPU with a resolution of 480×640.
Software Dependencies | No | Our implementation is based on SVD (Blattmann et al., 2023a), using the diffusers library (von Platen et al., 2022). We employ the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 6.4×10⁻⁵. While software libraries are mentioned, specific version numbers for these or other key dependencies are not provided in the text.
Experiment Setup | Yes | Our implementation is based on SVD (Blattmann et al., 2023a), using the diffusers library (von Platen et al., 2022). We employ the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 6.4×10⁻⁵. The model is trained at various resolutions: 512×512, 480×640, 707×707, 352×1216, and 1024×1024, with corresponding batch sizes of 384, 256, 192, 128, and 64. The video length is sampled from 1 to 6, with the batch size adjusting correspondingly to meet GPU memory requirements. Experiments are conducted on 32 NVIDIA A100 GPUs for 20 epochs, with a total training time of approximately 1 day. [...] During inference, we set the number of denoising steps to 3 and the ensemble size to 20 for benchmark comparison, following Ke et al. (2024), to ensure optimal performance.
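
To make the reported setup concrete, here is a minimal sketch, assuming PyTorch and a diffusers-style pipeline, of how the optimizer, multi-resolution batch schedule, and ensembled inference quoted above might be wired up. This is not the authors' released code: the helper names (`RESOLUTION_BATCH`, `sample_training_config`, `make_optimizer`, `predict_depth`), the batch-size scaling rule for longer clips, and the median aggregation are assumptions for illustration.

```python
import random
import torch

# Resolution -> batch size pairs as reported in the paper.
RESOLUTION_BATCH = {
    (512, 512): 384,
    (480, 640): 256,
    (707, 707): 192,
    (352, 1216): 128,
    (1024, 1024): 64,
}

def sample_training_config():
    """Pick a training resolution with its reported batch size and a clip
    length of 1 to 6 frames. The paper says the batch size is adjusted with
    clip length to fit GPU memory; the division rule below is a guess."""
    resolution, batch_size = random.choice(list(RESOLUTION_BATCH.items()))
    num_frames = random.randint(1, 6)
    return resolution, max(1, batch_size // num_frames), num_frames

def make_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """AdamW with the reported learning rate of 6.4e-5; `model` would be
    the SVD-based network loaded through diffusers."""
    return torch.optim.AdamW(model.parameters(), lr=6.4e-5)

@torch.no_grad()
def predict_depth(pipeline, video, steps=3, ensemble_size=20):
    """Ensembled inference following Ke et al. (2024): run the diffusion
    sampler several times from independent noise and aggregate. Median
    aggregation is an assumption; Marigold additionally aligns the
    predictions before merging."""
    preds = [pipeline(video, num_inference_steps=steps)
             for _ in range(ensemble_size)]
    return torch.stack(preds).median(dim=0).values
```

The 3-step, 20-member setting trades a small per-sample cost increase for the benchmark-comparison accuracy the paper reports; for faster inference one would shrink `ensemble_size` first, since the denoising step count is already minimal.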