Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Depth Any Video with Scalable Synthetic Data
Authors: Honghui Yang, Di Huang, Wei Yin, Chunhua Shen, Haifeng Liu, Xiaofei He, Binbin Lin, Wanli Ouyang, Tong He
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency. The code and model weights are open-sourced. The paper includes a dedicated experiments section (Section 4) covering datasets and evaluation metrics (4.1), implementation details (4.2), zero-shot depth estimation with quantitative and qualitative comparisons (4.3), and ablation studies (4.4). |
| Researcher Affiliation | Collaboration | Honghui Yang¹·², Di Huang⁴, Wei Yin², Chunhua Shen¹, Haifeng Liu¹, Xiaofei He¹, Binbin Lin³, Wanli Ouyang², Tong He² — ¹State Key Lab of CAD&CG, Zhejiang University; ²Shanghai AI Laboratory; ³School of Software Technology, Zhejiang University; ⁴The University of Sydney |
| Pseudocode | No | The paper describes methods and processes using descriptive text and mathematical equations, without including any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | The code and model weights are open-sourced. |
| Open Datasets | Yes | In addition to the collected DA-V dataset, we follow Ke et al. (2024) by incorporating two single-frame synthetic datasets, Hypersim (Roberts et al., 2021) and Virtual KITTI 2 (Cabon et al., 2020). [...] For monocular depth estimation, we conduct a series of experiments to evaluate our model's performance on four widely used benchmarks. NYUv2 (Silberman et al., 2012) and ScanNet (Yeshwanth et al., 2023) provide RGB-D data from indoor environments captured using Kinect cameras. ETH3D (Schops et al., 2017) features both indoor and outdoor scenes, with depth data collected by a laser scanner. KITTI (Geiger et al., 2012) comprises outdoor driving scenes captured by cameras and LiDAR sensors. For video depth estimation, we sample 98 video clips from ScanNet++ (Yeshwanth et al., 2023)... |
| Dataset Splits | Yes | Hypersim is a photorealistic synthetic dataset featuring 461 indoor scenes, from which we use the official train and val split, totaling approximately 68K samples. |
| Hardware Specification | Yes | Experiments are conducted on 32 NVIDIA A100 GPUs for 20 epochs, with a total training time of approximately 1 day. [...] The runtime evaluation is performed on a single NVIDIA A100 GPU with a resolution of 480×640. |
| Software Dependencies | No | Our implementation is based on SVD (Blattmann et al., 2023a), using the diffusers library (von Platen et al., 2022). We employ the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 6.4 × 10⁻⁵. While software libraries are mentioned, specific version numbers for these or other key dependencies are not provided in the text. |
| Experiment Setup | Yes | Our implementation is based on SVD (Blattmann et al., 2023a), using the diffusers library (von Platen et al., 2022). We employ the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 6.4 × 10⁻⁵. The model is trained at various resolutions: 512×512, 480×640, 707×707, 352×1216, and 1024×1024, with corresponding batch sizes of 384, 256, 192, 128, and 64. The video length is sampled from 1 to 6, with the batch size adjusting correspondingly to meet GPU memory requirements. Experiments are conducted on 32 NVIDIA A100 GPUs for 20 epochs, with a total training time of approximately 1 day. [...] During inference, we set the number of denoising steps to 3 and the ensemble size to 20 for benchmark comparison, following Ke et al. (2024), to ensure optimal performance. (Illustrative sketches of this setup follow the table.) |
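
To make the quoted Experiment Setup concrete, here is a minimal sketch of the described training configuration, assuming a diffusers SVD-style UNet. The checkpoint name and the plain resolution-to-batch-size dict are illustrative assumptions, not the authors' code; only the hyperparameter values come from the excerpts above.

```python
import torch
from diffusers import UNetSpatioTemporalConditionModel

# Resolution -> total batch size schedule quoted in the Experiment Setup row.
RESOLUTION_BATCH = {
    (512, 512): 384,
    (480, 640): 256,
    (707, 707): 192,
    (352, 1216): 128,
    (1024, 1024): 64,
}

# Assumed SVD backbone: the paper says the implementation is based on SVD via
# the diffusers library, but the exact checkpoint is not given in the excerpts.
unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", subfolder="unet"
)

# AdamW at the quoted learning rate of 6.4e-5 (Loshchilov & Hutter, 2019).
optimizer = torch.optim.AdamW(unet.parameters(), lr=6.4e-5)
```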
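The quoted inference protocol (3 denoising steps, ensemble of 20) can be sketched similarly. Here `pipe` stands for a hypothetical video-depth pipeline callable; the real DAV inference API is not quoted in the report, and the plain mean is a simplification, since affine-invariant methods such as Ke et al. (2024) align predictions in scale and shift before averaging.

```python
import torch

@torch.no_grad()
def ensemble_depth(pipe, video, num_steps=3, ensemble_size=20):
    """Average `ensemble_size` stochastic depth predictions, each produced
    with `num_steps` denoising steps. `pipe` is a hypothetical callable
    returning a per-frame depth tensor for the input video clip."""
    preds = [
        pipe(video, num_inference_steps=num_steps)
        for _ in range(ensemble_size)
    ]
    # Naive averaging; scale/shift alignment across ensemble members would
    # precede the mean in an affine-invariant setup.
    return torch.stack(preds).mean(dim=0)
```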