DELTA: DENSE EFFICIENT LONG-RANGE 3D TRACKING FOR ANY VIDEO
Authors: Tuan Ngo, Peiye Zhuang, Evangelos Kalogerakis, Chuang Gan, Sergey Tulyakov, Hsin-Ying Lee, Chaoyang Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the superiority of DELTA on multiple benchmarks, achieving new state-of-the-art results in both 2D and 3D dense tracking tasks. |
| Researcher Affiliation | Collaboration | Tuan Duc Ngo (UMass Amherst); Peiye Zhuang (Snap Inc.); Chuang Gan (UMass Amherst); Evangelos Kalogerakis (UMass Amherst & TU Crete); Sergey Tulyakov (Snap Inc.); Hsin-Ying Lee (Snap Inc.); Chaoyang Wang (Snap Inc.) |
| Pseudocode | No | The paper describes the method and architecture in detail using prose and figures, but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | No | https://snap-research.github.io/DELTA/ is a project page, but the paper does not contain an unambiguous sentence stating that the authors are releasing code for the work described in this paper, nor is it explicitly a code repository. |
| Open Datasets | Yes | We leverage the Kubric simulator (Greff et al., 2022) to generate 5,632 training RGB-D videos and 143 testing videos... We use the CVO (Wu et al., 2023) test set... TAP-Vid3D (Koppula et al., 2024)... LSFOdyssey contains 90 40-frame videos derived from the Point Odyssey dataset (Zheng et al., 2023). |
| Dataset Splits | Yes | We leverage the Kubric simulator (Greff et al., 2022) to generate 5,632 training RGB-D videos and 143 testing videos... We use the CVO (Wu et al., 2023) test set, which originally includes two subsets: CVO-Clean and CVO-Final... We introduce an additional split, CVO-Extended, which includes 500 videos... We evaluate the performance of our approach across multiple tracking scenarios... 3D point tracking: We use two benchmarks: (1) TAP-Vid3D ... with a total of 4,569 videos for evaluation... (2) LSFOdyssey contains 90 40-frame videos... During training, to save GPU memory consumption, we randomly sample a patch of size N = 30 × 40 from the dense 3D trajectory map as supervision. |
| Hardware Specification | Yes | All stages are conducted on a machine with 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions software components such as the 'AdamW optimizer', 'linear one-cycle (Smith & Topin, 2019)' schedule, 'ZoeDepth (Bhat et al., 2023)', and 'UniDepth (Piccinelli et al., 2024)', but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | The total loss is defined as λ_2d · L_2d + λ_depth · L_depth + λ_visib · L_visib... We empirically set the weightings λ_2d, λ_depth, λ_visib to 100.0, 1.0, and 0.1... The batch size is set to 1 per GPU. The learning rate is initialized to 10^-4 and scheduled by a linear one-cycle policy... The input video is resized to 384 × 512 in both training and testing. The transformer network Φ is composed of 6 spatial and temporal attention blocks, utilizing 8 attention heads and 384 hidden channels. The number of iteration steps is set to 6. The number of anchor tracks is set to 9 × 12 during training. In the patch-wise dense local attention, we use a patch size of 6... In the high-resolution track upsampler, we use 9 neighbors (with κ = 3) and 2 cross-attention blocks. We first pretrain the model with the 2D loss and visibility loss for 100k iterations, then train with the full loss for another 100k iterations. |
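The reported experiment setup combines three loss terms with fixed weightings (λ_2d = 100.0, λ_depth = 1.0, λ_visib = 0.1). A minimal sketch of that weighted sum is below; the function and argument names are illustrative, not taken from the authors' (unreleased) code.

```python
# Hypothetical sketch of the total-loss weighting reported in the paper:
# L_total = λ_2d * L_2d + λ_depth * L_depth + λ_visib * L_visib
# with weights 100.0, 1.0, and 0.1 respectively.

LAMBDA_2D = 100.0
LAMBDA_DEPTH = 1.0
LAMBDA_VISIB = 0.1

def total_loss(l_2d: float, l_depth: float, l_visib: float) -> float:
    """Weighted sum of the 2D tracking, depth, and visibility losses."""
    return LAMBDA_2D * l_2d + LAMBDA_DEPTH * l_depth + LAMBDA_VISIB * l_visib
```

In practice the individual terms would be computed on tensors by a deep-learning framework; the pretraining stage described above would correspond to calling this with the depth term zeroed out.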