DELTA: DENSE EFFICIENT LONG-RANGE 3D TRACKING FOR ANY VIDEO

Authors: Tuan Ngo, Peiye Zhuang, Evangelos Kalogerakis, Chuang Gan, Sergey Tulyakov, Hsin-Ying Lee, Chaoyang Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. Extensive experiments demonstrate the superiority of DELTA on multiple benchmarks, achieving new state-of-the-art results in both 2D and 3D dense tracking tasks.
Researcher Affiliation: Collaboration. Tuan Duc Ngo (UMass Amherst); Peiye Zhuang (Snap Inc.); Chuang Gan (UMass Amherst); Evangelos Kalogerakis (UMass Amherst & TU Crete); Sergey Tulyakov (Snap Inc.); Hsin-Ying Lee (Snap Inc.); Chaoyang Wang (Snap Inc.)
Pseudocode: No. The paper describes the method and architecture in detail using prose and figures, but it does not include a clearly labeled pseudocode block or algorithm.
Open Source Code: No. https://snap-research.github.io/DELTA/ is a project page, but the paper does not contain an unambiguous statement that the authors are releasing code for this work, nor does the page explicitly link to a code repository.
Open Datasets: Yes. We leverage the Kubric simulator (Greff et al., 2022) to generate 5,632 training RGB-D videos and 143 testing videos... We use the CVO (Wu et al., 2023) test set... TAP-Vid3D (Koppula et al., 2024)... LSFOdyssey contains 90 40-frame videos derived from the PointOdyssey dataset (Zheng et al., 2023).
Dataset Splits: Yes. We leverage the Kubric simulator (Greff et al., 2022) to generate 5,632 training RGB-D videos and 143 testing videos... We use the CVO (Wu et al., 2023) test set, which originally includes two subsets: CVO-Clean and CVO-Final... We introduce an additional split, CVO-Extended, which includes 500 videos... We evaluate the performance of our approach across multiple tracking scenarios... 3D point tracking: We use two benchmarks: (1) TAP-Vid3D ... with a total of 4,569 videos for evaluation... (2) LSFOdyssey contains 90 40-frame videos... During training, to save GPU memory, we randomly sample a patch of size N = 30×40 from the dense 3D trajectory map as supervision.
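The patch-based supervision quoted above can be sketched as follows. This is a minimal illustration only: the (T, H, W, 3) trajectory-map layout, the helper name, and the use of numpy are assumptions, since the paper specifies only that a 30×40 patch is randomly sampled to save GPU memory.

```python
import numpy as np

def sample_supervision_patch(traj_map, ph=30, pw=40, rng=None):
    """Randomly crop a ph x pw spatial patch of dense 3D trajectories.

    traj_map: array of shape (T, H, W, 3) -- a hypothetical layout for
    the dense 3D trajectory map; only the 30x40 patch size comes from
    the paper.
    """
    if rng is None:
        rng = np.random.default_rng()
    T, H, W, _ = traj_map.shape
    # Pick a random top-left corner so the patch fits inside the map.
    y0 = rng.integers(0, H - ph + 1)
    x0 = rng.integers(0, W - pw + 1)
    return traj_map[:, y0:y0 + ph, x0:x0 + pw, :]
```

Supervising on a random spatial crop rather than the full trajectory map keeps the loss dense locally while bounding per-step memory.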
Hardware Specification: Yes. All stages are conducted on a machine with 8 A100 GPUs.
Software Dependencies: No. The paper mentions software components such as the AdamW optimizer, linear one-cycle scheduling (Smith & Topin, 2019), ZoeDepth (Bhat et al., 2023), and UniDepth (Piccinelli et al., 2024), but does not provide specific version numbers for any of them.
Experiment Setup: Yes. The total loss is defined as λ2d·L2D + λdepth·Ldepth + λvisib·Lvisib... We empirically set the weightings λ2d, λdepth, λvisib to 100.0, 1.0, and 0.1... The batch size is set to 1 per GPU. The learning rate is initialized to 10⁻⁴ and scheduled by a linear one-cycle policy (Smith & Topin, 2019)... The input video is resized to 384×512 in both training and testing. The transformer network Φ is composed of 6 spatial and temporal attention blocks, utilizing 8 attention heads and 384 hidden channels. The number of iteration steps is set to 6. The number of anchor tracks is set to 9×12 during training. In the patch-wise dense local attention, we use a patch size of 6... In the high-resolution track upsampler, we use 9 neighbors (with κ = 3) and 2 cross-attention blocks. We first pretrain the model with the 2D loss and visibility loss for 100k iterations, then train with the full loss for another 100k iterations.
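The reported loss weighting can be written out as a small sketch; the function name and plain-Python form are assumptions for illustration, since the paper only states the formula and the weight values (λ2d = 100.0, λdepth = 1.0, λvisib = 0.1).

```python
# Weights as reported in the experiment setup.
LAMBDA_2D = 100.0
LAMBDA_DEPTH = 1.0
LAMBDA_VISIB = 0.1

def total_loss(l_2d: float, l_depth: float, l_visib: float) -> float:
    """L = lambda_2d * L_2D + lambda_depth * L_depth + lambda_visib * L_visib."""
    return LAMBDA_2D * l_2d + LAMBDA_DEPTH * l_depth + LAMBDA_VISIB * l_visib
```

With these weights, the 2D reprojection term dominates the objective by two to three orders of magnitude over the depth and visibility terms.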