Tracking Everything Everywhere across Multiple Cameras

Authors: Li-Heng Wang, YuJu Cheng, Tyng-Luh Liu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Comprehensive experiments validate the method's ability to accurately establish point correspondences across cameras. Furthermore, our method achieves promising results of multiview pixel tracking without requiring the entire video sequences to be provided at once. We evaluate our method on two different benchmarks: the Plenoptic Video Dataset (Li et al. 2022) and the Immersive Video Dataset (Broxton et al. 2020), both of which are manually labeled for our experiments. The results demonstrate that our approach provides accurate point correspondences through long sequences across cameras, outperforming feature matching using DINO (Oquab et al. 2023).
Researcher Affiliation Academia 1) Institute of Information Science, Academia Sinica, Taiwan; 2) National Taiwan University
Pseudocode Yes Algorithm 1: Incremental Tracking Algorithm
Open Source Code No The paper does not provide any explicit statement about releasing code, a link to a code repository, or mention of code in supplementary materials.
Open Datasets Yes We collect multi-view data from the Plenoptic Video Dataset (Li et al. 2022) and the Immersive Video Dataset (Broxton et al. 2020). Both datasets consist of time-synchronized multi-view camera videos with significant view variations.
Dataset Splits No The paper mentions selecting four different views, training for 200,000 iterations, pre-training on the first 10 frames for 50,000 iterations, and then training for 2,000 iterations per timestep incrementally. It also mentions downsampling the datasets, reducing frame rates, and manually labeling correspondences for ground truth. However, it does not provide specific training/validation/test splits (e.g., percentages or exact counts) for the data used in evaluation, nor does it refer to standard, predefined splits beyond citing the datasets themselves.
Hardware Specification No We thank National Center for High-performance Computing for providing computing resources.
Software Dependencies No The paper mentions using 'DINO (Oquab et al. 2023)' and 'PATS (Ni et al. 2023)' as methods or tools for comparisons and data collection, but does not specify the version numbers of any software libraries, frameworks (like PyTorch or TensorFlow), or other dependencies used for implementing their own methodology.
Experiment Setup Yes We select four different views and train for a total of 200,000 iterations for each scene. View selection is done by ensuring that each video has enough variance in terms of view angle and scene coverage. We employ a two-phase training process. In phase 1, we perform the warm-up training for 100,000 iterations by selecting a random view and reducing the problem to pixel correspondence learning, akin to the single-view scenario. Phase 2 involves learning pixel correspondences both within a single view and between different views. For the incremental tracking experiments, we pre-train our model for the first 10 frames with 50,000 iterations via the same procedure. For each timestep thereafter, we train for 2,000 iterations with incremental settings. L = L_p + λ_c L_c + λ_t L_t + λ_s L_s, where λ_c, λ_t, λ_s are the respective loss-balancing weights.
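The reported setup reduces to a weighted-sum objective plus a two-phase (warm-up, then cross-view) and incremental schedule. A minimal sketch of both, assuming placeholder loss values and default weights of 1.0 (the paper does not report the weight values, and the function and parameter names here are hypothetical):

```python
def total_loss(l_p, l_c, l_t, l_s, lam_c=1.0, lam_t=1.0, lam_s=1.0):
    """Combined objective L = L_p + λ_c·L_c + λ_t·L_t + λ_s·L_s.

    The weight values lam_c, lam_t, lam_s are assumptions; the paper
    only states that they are loss-balancing weights.
    """
    return l_p + lam_c * l_c + lam_t * l_t + lam_s * l_s


def training_phase(iteration, warmup_iters=100_000):
    """Two-phase schedule: phase 1 is single-view warm-up for the first
    100,000 of 200,000 total iterations; phase 2 trains pixel
    correspondences within and across views."""
    return 1 if iteration < warmup_iters else 2


def incremental_iterations(timestep):
    """Incremental tracking: 50,000 pre-training iterations covering the
    first 10 frames, then 2,000 iterations for each later timestep."""
    return 50_000 if timestep == 0 else 2_000
```

This only mirrors the iteration counts and loss structure quoted above; it does not reproduce the method itself.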