Tracking Everything Everywhere across Multiple Cameras

Authors: Li-Heng Wang, YuJu Cheng, Tyng-Luh Liu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Comprehensive experiments validate the method's ability to accurately establish point correspondences across cameras. Furthermore, our method achieves promising results of multiview pixel tracking without requiring the entire video sequences to be provided at once. We evaluate our method on two different benchmarks: the Plenoptic Video Dataset (Li et al. 2022) and the Immersive Video Dataset (Broxton et al. 2020), both of which are manually labeled for our experiments. The results demonstrate that our approach provides accurate point correspondences through long sequences across cameras, outperforming feature matching using DINO (Oquab et al. 2023).
Researcher Affiliation Academia 1) Institute of Information Science, Academia Sinica, Taiwan; 2) National Taiwan University
Pseudocode Yes Algorithm 1: Incremental Tracking Algorithm
Open Source Code No The paper does not provide any explicit statement about releasing code, a link to a code repository, or mention of code in supplementary materials.
Open Datasets Yes We collect multi-view data from the Plenoptic Video Dataset (Li et al. 2022) and the Immersive Video Dataset (Broxton et al. 2020). Both datasets consist of time-synchronized multi-view camera videos with significant view variations.
Dataset Splits No The paper mentions selecting four different views, training for 200,000 iterations, pre-training on the first 10 frames for 50,000 iterations, and then training for 2,000 iterations per timestep incrementally. It also mentions downsampling the datasets, reducing frame rates, and manually labeling correspondences for ground truth. However, it does not provide specific training/validation/test splits (e.g., percentages or exact counts) for the data used in evaluation, nor does it refer to standard, predefined splits beyond citing the datasets themselves.
Hardware Specification No We thank National Center for High-performance Computing for providing computing resources.
Software Dependencies No The paper mentions using 'DINO (Oquab et al. 2023)' and 'PATS (Ni et al. 2023)' as methods or tools for comparisons and data collection, but does not specify the version numbers of any software libraries, frameworks (like PyTorch or TensorFlow), or other dependencies used for implementing their own methodology.
Experiment Setup Yes We select four different views and train for a total of 200,000 iterations for each scene. View selection is done by ensuring that each video has enough variance in terms of view angle and scene coverage. We employ a two-phase training process. In phase 1, we perform the warm-up training for 100,000 iterations by selecting a random view and reducing the problem to pixel correspondence learning, akin to the single-view scenario. Phase 2 involves learning pixel correspondences both within a single view and between different views. For the incremental tracking experiments, we pre-train our model for the first 10 frames with 50,000 iterations via the same procedure. For each timestep thereafter, we train for 2,000 iterations with incremental settings. L = L_p + λ_c L_c + λ_t L_t + λ_s L_s, where λ_c, λ_t, λ_s are the respective loss-balancing weights.
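The reported setup reduces to a weighted-sum objective plus a two-phase (warm-up, then cross-view) and incremental schedule. A minimal sketch of both, assuming placeholder loss values and default weights of 1.0 (the paper does not report the weight values, and the function and parameter names here are hypothetical):

```python
def total_loss(l_p, l_c, l_t, l_s, lam_c=1.0, lam_t=1.0, lam_s=1.0):
    """Combined objective L = L_p + λ_c·L_c + λ_t·L_t + λ_s·L_s.

    The weight values lam_c, lam_t, lam_s are assumptions; the paper
    only states that they are loss-balancing weights.
    """
    return l_p + lam_c * l_c + lam_t * l_t + lam_s * l_s


def training_phase(iteration, warmup_iters=100_000):
    """Two-phase schedule: phase 1 is single-view warm-up for the first
    100,000 of 200,000 total iterations; phase 2 trains pixel
    correspondences within and across views."""
    return 1 if iteration < warmup_iters else 2


def incremental_iterations(timestep):
    """Incremental tracking: 50,000 pre-training iterations covering the
    first 10 frames, then 2,000 iterations for each later timestep."""
    return 50_000 if timestep == 0 else 2_000
```

This only mirrors the iteration counts and loss structure quoted above; it does not reproduce the method itself.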