Trajectory attention for fine-grained video motion control

Authors: Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | 5 EXPERIMENTS, 5.1 EXPERIMENTAL SETTINGS. Datasets. We use MiraData (Ju et al., 2024) for training, a large-scale video dataset with long-duration videos and structured captions, featuring realistic and dynamic scenes from games or daily life. We sample short video clips and apply Yang et al. (2023a) to extract optical flow as trajectory guidance. In total, we train with 10k video clips. Implementation Details. We conducted our main experiments using SVD (Blattmann et al., 2023), employing the Adam optimizer with a learning rate of 1e-5 per batch size, with mixed precision training of fp16. Metrics. We assessed the conditional generation performance using four distinct metrics: (1) Absolute Trajectory Error (ATE) (Goel et al., 1999), which quantifies the deviation between the estimated and actual trajectories of a camera or robot; (2) Relative Pose Error (RPE) (Goel et al., 1999), which captures the drift in the estimated pose by separately calculating the translation (RPE-T) and rotation (RPE-R) errors; and (3) Fréchet Inception Distance (FID) (Heusel et al., 2017), which evaluates the quality and variability of the generated views.
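The ATE metric quoted above can be sketched in a few lines. This is a minimal NumPy version under stated assumptions: trajectories are given as (N, 2) arrays of 2D positions, and only the centroids are aligned, whereas the full metric typically uses a rigid-body (Umeyama) alignment before computing the RMSE.

```python
import numpy as np

def absolute_trajectory_error(est, gt):
    """Absolute Trajectory Error: RMSE of the positional deviation
    between an estimated and a ground-truth trajectory.

    est, gt: (N, 2) arrays of 2D positions. This sketch aligns only
    the centroids; the full metric uses a rigid-body alignment."""
    est = est - est.mean(axis=0)        # remove translation offset
    gt = gt - gt.mean(axis=0)
    # per-point squared deviation, then root-mean-square
    return float(np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1))))
```

A pure translation between the two trajectories yields zero error, since the centroid alignment removes it; any residual shape difference contributes to the RMSE.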
Researcher Affiliation | Collaboration | Zeqi Xiao¹, Wenqi Ouyang¹, Yifan Zhou¹, Shuai Yang², Lei Yang³, Jianlou Si³, Xingang Pan¹. ¹S-Lab, Nanyang Technological University; ²Wangxuan Institute of Computer Technology, Peking University; ³SenseTime Research.
Pseudocode | Yes | Algorithm 1: Trajectory-based sampling. Input: hidden states Z ∈ R^{F×H×W×C}, where F is the number of frames, H, W are the spatial dimensions, and C is the number of channels; L trajectories Tr ∈ R^{L×F×2}, where each trajectory specifies F 2D locations; trajectory masks M ∈ R^{F×L}, where M_{f,l} ∈ {0, 1} indicates whether trajectory l is valid at frame f. ... Algorithm 2: Back projection. Input: hidden states after attention Z_t ∈ R^{F×L×C}; L trajectories Tr ∈ R^{L×F×2}; trajectory masks M ∈ R^{F×L}. 1: Initialize Z_p ∈ R^{F×H×W×C}, U ∈ R^{F×H×W}; Z_p = 0, U = 0.
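The two algorithms above can be sketched as straightforward gather/scatter operations. A minimal NumPy version, assuming integer trajectory coordinates (the paper's version operates on attention hidden states and may use sub-pixel sampling); the function names are mine, and the accumulator U counts how many trajectories land on each location so the scatter can average:

```python
import numpy as np

def trajectory_sample(Z, Tr, M):
    """Algorithm 1 sketch: gather features along trajectories.
    Z:  (F, H, W, C) hidden states
    Tr: (L, F, 2) integer (x, y) locations per frame
    M:  (F, L) validity mask
    Returns Zt: (F, L, C)"""
    F, H, W, C = Z.shape
    L = Tr.shape[0]
    Zt = np.zeros((F, L, C), dtype=Z.dtype)
    for l in range(L):
        for f in range(F):
            if M[f, l]:
                x, y = Tr[l, f]
                Zt[f, l] = Z[f, y, x]
    return Zt

def back_project(Zt, Tr, M, H, W):
    """Algorithm 2 sketch: scatter trajectory features back onto the
    spatial grid, averaging where multiple trajectories overlap.
    Zt: (F, L, C) hidden states after attention.
    Returns Zp: (F, H, W, C)"""
    F, L, C = Zt.shape
    Zp = np.zeros((F, H, W, C), dtype=Zt.dtype)
    U = np.zeros((F, H, W), dtype=np.int64)   # per-location hit counter
    for l in range(L):
        for f in range(F):
            if M[f, l]:
                x, y = Tr[l, f]
                Zp[f, y, x] += Zt[f, l]
                U[f, y, x] += 1
    hit = U > 0
    Zp[hit] = Zp[hit] / U[hit][:, None]       # average overlapping writes
    return Zp
```

Round-tripping a feature map through `trajectory_sample` and `back_project` reproduces the original features at the trajectory locations and leaves untouched locations at zero.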
Open Source Code | No | Project page at this URL.
Open Datasets | Yes | Datasets. We use MiraData (Ju et al., 2024) for training, a large-scale video dataset with long-duration videos and structured captions, featuring realistic and dynamic scenes from games or daily life.
Dataset Splits | No | Datasets. We use MiraData (Ju et al., 2024) for training, a large-scale video dataset with long-duration videos and structured captions, featuring realistic and dynamic scenes from games or daily life. We sample short video clips and apply Yang et al. (2023a) to extract optical flow as trajectory guidance. In total, we train with 10k video clips.
Hardware Specification | Yes | Our efficient training design allows for approximately 24 GPU hours of training (with a batch size of 1 on a single A100 GPU over the course of one day).
Software Dependencies | No | We conducted our main experiments using SVD (Blattmann et al., 2023), employing the Adam optimizer with a learning rate of 1e-5 per batch size, with mixed precision training of fp16.
Experiment Setup | Yes | We conducted our main experiments using SVD (Blattmann et al., 2023), employing the Adam optimizer with a learning rate of 1e-5 per batch size, with mixed precision training of fp16. We only fine-tune the additional trajectory attention modules, which inherit weights from the temporal modules.
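The fine-tuning recipe quoted above (freeze the pretrained SVD backbone, train only the newly added trajectory attention modules with Adam at 1e-5) could be set up roughly as below. This is a hedged PyTorch sketch: `traj_attn_modules` is a hypothetical attribute name for the added modules, not something the paper specifies.

```python
import torch

def build_optimizer(model, lr=1e-5):
    """Freeze all pretrained weights, then return an Adam optimizer over
    only the trajectory attention modules (attribute name is hypothetical)."""
    for p in model.parameters():
        p.requires_grad = False              # freeze the pretrained SVD backbone
    trainable = []
    for module in model.traj_attn_modules:   # hypothetical attribute name
        for p in module.parameters():
            p.requires_grad = True           # fine-tune only these modules
            trainable.append(p)
    return torch.optim.Adam(trainable, lr=lr)
```

Passing only the unfrozen parameters to the optimizer keeps its state small and guarantees the backbone cannot drift even if a gradient were accidentally computed for it.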