Trajectory attention for fine-grained video motion control

Authors: Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | 5 EXPERIMENTS, 5.1 EXPERIMENTAL SETTINGS. Datasets. We use MiraData (Ju et al., 2024) for training, a large-scale video dataset with long-duration videos and structured captions, featuring realistic and dynamic scenes from games or daily life. We sample short video clips and apply Yang et al. (2023a) to extract optical flow as trajectory guidance. In total, we train with 10k video clips. Implementation Details. We conducted our main experiments using SVD (Blattmann et al., 2023), employing the Adam optimizer with a learning rate of 1e-5 per batch size, with mixed precision training of fp16. Metrics. We assessed the conditional generation performance using four distinct metrics: (1) Absolute Trajectory Error (ATE) (Goel et al., 1999), which quantifies the deviation between the estimated and actual trajectories of a camera or robot; (2) Relative Pose Error (RPE) (Goel et al., 1999), which captures the drift in the estimated pose by separately calculating the translation (RPE-T) and rotation (RPE-R) errors; and (3) Fréchet Inception Distance (FID) (Heusel et al., 2017), which evaluates the quality and variability of the generated views.
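The ATE metric quoted above can be sketched in a few lines. This is a minimal NumPy version under stated assumptions: trajectories are given as (N, 2) arrays of 2D positions, and only the centroids are aligned, whereas the full metric typically uses a rigid-body (Umeyama) alignment before computing the RMSE.

```python
import numpy as np

def absolute_trajectory_error(est, gt):
    """Absolute Trajectory Error: RMSE of the positional deviation
    between an estimated and a ground-truth trajectory.

    est, gt: (N, 2) arrays of 2D positions. This sketch aligns only
    the centroids; the full metric uses a rigid-body alignment."""
    est = est - est.mean(axis=0)        # remove translation offset
    gt = gt - gt.mean(axis=0)
    # per-point squared deviation, then root-mean-square
    return float(np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1))))
```

A pure translation between the two trajectories yields zero error, since the centroid alignment removes it; any residual shape difference contributes to the RMSE.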
Researcher Affiliation | Collaboration | Zeqi Xiao¹, Wenqi Ouyang¹, Yifan Zhou¹, Shuai Yang², Lei Yang³, Jianlou Si³, Xingang Pan¹. ¹S-Lab, Nanyang Technological University; ²Wangxuan Institute of Computer Technology, Peking University; ³SenseTime Research.
Pseudocode | Yes | Algorithm 1: Trajectory-based sampling. Input: hidden states Z ∈ R^{F×H×W×C}, where F is the number of frames, H, W are the spatial dimensions, and C is the number of channels; L trajectories Tr ∈ R^{L×F×2}, where each trajectory specifies F 2D locations; trajectory masks M ∈ R^{F×L}, where M_{f,l} ∈ {0, 1} indicates whether trajectory l is valid at frame f. ... Algorithm 2: Back projection. Input: hidden states after attention Z_t ∈ R^{F×L×C}; L trajectories Tr ∈ R^{L×F×2}; trajectory masks M ∈ R^{F×L}. 1: Initialize Z_p ∈ R^{F×H×W×C}, U ∈ R^{F×H×W}; Z_p = 0, U = 0.
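The two algorithms above can be sketched as straightforward gather/scatter operations. A minimal NumPy version, assuming integer trajectory coordinates (the paper's version operates on attention hidden states and may use sub-pixel sampling); the function names are mine, and the accumulator U counts how many trajectories land on each location so the scatter can average:

```python
import numpy as np

def trajectory_sample(Z, Tr, M):
    """Algorithm 1 sketch: gather features along trajectories.
    Z:  (F, H, W, C) hidden states
    Tr: (L, F, 2) integer (x, y) locations per frame
    M:  (F, L) validity mask
    Returns Zt: (F, L, C)"""
    F, H, W, C = Z.shape
    L = Tr.shape[0]
    Zt = np.zeros((F, L, C), dtype=Z.dtype)
    for l in range(L):
        for f in range(F):
            if M[f, l]:
                x, y = Tr[l, f]
                Zt[f, l] = Z[f, y, x]
    return Zt

def back_project(Zt, Tr, M, H, W):
    """Algorithm 2 sketch: scatter trajectory features back onto the
    spatial grid, averaging where multiple trajectories overlap.
    Zt: (F, L, C) hidden states after attention.
    Returns Zp: (F, H, W, C)"""
    F, L, C = Zt.shape
    Zp = np.zeros((F, H, W, C), dtype=Zt.dtype)
    U = np.zeros((F, H, W), dtype=np.int64)   # per-location hit counter
    for l in range(L):
        for f in range(F):
            if M[f, l]:
                x, y = Tr[l, f]
                Zp[f, y, x] += Zt[f, l]
                U[f, y, x] += 1
    hit = U > 0
    Zp[hit] = Zp[hit] / U[hit][:, None]       # average overlapping writes
    return Zp
```

Round-tripping a feature map through `trajectory_sample` and `back_project` reproduces the original features at the trajectory locations and leaves untouched locations at zero.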
Open Source Code | No | Project page at this URL.
Open Datasets | Yes | Datasets. We use MiraData (Ju et al., 2024) for training, a large-scale video dataset with long-duration videos and structured captions, featuring realistic and dynamic scenes from games or daily life.
Dataset Splits | No | Datasets. We use MiraData (Ju et al., 2024) for training, a large-scale video dataset with long-duration videos and structured captions, featuring realistic and dynamic scenes from games or daily life. We sample short video clips and apply Yang et al. (2023a) to extract optical flow as trajectory guidance. In total, we train with 10k video clips.
Hardware Specification | Yes | Our efficient training design allows for approximately 24 GPU hours of training (with a batch size of 1 on a single A100 GPU over the course of one day).
Software Dependencies | No | We conducted our main experiments using SVD (Blattmann et al., 2023), employing the Adam optimizer with a learning rate of 1e-5 per batch size, with mixed precision training of fp16.
Experiment Setup | Yes | We conducted our main experiments using SVD (Blattmann et al., 2023), employing the Adam optimizer with a learning rate of 1e-5 per batch size, with mixed precision training of fp16. We only fine-tune the additional trajectory attention modules, which inherit weights from the temporal modules.
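The fine-tuning recipe quoted above (freeze the pretrained SVD backbone, train only the newly added trajectory attention modules with Adam at 1e-5) could be set up roughly as below. This is a hedged PyTorch sketch: `traj_attn_modules` is a hypothetical attribute name for the added modules, not something the paper specifies.

```python
import torch

def build_optimizer(model, lr=1e-5):
    """Freeze all pretrained weights, then return an Adam optimizer over
    only the trajectory attention modules (attribute name is hypothetical)."""
    for p in model.parameters():
        p.requires_grad = False              # freeze the pretrained SVD backbone
    trainable = []
    for module in model.traj_attn_modules:   # hypothetical attribute name
        for p in module.parameters():
            p.requires_grad = True           # fine-tune only these modules
            trainable.append(p)
    return torch.optim.Adam(trainable, lr=lr)
```

Passing only the unfrozen parameters to the optimizer keeps its state small and guarantees the backbone cannot drift even if a gradient were accidentally computed for it.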