Trajectory attention for fine-grained video motion control
Authors: Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTS 5.1 EXPERIMENTAL SETTINGS Datasets. We use Mira Data (Ju et al., 2024) for training, a large-scale video dataset with long-duration videos and structured captions, featuring realistic and dynamic scenes from games or daily life. We sample short video clips and apply Yang et al. (2023a) to extract optical flow as trajectory guidance. In total, we train with 10k video clips. Implementation Details. We conducted our main experiments using SVD (Blattmann et al., 2023), employing the Adam optimizer with a learning rate of 1e-5 per batch size, with mixed precision training of fp16. Metrics. We assessed the conditional generation performance using four distinct metrics: (1) Absolute Trajectory Error (ATE) (Goel et al., 1999), which quantifies the deviation between the estimated and actual trajectories of a camera or robot; (2) Relative Pose Error (RPE) (Goel et al., 1999), which captures the drift in the estimated pose by separately calculating the translation (RPE-T) and rotation (RPE-R) errors; and (3) Fréchet Inception Distance (FID) (Heusel et al., 2017), which evaluates the quality and variability of the generated views. |
| Researcher Affiliation | Collaboration | Zeqi Xiao (1), Wenqi Ouyang (1), Yifan Zhou (1), Shuai Yang (2), Lei Yang (3), Jianlou Si (3), Xingang Pan (1); (1) S-Lab, Nanyang Technological University; (2) Wangxuan Institute of Computer Technology, Peking University; (3) SenseTime Research |
| Pseudocode | Yes | Algorithm 1: Trajectory-based sampling. Input: Hidden states Z ∈ ℝ^(F×H×W×C), where F is the number of frames, H, W are the spatial dimensions, and C is the number of channels; L trajectories Tr ∈ ℝ^(L×F×2), where each trajectory specifies F 2D locations; trajectory masks M ∈ ℝ^(F×L), where M_{f,l} ∈ {0, 1} indicates whether trajectory l is valid at frame f. ... Algorithm 2: Back projection. Input: Hidden states after attention Z' ∈ ℝ^(F×L×C); L trajectories Tr ∈ ℝ^(L×F×2); trajectory masks M ∈ ℝ^(F×L). 1 Initialize: Zp ∈ ℝ^(F×H×W×C), U ∈ ℝ^(F×H×W), Zp = 0, U = 0 |
| Open Source Code | No | Project page at this URL. |
| Open Datasets | Yes | Datasets. We use Mira Data (Ju et al., 2024) for training, a large-scale video dataset with long-duration videos and structured captions, featuring realistic and dynamic scenes from games or daily life. |
| Dataset Splits | No | Datasets. We use Mira Data (Ju et al., 2024) for training, a large-scale video dataset with long-duration videos and structured captions, featuring realistic and dynamic scenes from games or daily life. We sample short video clips and apply Yang et al. (2023a) to extract optical flow as trajectory guidance. In total, we train with 10k video clips. |
| Hardware Specification | Yes | Our efficient training design allows for approximately 24 GPU hours of training (with a batch size of 1 on a single A100 GPU over the course of one day). |
| Software Dependencies | No | We conducted our main experiments using SVD (Blattmann et al., 2023), employing the Adam optimizer with a learning rate of 1e-5 per batch size, with mixed precision training of fp16. |
| Experiment Setup | Yes | We conducted our main experiments using SVD (Blattmann et al., 2023), employing the Adam optimizer with a learning rate of 1e-5 per batch size, with mixed precision training of fp16. We only fine-tune the additional trajectory attention modules which inherit weights from the temporal modules. |
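The pseudocode quoted in the table (Algorithms 1 and 2) gathers hidden states along trajectories and scatters the attended features back onto the spatial grid. The following NumPy sketch illustrates that gather/scatter pattern under stated assumptions: the quoted excerpt does not specify an interpolation scheme, so nearest-neighbor sampling is used here, and the function names are illustrative, not the paper's.

```python
import numpy as np

def trajectory_sample(Z, Tr, M):
    """Gather hidden states along trajectories (Algorithm 1, sketch).

    Z:  (F, H, W, C) hidden states; Tr: (L, F, 2) per-frame (x, y) locations;
    M:  (F, L) validity mask in {0, 1}.
    Returns Zt: (F, L, C) features sampled along each trajectory.
    """
    F, H, W, C = Z.shape
    L = Tr.shape[0]
    Zt = np.zeros((F, L, C), dtype=Z.dtype)
    for l in range(L):
        for f in range(F):
            if M[f, l] == 0:
                continue  # skip frames where this trajectory is invalid
            x, y = Tr[l, f]
            # nearest-neighbor lookup (interpolation scheme is an assumption)
            xi = int(round(float(np.clip(x, 0, W - 1))))
            yi = int(round(float(np.clip(y, 0, H - 1))))
            Zt[f, l] = Z[f, yi, xi]
    return Zt

def back_project(Zt, Tr, M, H, W):
    """Scatter attended trajectory features back to the grid (Algorithm 2, sketch).

    Zt: (F, L, C) hidden states after attention.
    Returns Zp: (F, H, W, C), averaging where trajectories overlap.
    """
    F, L, C = Zt.shape
    Zp = np.zeros((F, H, W, C), dtype=Zt.dtype)  # accumulated features
    U = np.zeros((F, H, W), dtype=Zt.dtype)      # per-location hit counts
    for l in range(L):
        for f in range(F):
            if M[f, l] == 0:
                continue
            x, y = Tr[l, f]
            xi = int(round(float(np.clip(x, 0, W - 1))))
            yi = int(round(float(np.clip(y, 0, H - 1))))
            Zp[f, yi, xi] += Zt[f, l]
            U[f, yi, xi] += 1
    # divide by counts where visited; untouched locations stay zero
    return Zp / np.maximum(U, 1)[..., None]
```

Sampling followed by back projection on non-overlapping trajectories is an identity on the visited locations, which matches the count-and-normalize initialization (Zp = 0, U = 0) in Algorithm 2.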
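The metrics row cites ATE and RPE for trajectory fidelity. A minimal sketch of both, assuming camera positions are given as (N, 3) arrays; full ATE implementations usually first align the two trajectories (e.g. with the Umeyama method), a step omitted here, and RPE-R (rotation) is likewise left out since it needs rotation matrices not shown in the excerpt.

```python
import numpy as np

def ate_rmse(est, gt):
    """Absolute Trajectory Error: RMSE of per-frame position differences.

    est, gt: (N, 3) camera positions. Trajectory alignment is omitted
    in this sketch, so est and gt are assumed to share a frame.
    """
    return float(np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1))))

def rpe_translation(est, gt, delta=1):
    """Relative Pose Error (translation part): compares relative motions
    over a frame gap `delta`, so constant offsets cancel out."""
    d_est = est[delta:] - est[:-delta]
    d_gt = gt[delta:] - gt[:-delta]
    err = d_est - d_gt
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))
```

Because RPE differences out any constant offset, a trajectory shifted by a fixed translation has nonzero ATE but zero RPE-T, which is why the paper reports both.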
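The setup row states that only the additional trajectory attention modules are fine-tuned, and that they inherit weights from the temporal modules. A PyTorch sketch of that freeze-and-initialize pattern; the `Block` class and attribute names here are illustrative stand-ins, not the paper's actual architecture.

```python
import torch
from torch import nn

class Block(nn.Module):
    """Toy block: a temporal-attention stand-in plus an added
    trajectory-attention module (names are hypothetical)."""
    def __init__(self, dim=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=2, batch_first=True)
        self.traj_attn = nn.MultiheadAttention(dim, num_heads=2, batch_first=True)
        # "inherit weights from the temporal modules": copy them over
        self.traj_attn.load_state_dict(self.temporal_attn.state_dict())

model = Block()
# Freeze everything, then re-enable gradients only for trajectory attention
for p in model.parameters():
    p.requires_grad = False
for p in model.traj_attn.parameters():
    p.requires_grad = True

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

Only the `traj_attn` parameters then receive gradient updates, which is consistent with the reported lightweight training budget (about 24 GPU hours on a single A100).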