SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation
Authors: Koichi Namekata, Sherwin Bahmani, Ziyi Wu, Yash Kant, Igor Gilitschenski, David Lindell
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to show superior performance over zero-shot baselines while significantly narrowing down the performance gap with supervised models in terms of visual quality and motion fidelity. |
| Researcher Affiliation | Academia | 1University of Toronto, 2Vector Institute |
| Pseudocode | No | The paper describes the methodology in text and through diagrams (e.g., Figure 3 provides an overview), but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Additional details and video results are available on our project page: https://kmcode1.github.io/Projects/SG-I2V. |
| Open Datasets | Yes | Following prior works (Wu et al., 2024c; Zhou et al., 2024), we evaluate our method on the validation set of the VIPSeg dataset (Miao et al., 2022). |
| Dataset Splits | Yes | Following prior works (Wu et al., 2024c; Zhou et al., 2024), we evaluate our method on the validation set of the VIPSeg dataset (Miao et al., 2022). We test on the same control regions and target trajectories as Drag Anything, where the size of our bounding boxes is the same as the diameter of the circles in their work. [...] Additionally, we exclude ground truth trajectory points that fall outside the image space due to objects moving out of frame. We also omit short videos with fewer than 14 frames from the evaluation. |
| Hardware Specification | Yes | The runtime depends on the number of trajectory conditions, with an average runtime of 305 seconds on the VIPSeg dataset with A6000 48GB. |
| Software Dependencies | No | The paper mentions the use of a "discrete Euler scheduler" (Karras et al., 2022), "AdamW optimizer" (Loshchilov & Hutter, 2019), "Co-Tracker" (Karaev et al., 2024), and a "Butterworth filter", but does not provide specific version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | In all experiments, we leverage the image-to-video variant of Stable Video Diffusion (Blattmann et al., 2023a) to generate videos with 14 frames at 576×1024 resolution. The default discrete Euler scheduler (Karras et al., 2022) is applied with T = 50 sampling steps. We extract feature maps from the last two self-attention layers of the middle stage in the denoising U-Net. We optimize Eq. (1) at the early denoising timesteps t ∈ [45, 44, ..., 30] for 5 iterations per timestep. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 0.21. [...] During loss calculation, the Gaussian heatmap G_b is constructed following (Wu et al., 2024c), where a heatmap for a bounding box of size (h_b, w_b) is created by a Gaussian distribution with standard deviation σ = (0.2h_b, 0.2w_b). For the low-pass filter H_γ, we set the cut-off frequency γ to 0.5. |
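The two components quoted in the setup row (the Gaussian heatmap over a bounding box, and the frequency-domain low-pass filter H_γ) can be sketched in NumPy. This is a hedged reconstruction from the quoted description only: the function names, the explicit box-center parameters, and the Butterworth order are assumptions, not the authors' released code.

```python
import numpy as np

def gaussian_heatmap(h, w, hb, wb, cy, cx):
    """Heatmap for a bounding box of size (hb, wb) centered at (cy, cx),
    using the paper's stated sigma = (0.2*hb, 0.2*wb). Peak value is 1 at the center."""
    ys = np.arange(h, dtype=float)[:, None]
    xs = np.arange(w, dtype=float)[None, :]
    sy, sx = 0.2 * hb, 0.2 * wb
    return np.exp(-((ys - cy) ** 2 / (2 * sy**2) + (xs - cx) ** 2 / (2 * sx**2)))

def butterworth_lowpass(feat, gamma=0.5, order=4):
    """Apply a Butterworth-style low-pass H_gamma to a 2D map in the Fourier domain.
    gamma = 0.5 matches the quoted cut-off; the filter order is an assumption."""
    h, w = feat.shape
    fy = np.fft.fftfreq(h)[:, None]          # normalized frequencies in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fy**2 + fx**2) / gamma       # radial frequency relative to cut-off
    H = 1.0 / (1.0 + r ** (2 * order))       # Butterworth magnitude response
    return np.real(np.fft.ifft2(np.fft.fft2(feat) * H))
```

For example, `gaussian_heatmap(64, 64, 20, 20, 32, 32)` produces a 64×64 map peaking at 1.0 at pixel (32, 32), and `butterworth_lowpass` leaves the DC component untouched while attenuating frequencies beyond γ.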