Image Conductor: Precision Control for Interactive Video Synthesis

Authors: Yaowei Li, Xintao Wang, Zhaoyang Zhang, Zhouxia Wang, Ziyang Yuan, Liangbin Xie, Ying Shan, Yuexian Zou

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Quantitative and qualitative experiments demonstrate the method's precision and fine-grained control in generating motion-controllable videos from images, advancing the practical application of interactive video synthesis. The paper states: 'Extensive experiments demonstrate the superiority of our method in precise motion control, enabling the generation of videos from images that align with user desires.' Relevant sections: Experiments; Comparisons with State-of-the-Art Methods; Ablation Studies.
Researcher Affiliation | Collaboration | 1) School of Electronic and Computer Engineering, Peking University, Shenzhen, China; 2) ARC Lab, Tencent PCG, Shenzhen, China; 3) Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Peking University Shenzhen Graduate School, Shenzhen, China; 4) Nanyang Technological University, Singapore; 5) Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; 6) University of Macau, Macao SAR; 7) Shenzhen Institute of Advanced Technology, Shenzhen, China
Pseudocode | No | The paper describes methods using prose and mathematical equations (e.g., L_cam, L_obj, L_ortho, and the noise estimate ε̂_{θ0, θ_trajs}(x_t, c)), but it does not include any explicit pseudocode blocks or algorithm listings.
Open Source Code | No | Project page: https://liyaowei-stu.github.io/project/ImageConductor/. The provided URL is a project demonstration page, not an explicit link to a code repository, and the paper does not contain an unambiguous statement of code release.
Open Datasets | Yes | 'We leverage two datasets in our research: the WebVid dataset (Bain et al. 2021), which is a large-scale mixed dataset with textual descriptions, and the RealEstate10K dataset (Zhou et al. 2018), which is a camera-only dataset.'
Dataset Splits | No | The paper describes data filtering and processing steps, e.g., 'We filter out the lowest 25% of video samples' and 'randomly sample a 32-frame sequence... for training'. It also mentions a 'Camera-only Motion Evaluation Dataset' and an 'Object-only Motion Evaluation Dataset' for evaluation. However, it does not provide explicit training/validation/test splits for the primary training datasets; it only describes how data sequences are prepared for training and which sets are used for evaluation.
Hardware Specification | No | The paper states 'Details are in the appendix' regarding implementation, but the main text does not explicitly describe any specific GPU models, CPU models, or other hardware used to run the experiments.
Software Dependencies | No | The paper mentions several models and frameworks, such as AnimateDiff v3 (Guo et al. 2023b), SparseCtrl (Guo et al. 2023a), CoTracker (Karaev et al. 2023), RAFT (Teed and Deng 2020), BLIP-2 (Li et al. 2023), and ControlNet (Zhang, Rao, and Agrawala 2023), but it does not provide version numbers for general ancillary software such as programming languages, libraries (e.g., PyTorch, TensorFlow), or operating systems.
Experiment Setup | Yes | 'We train only the motion ControlNet while keeping the UNet backbone weights frozen. To standardize the dimensions of the training data, we perform center cropping on the previously obtained data, resulting in video frames of size 384 × 256 × 32. We heuristically sample n ∈ [1, 8] trajectories from the dense set, with 8 being the upper limit. The value of n is randomly selected, and the normalized motion intensity of each trajectory is used as the sampling probability. We apply a Gaussian filter to the trajectories.'
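The data-preparation steps quoted under Dataset Splits (drop the lowest-scoring 25% of video samples, then randomly sample a 32-frame sequence from each survivor) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `prepare_clips` and the use of a generic per-video `scores` array are assumptions, since the paper excerpt does not say which quality score drives the filtering.

```python
import numpy as np

def prepare_clips(videos, scores, frame_len=32, rng=None):
    """Keep videos above the 25th-percentile score, then cut a random
    frame_len-frame window from each kept video.

    videos: list of arrays with frames along axis 0 (each len >= frame_len).
    scores: per-video quality scores (hypothetical; metric unspecified in paper).
    """
    rng = np.random.default_rng() if rng is None else rng
    cutoff = np.quantile(scores, 0.25)          # filter out the lowest 25%
    kept = [v for v, s in zip(videos, scores) if s >= cutoff]
    clips = []
    for v in kept:
        start = rng.integers(0, len(v) - frame_len + 1)  # random window start
        clips.append(v[start:start + frame_len])
    return clips
```

Filtering by a quantile rather than a fixed threshold matches the "lowest 25%" phrasing and adapts to whatever score distribution the corpus has.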
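The trajectory-sampling procedure quoted under Experiment Setup (pick a random n ∈ [1, 8], draw trajectories with probability proportional to their normalized motion intensity, then apply a Gaussian filter) can be sketched as below. This is a hedged NumPy sketch, not the authors' implementation: the function name `sample_trajectories`, the (T, F, 2) array layout, and the smoothing bandwidth `sigma` are all assumptions for illustration.

```python
import numpy as np

def sample_trajectories(dense_trajs, intensities, max_n=8, sigma=1.0, rng=None):
    """Sample n in [1, max_n] trajectories, weighted by normalized motion
    intensity, then smooth each with a 1-D Gaussian filter along time.

    dense_trajs: (T, F, 2) array of T trajectories over F frames (x, y coords).
    intensities: (T,) non-negative per-trajectory motion-intensity scores.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = int(rng.integers(1, max_n + 1))              # n chosen at random in [1, max_n]
    p = np.asarray(intensities, float)
    p = p / p.sum()                                  # normalized intensity -> sampling prob
    idx = rng.choice(len(dense_trajs), size=min(n, len(dense_trajs)),
                     replace=False, p=p)
    chosen = dense_trajs[idx].astype(float)

    # Build a truncated 1-D Gaussian kernel and convolve along the frame axis.
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    smoothed = np.empty_like(chosen)
    for i in range(chosen.shape[0]):
        for c in range(2):                           # smooth x and y separately
            padded = np.pad(chosen[i, :, c], radius, mode="edge")
            smoothed[i, :, c] = np.convolve(padded, kernel, mode="valid")
    return smoothed
```

Edge padding before the convolution keeps the smoothed trajectories the same length as the input and avoids pulling endpoints toward zero, which matters when trajectories serve as dense control signals.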