Ctrl-V: Higher Fidelity Autonomous Vehicle Video Generation with Bounding-Box Controlled Object Motion

Authors: Ge Ya Luo, ZhiHao Luo, Anthony Gosselin, Alexia Jolicoeur-Martineau, Christopher Pal

TMLR 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | Extensive experiments conducted on the KITTI, Virtual KITTI 2, BDD100K, and nuScenes datasets validate the effectiveness of our approach in producing realistic and controllable video generation. For quantitative evaluation, we assess the model's performance across four driving datasets on three key aspects: (1) the overall visual quality of the generated results (Section 4.3); (2) the alignment of the predicted bounding-box trajectories with the ground truth (Section 4.2); (3) the fidelity of the generated objects in the video to the bounding-box control signal (Section 4.4).

Researcher Affiliation | Collaboration | Ge Ya Luo (Mila, Université de Montréal); ZhiHao Luo (Mila, Polytechnique Montréal); Anthony Gosselin (Mila, Polytechnique Montréal); Alexia Jolicoeur-Martineau (Samsung SAIT AI Lab, Montreal); Christopher Pal (Mila, Polytechnique Montréal; Canada CIFAR AI Chair)

Pseudocode | No | The paper describes the method using text and diagrams (Figure 1, Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.

Open Source Code | Yes | Project page: https://oooolga.github.io/ctrl-v.github.io/

Open Datasets | Yes | We evaluate the performance of our models across four autonomous-vehicle datasets: KITTI (Geiger et al., 2013), Virtual KITTI 2 (vKITTI) (Cabon et al., 2020), the Berkeley Driving Dataset (BDD) (Yu et al., 2020) with multi-object tracking labels (MOT2020), and the nuScenes dataset (Caesar et al., 2019).

Dataset Splits | No | The paper states, 'To assess video quality, we randomly select 200 initial frames from each dataset's testing set and generate videos.' However, it does not explicitly provide the training, validation, and test splits used for the models, nor does it cite predefined splits that would make the data partitioning reproducible.

Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.

Software Dependencies | No | The paper mentions using Stable Video Diffusion (SVD) models, ControlNet, and YOLOv8 (Reis et al., 2024), but it gives no version numbers for these components or for ancillary software such as programming languages or deep-learning frameworks.

Experiment Setup | No | The paper describes the model architecture and general training strategy, such as using the Euler discrete noise-scheduling method and freezing the SVD weights during ControlNet training. However, it does not report specific numerical hyperparameters (learning rate, batch size, number of epochs, optimizer configuration) in the main text.
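The frozen-backbone strategy noted in the Experiment Setup row (SVD weights frozen, only the ControlNet branch trained) can be sketched in plain PyTorch. This is a minimal toy illustration, not the authors' code: the module names, shapes, and residual-addition conditioning are stand-in assumptions for the real SVD UNet and ControlNet architecture.

```python
import torch
import torch.nn as nn

# Toy stand-ins: "backbone" plays the role of the pretrained SVD UNet,
# "control_branch" the ControlNet-style branch conditioned on bounding boxes.
backbone = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
control_branch = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# Freeze the backbone, as the paper says the SVD weights are frozen.
for p in backbone.parameters():
    p.requires_grad_(False)

# Only the control branch's parameters are optimized.
optimizer = torch.optim.AdamW(control_branch.parameters(), lr=1e-4)

x = torch.randn(4, 8)       # toy noisy latent
cond = torch.randn(4, 8)    # toy bounding-box control signal
target = torch.randn(4, 8)  # toy denoising target

# ControlNet-style residual conditioning: branch output added to backbone output.
pred = backbone(x) + control_branch(cond)
loss = nn.functional.mse_loss(pred, target)
loss.backward()
optimizer.step()

# Gradients reach only the trainable branch; the frozen backbone gets none.
assert all(p.grad is None for p in backbone.parameters())
assert all(p.grad is not None for p in control_branch.parameters())
```

The point of the sketch is the gradient flow: because the backbone's parameters have `requires_grad` disabled, backpropagation through the summed prediction updates only the control branch, which is what makes this fine-tuning recipe cheap relative to retraining the full video model.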