VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing
Authors: Paul Couairon, Clément Rambour, Jean-Emmanuel HAUGEARD, Nicolas THOME
TMLR 2024 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on DAVIS dataset, regarding semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video only takes approximately one minute, and it can generate multiple compatible edits based on a unique text prompt. |
| Researcher Affiliation | Collaboration | Paul Couairon EMAIL Thales SIX GTS France, ThereSIS Lab, Palaiseau, France Sorbonne Université, CNRS, ISIR, F-75005 Paris, France |
| Pseudocode | No | The paper describes the VidEdit framework with figures (Fig. 2, Fig. 3) and detailed textual descriptions of the steps involved in atlas editing and frame reconstruction. However, it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement that the authors' implementation code for VidEdit is being released, nor does it provide a direct link to a code repository for the methodology described. It mentions using and adapting third-party tools like ControlNet and Mask2Former but does not provide its own source code. |
| Open Datasets | Yes | Following Bar-Tal et al. (2022); Wu et al. (2022); Qi et al. (2023), we evaluate our approach on videos from DAVIS dataset (Pont-Tuset et al., 2017) resized at a 768×432 resolution. |
| Dataset Splits | No | The paper mentions using the DAVIS dataset for evaluation but does not explicitly provide specific training, validation, or test split percentages, sample counts, or a detailed methodology for splitting the data for its own experiments. It refers to other works for dataset usage but does not state the precise splits used within this paper. |
| Hardware Specification | Yes | For a single 70-frame video, it takes 15 seconds to edit a 512×512 patch in an atlas and 1 minute to reconstruct the video with the edit layer on a NVIDIA TITAN RTX, a graphics card accessible to the general public. |
| Software Dependencies | No | The paper mentions several software components and models such as the 'ControlNet variant of Stable Diffusion' and 'Mask2Former', as well as 'DDIM sampling' and 'classifier-free guidance'. However, it does not provide specific version numbers for any of these or other key software dependencies. |
| Experiment Setup | Yes | To edit an atlas, we sample pure Gaussian noise (i.e. r = 1) and denoise it for 50 steps with DDIM sampling and classifier-free guidance (Ho & Salimans, 2022). We set the HED strength to 1 by default. On the right panel, we see that both local CLIP score and LPIPS increase with the noising ratio. |
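The experiment setup row describes a standard diffusion sampling recipe: start from pure Gaussian noise, then run 50 deterministic DDIM steps with classifier-free guidance. The sketch below illustrates that loop with a toy NumPy noise predictor standing in for the real Stable Diffusion UNet; the model function, conditioning tensor, alpha-bar schedule, and guidance scale of 7.5 are all illustrative assumptions, not values from the paper.

```python
import numpy as np

def toy_eps_model(x, t, cond):
    # Hypothetical noise predictor standing in for the Stable Diffusion UNet.
    # Returns a noise estimate; conditioning shifts the prediction slightly.
    return 0.1 * x + (0.05 * cond if cond is not None else 0.0)

def ddim_sample(shape, steps=50, guidance_scale=7.5, seed=0):
    """Deterministic DDIM sampling with classifier-free guidance (sketch)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)        # pure Gaussian noise (noising ratio r = 1)
    cond = np.ones(shape)                 # stand-in for the text-prompt embedding
    # Assumed alpha-bar schedule: near 0 (pure noise) up to near 1 (clean image).
    alpha_bars = np.linspace(1e-3, 0.999, steps)
    for i in range(steps - 1):
        a_t, a_next = alpha_bars[i], alpha_bars[i + 1]
        # Classifier-free guidance: extrapolate from the unconditional toward
        # the conditional noise prediction.
        eps_c = toy_eps_model(x, i, cond)
        eps_u = toy_eps_model(x, i, None)
        eps = eps_u + guidance_scale * (eps_c - eps_u)
        # DDIM update with eta = 0 (fully deterministic):
        # first estimate the clean sample, then re-noise to the next level.
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_next) * x0 + np.sqrt(1.0 - a_next) * eps
    return x

sample = ddim_sample((4, 4), steps=50)
```

Because eta = 0, the same seed always yields the same sample, which is why DDIM is commonly paired with inversion-based editing pipelines like the one this paper builds on.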