VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing
Authors: Paul Couairon, Clément Rambour, Jean-Emmanuel HAUGEARD, Nicolas THOME
TMLR 2024 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on DAVIS dataset, regarding semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video only takes approximately one minute, and it can generate multiple compatible edits based on a unique text prompt. |
| Researcher Affiliation | Collaboration | Paul Couairon EMAIL Thales SIX GTS France, ThereSIS Lab, Palaiseau, France Sorbonne Université, CNRS, ISIR, F-75005 Paris, France |
| Pseudocode | No | The paper describes the VidEdit framework with figures (Fig. 2, Fig. 3) and detailed textual descriptions of the steps involved in atlas editing and frame reconstruction. However, it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement that the authors' implementation code for VidEdit is being released, nor does it provide a direct link to a code repository for the methodology described. It mentions using and adapting third-party tools like ControlNet and Mask2Former but does not provide its own source code. |
| Open Datasets | Yes | Following Bar-Tal et al. (2022); Wu et al. (2022); Qi et al. (2023), we evaluate our approach on videos from DAVIS dataset (Pont-Tuset et al., 2017) resized at a 768×432 resolution. |
| Dataset Splits | No | The paper mentions using the DAVIS dataset for evaluation but does not explicitly provide specific training, validation, or test split percentages, sample counts, or a detailed methodology for splitting the data for its own experiments. It refers to other works for dataset usage but does not state the precise splits used within this paper. |
| Hardware Specification | Yes | For a single 70-frame video, it takes 15 seconds to edit a 512×512 patch in an atlas and 1 minute to reconstruct the video with the edit layer on a NVIDIA TITAN RTX, a graphics card accessible to the general public. |
| Software Dependencies | No | The paper mentions several software components and models such as the 'ControlNet variant of Stable Diffusion' and 'Mask2Former', as well as 'DDIM sampling' and 'classifier-free guidance'. However, it does not provide specific version numbers for any of these or other key software dependencies. |
| Experiment Setup | Yes | To edit an atlas, we sample pure Gaussian noise (i.e. r = 1) and denoise it for 50 steps with DDIM sampling and classifier-free guidance (Ho & Salimans, 2022). We set the HED strength to 1 by default. On the right panel, we see that both local CLIP score and LPIPS increase with the noising ratio. |
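The experiment setup row describes a standard diffusion sampling recipe: start from pure Gaussian noise, then run 50 deterministic DDIM steps with classifier-free guidance. The sketch below illustrates that loop with a toy NumPy noise predictor standing in for the real Stable Diffusion UNet; the model function, conditioning tensor, alpha-bar schedule, and guidance scale of 7.5 are all illustrative assumptions, not values from the paper.

```python
import numpy as np

def toy_eps_model(x, t, cond):
    # Hypothetical noise predictor standing in for the Stable Diffusion UNet.
    # Returns a noise estimate; conditioning shifts the prediction slightly.
    return 0.1 * x + (0.05 * cond if cond is not None else 0.0)

def ddim_sample(shape, steps=50, guidance_scale=7.5, seed=0):
    """Deterministic DDIM sampling with classifier-free guidance (sketch)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)        # pure Gaussian noise (noising ratio r = 1)
    cond = np.ones(shape)                 # stand-in for the text-prompt embedding
    # Assumed alpha-bar schedule: near 0 (pure noise) up to near 1 (clean image).
    alpha_bars = np.linspace(1e-3, 0.999, steps)
    for i in range(steps - 1):
        a_t, a_next = alpha_bars[i], alpha_bars[i + 1]
        # Classifier-free guidance: extrapolate from the unconditional toward
        # the conditional noise prediction.
        eps_c = toy_eps_model(x, i, cond)
        eps_u = toy_eps_model(x, i, None)
        eps = eps_u + guidance_scale * (eps_c - eps_u)
        # DDIM update with eta = 0 (fully deterministic):
        # first estimate the clean sample, then re-noise to the next level.
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_next) * x0 + np.sqrt(1.0 - a_next) * eps
    return x

sample = ddim_sample((4, 4), steps=50)
```

Because eta = 0, the same seed always yields the same sample, which is why DDIM is commonly paired with inversion-based editing pipelines like the one this paper builds on.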