LightningDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos

Authors: Yujun Shi, Jun Hao Liew, Hanshu Yan, Vincent Tan, Jiashi Feng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations confirm the superiority of our approach. The code and model are available at https://github.com/magicresearch/LightningDrag. 5. Experiments. 5.2. Evaluation on DragBench: We provide a quantitative assessment of our method on DragBench (Shi et al., 2023), comprising 205 samples with pre-defined drag points and masks. As is standard (Shi et al., 2023; Ling et al., 2023; Cui et al., 2024; Liu et al., 2024), we use the Image Fidelity (IF) and Mean Distance (MD) metrics for our analysis. Table 3: Quantitative comparison on DragBench. IF and MD denote Image Fidelity (1-LPIPS) and Mean Distance, respectively. Table 4: Time efficiency. The reported time cost is obtained by running inference on 512x512 images sampled from DragBench (Shi et al., 2023) on a single NVIDIA A100 GPU. 5.4. Qualitative Results: Comparisons with Prior Methods. We compare our LightningDrag with prior methods in Fig. 7.
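The Mean Distance (MD) metric quoted above measures how closely the edited handle points land on their targets. A minimal sketch of the distance computation, assuming the post-edit positions of the handle points have already been tracked (DragBench's full protocol locates them with feature matching, which is out of scope here):

```python
# Hedged sketch of the Mean Distance (MD) metric: the average Euclidean
# distance between where each handle point ends up after editing and its
# user-specified target position. `tracked` is assumed to be given by an
# upstream point-tracking step; this is not the authors' exact implementation.
import numpy as np

def mean_distance(tracked: np.ndarray, targets: np.ndarray) -> float:
    """tracked, targets: (P, 2) arrays of (x, y) pixel coordinates."""
    return float(np.linalg.norm(tracked - targets, axis=1).mean())

# Example: two handle points, off by 3 and 4 pixels respectively.
md = mean_distance(np.array([[3.0, 0.0], [0.0, 4.0]]),
                   np.array([[0.0, 0.0], [0.0, 0.0]]))  # → 3.5
```

Lower MD indicates more precise dragging; IF (1-LPIPS) complements it by checking that unmasked regions stay perceptually unchanged.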
Researcher Affiliation | Collaboration | Yujun Shi* 1, Jun Hao Liew* 2, Hanshu Yan 2, Vincent Y. F. Tan 1, Jiashi Feng 2. *Equal contribution. 1National University of Singapore, 2ByteDance Inc. Correspondence to: Yujun Shi <EMAIL>, Jun Hao Liew <EMAIL>, Jiashi Feng <EMAIL>.
Pseudocode | No | The paper describes the methodology using textual explanations, mathematical equations (e.g., Eqns. 3, 4, 5), and figures, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code and model are available at https://github.com/magicresearch/LightningDrag.
Open Datasets | Yes | We provide a quantitative assessment of our method on DragBench (Shi et al., 2023), comprising 205 samples with pre-defined drag points and masks.
Dataset Splits | No | We sample 220k training samples from our internal video dataset to train our model. We set the learning rate to 5e-5 with a batch size of 256. We freeze both the inpainting U-Net and IP-Adapter, training both the Appearance Encoder and the Point Embedding Network. During training, we randomly sample [1, 20] point pairs. We randomly crop a square patch covering the sampled points and resize to 512x512. For evaluation, the paper states: "We provide a quantitative assessment of our method on DragBench (Shi et al., 2023), comprising 205 samples with pre-defined drag points and masks." The paper reports the total number of training samples and evaluation samples but does not provide specific train/validation/test splits for the internal video dataset, nor does it describe a splitting methodology for it.
Hardware Specification | Yes | We report the time cost on an NVIDIA A100 GPU in Tab. 4.
Software Dependencies | No | The base inpainting U-Net inherits the pretrained weights from the Stable Diffusion V1.5 inpainting model, whereas the Appearance Encoder is initialized from the pre-trained weights of Stable Diffusion V1.5. The Point Embedding Network is randomly initialized, except for the last convolution layer, which is zero-initialized (Zhang et al., 2023) to ensure the model starts training as if no modification has been made. While the paper mentions models and methods like Stable Diffusion V1.5, IP-Adapter, DDIM, LCM-LoRA, and PeRFlow, it does not provide specific version numbers for programming languages (e.g., Python) or libraries (e.g., PyTorch, TensorFlow, CUDA) that would be needed to reproduce the experimental environment.
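The zero-initialization trick quoted above (Zhang et al., 2023, popularized by ControlNet) can be sketched as follows. Zeroing the weights and bias of the final convolution makes the new branch output all zeros at the start of training, so the pretrained backbone initially behaves as if unmodified. The layer dimensions here are illustrative, not the paper's actual Point Embedding Network architecture:

```python
# Hedged PyTorch sketch of zero-initializing a branch's last convolution so
# that the branch is a no-op at the start of training. Channel sizes are
# placeholders chosen for illustration only.
import torch
import torch.nn as nn

def zero_init(conv: nn.Conv2d) -> nn.Conv2d:
    """Zero the layer's weights (and bias, if any) in place."""
    nn.init.zeros_(conv.weight)
    if conv.bias is not None:
        nn.init.zeros_(conv.bias)
    return conv

last_conv = zero_init(nn.Conv2d(64, 320, kernel_size=3, padding=1))
x = torch.randn(1, 64, 32, 32)
out = last_conv(x)
assert torch.all(out == 0)  # the branch contributes nothing initially
```

Because gradients still flow through the zeroed layer, the branch learns a useful contribution over training while never disrupting the frozen backbone's behavior at step zero.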
Experiment Setup | Yes | Training: We sample 220k training samples from our internal video dataset to train our model. We set the learning rate to 5e-5 with a batch size of 256. We freeze both the inpainting U-Net and IP-Adapter, training both the Appearance Encoder and the Point Embedding Network. During training, we randomly sample [1, 20] point pairs. We randomly crop a square patch covering the sampled points and resize to 512x512. Inference: We use DDIM (Song et al., 2021) sampling with 25 steps for inference by default. We found that our model is also compatible with recent diffusion acceleration techniques such as LCM-LoRA (Luo et al., 2023b) and PeRFlow (Yan et al., 2024) without additional training. When using LCM-LoRA or PeRFlow, we use 8 steps for sampling. We use a guidance scale ωmax of 3.0 and adopt an inverse square decay (Sec. 4.3.2) that gradually reduces the guidance scale to 1.0 over time to prevent over-saturation issues.
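The quoted inference setup decays the guidance scale from ωmax = 3.0 toward 1.0 across the sampling steps. The paper names the schedule "inverse square decay" (Sec. 4.3.2) but the exact closed form is not reproduced in this report, so the 1/(i+1)² shape below is an illustrative guess at such a schedule, not the authors' exact formula:

```python
# Hedged sketch of a guidance-scale schedule decaying from omega_max to
# omega_min over the sampling steps. The specific 1/(i+1)^2 form is an
# assumption; only the endpoints (3.0 -> ~1.0) and the "inverse square"
# description come from the paper.
def guidance_schedule(num_steps: int = 25, omega_max: float = 3.0,
                      omega_min: float = 1.0) -> list:
    return [omega_min + (omega_max - omega_min) / (i + 1) ** 2
            for i in range(num_steps)]

scales = guidance_schedule()
# The first step uses the full guidance scale; later steps approach 1.0,
# which (per the paper) helps prevent over-saturation in the output.
assert scales[0] == 3.0 and scales[-1] < 1.01
```

In classifier-free guidance, a scale of 1.0 corresponds to no extra guidance, so tapering to 1.0 lets early steps follow the drag conditioning strongly while late steps refine appearance without over-saturating colors.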