Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
Authors: Yoad Tewel, Rinon Gal, Dvir Samuel, Yuval Atzmon, Lior Wolf, Gal Chechik
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Without task-specific fine-tuning, Add-it achieves state-of-the-art results on both real and generated image insertion benchmarks, including our newly constructed Additing Affordance Benchmark for evaluating object placement plausibility, outperforming supervised methods. Human evaluations show that Add-it is preferred in over 80% of cases, and it also demonstrates improvements in various automated metrics. ... Section 4: Experiments |
| Researcher Affiliation | Collaboration | Yoad Tewel NVIDIA, Tel-Aviv University Rinon Gal NVIDIA, Tel-Aviv University Dvir Samuel Bar-Ilan University Yuval Atzmon NVIDIA Lior Wolf Tel-Aviv University Gal Chechik NVIDIA |
| Pseudocode | No | The paper describes the methodology in prose and mathematical equations within Section 3 ('Our Method') and its subsections, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | REPRODUCIBILITY STATEMENT: We will open-source all the code upon publication of the paper. |
| Open Datasets | Yes | We introduce the Additing Affordance Benchmark, where we manually annotate suitable areas for object insertion in images and propose a new protocol specifically designed to evaluate the plausibility of object placement. ... We also evaluate our method on an existing benchmark (Sheynin et al., 2023) with real images, as well as our newly proposed Additing Benchmark for generated images. ... We provide the proposed Additing Benchmark and Additing Affordance Benchmark in the supplementary material of our submission. |
| Dataset Splits | No | The paper mentions using a "subset of Emu Edit's (Sheynin et al., 2023) validation set" and constructing new benchmarks (the Additing Benchmark and the Additing Affordance Benchmark, each comprising 100 or 200 sets/images), but it does not specify explicit train/test/validation splits (percentages or sample counts) for the experiments conducted by the authors. |
| Hardware Specification | No | The paper does not explicitly state the hardware specifications (GPU/CPU models, memory, etc.) used for running the experiments. It mentions using 'FLUX.1-dev model' but not the underlying hardware. |
| Software Dependencies | No | The paper mentions using 'diffusers implementation of the FLUX.1-dev model', 'SAM-2 (Ravi et al., 2024)', and 'Grounding-DINO (Liu et al., 2023)'. However, it does not provide specific version numbers for general software libraries, frameworks (like PyTorch or TensorFlow), or programming languages (like Python) that would be needed for replication. |
| Experiment Setup | Yes | When evaluating Add-it, we use tstruct = 933 for generated images and tstruct = 867 for real images and tblend = 500. For the scaling factor γ, we use the root-finding solver described in section 3.2 on a set of validation images and set γ to 1.05, as it is close to the average result and performs well in practice. We generate the images with 30 denoising steps, building upon the diffusers implementation of the FLUX.1-dev model. We apply the extended attention mechanism until step t = 670 in the multi-stream blocks, and step t = 340 for the single-stream blocks. |
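The hyperparameters quoted above can be gathered into a single configuration for a replication attempt. This is a minimal sketch: the `ADD_IT_CONFIG` name, dict layout, and helper function are our own illustration, not part of the paper's (unreleased) code.

```python
# Hyperparameters reported in the paper's experiment setup (values quoted above).
# The dict name and structure are illustrative; the official code is unreleased.
ADD_IT_CONFIG = {
    "t_struct_generated": 933,      # structure-transfer timestep for generated images
    "t_struct_real": 867,           # structure-transfer timestep for real images
    "t_blend": 500,                 # latent blending timestep
    "gamma": 1.05,                  # attention scaling factor (set via root-finding solver)
    "num_inference_steps": 30,      # denoising steps, diffusers FLUX.1-dev backbone
    "t_extended_attn_multi": 670,   # extended-attention cutoff, multi-stream blocks
    "t_extended_attn_single": 340,  # extended-attention cutoff, single-stream blocks
}


def t_struct(is_real_image: bool) -> int:
    """Select the structure-transfer timestep by image source, as reported."""
    key = "t_struct_real" if is_real_image else "t_struct_generated"
    return ADD_IT_CONFIG[key]
```

Note that the two extended-attention cutoffs differ because FLUX.1-dev has separate multi-stream and single-stream transformer blocks; any replication would need to thread these timesteps into the respective attention processors.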