How Many Images Does It Take? Estimating Imitation Thresholds in Text-to-Image Models

Authors: Sahil Verma, Royi Rassin, Arnav Mohanty Das, Gantavya Bhatt, Preethi Seshadri, Chirag Shah, Jeff Bilmes, Hannaneh Hajishirzi, Yanai Elazar

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work we estimate the point at which a model was trained on enough instances of a concept to be able to imitate it – the imitation threshold. We posit this question as a new problem and propose an efficient approach that estimates the imitation threshold without incurring the colossal cost of training these models from scratch. We experiment with two domains – human faces and art styles – and evaluate four text-to-image models that were trained on three pretraining datasets. We estimate the imitation threshold of these models to be in the range of 200-700 images, depending on the domain and the model.
Researcher Affiliation | Collaboration | 1University of Washington, Seattle; 2Bar-Ilan University; 3University of California, Irvine; 4Allen Institute for AI
Pseudocode | Yes | Algorithm 1: Collection of Reference Images for Human Face Imitation
Open Source Code | Yes | Code: https://github.com/vsahil/MIMETIC-2. Website: https://how-many-van-goghs-does-it-take.github.io/. We have published all the code used for this work at https://github.com/vsahil/MIMETIC-2.git and included a README.md file that provides instructions to execute each step of the algorithm.
Open Datasets | Yes | We use Stable Diffusion (SD) as the text-to-image models (Rombach et al., 2022). We use them because both the models and their training datasets are open-sourced. Specifically, we use SD1.1 and SD1.5, which were trained on LAION2B-en, a dataset of 2.3 billion image-caption pairs filtered to contain only English captions, and SD2.1, which was trained on LAION-5B, a dataset of 5.85 billion image-text pairs that includes captions in any language (Schuhmann et al., 2022). Finally, we also use a latent diffusion model (LDM) trained on LAION-400M (Schuhmann et al., 2021), a dataset of 400M image-text pairs. We use the images from the Wikiarts dataset (Saleh & Elgammal, 2016) as the (positive) art images and MS COCO dataset images (Lin et al., 2014) as the (negative) non-art images.
Dataset Splits | Yes | We generate 200 images per concept using different random seeds for each prompt, a total of 1,000 images per concept. We finetune SD1.4 on an increasing number of images of these politicians, starting with 50 images (far below the imitation threshold found in the non-finetuning setup) and going up to 800 images (above the imitation threshold found in the non-finetuning setup). During the finetuning process, the images of the concepts were mixed with 10,000 other images taken randomly from the LAION-2B-en dataset. This was done to ensure that our finetuning setup closely resembles the original training setup of SD1.4, where images of a concept would be naturally interspersed with other images in the dataset. We finetune on the full dataset of 10,000 images for 1 epoch with a learning rate of 5e-5.
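The finetuning mixture described in this row — concept images interspersed with a random background sample from the pretraining pool — can be sketched as follows. This is an illustrative reconstruction, not the paper's actual pipeline (which is in the linked repository); the file names and pool sizes here are placeholders.

```python
import random

def build_finetune_mixture(concept_images, laion_pool,
                           n_concept, n_background=10_000, seed=0):
    """Mix n_concept images of the target concept with a random background
    sample from the pretraining pool, so the concept's images are naturally
    interspersed, as they would be in LAION-2B-en."""
    rng = random.Random(seed)
    concept_subset = rng.sample(concept_images, n_concept)
    background = rng.sample(laion_pool, n_background)
    mixture = concept_subset + background
    rng.shuffle(mixture)  # interleave concept and background images
    return mixture

# Illustrative usage with placeholder file names:
concept = [f"politician_{i}.jpg" for i in range(800)]
laion = [f"laion_{i}.jpg" for i in range(50_000)]
data = build_finetune_mixture(concept, laion, n_concept=50)
print(len(data))  # → 10050
```

Repeating this with n_concept ranging from 50 to 800 reproduces the sweep the paper describes for locating the imitation threshold under finetuning.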
Hardware Specification | Yes | We use 8 L40 GPUs to generate images for all the text-to-image models in our work.
Software Dependencies | No | The paper does not explicitly state specific version numbers for software dependencies such as Python, PyTorch, CUDA, or other libraries. It mentions using 'CLIP ViT-H/14' but without a version number for CLIP or the model itself.
Experiment Setup | Yes | We generate images for each domain by prompting models with five prompts (Table 2). We design domain-specific prompts that encourage the desired concept to occupy a large part of the generated image, which simplifies the imitation score measurement. We also ensure that these prompts are distinct from the captions used in the pretraining dataset to minimize direct reproduction of training images (as noted by Somepalli et al. (2023b)). We generate 200 images per concept using different random seeds for each prompt, a total of 1,000 images per concept. We use a face embedding model (Deng et al., 2022) for measuring face similarity and an art style embedding model (Somepalli et al., 2024) for measuring art style similarity. We finetune on the full dataset of 10,000 images for 1 epoch with a learning rate of 5e-5.
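The similarity measurement this row describes — comparing generated images to reference images in a domain-specific embedding space — can be sketched with plain arrays. This is a hedged reconstruction: the real embeddings come from the cited face and art-style models, and the paper's exact aggregation may differ from the per-image max similarity used here; random vectors stand in for the embeddings.

```python
import numpy as np

def imitation_score(gen_embeds, ref_embeds):
    """Mean, over generated images, of each image's maximum cosine
    similarity to any reference image. Inputs are (n, d) arrays of
    embeddings from a domain-specific encoder (face or art style)."""
    gen = gen_embeds / np.linalg.norm(gen_embeds, axis=1, keepdims=True)
    ref = ref_embeds / np.linalg.norm(ref_embeds, axis=1, keepdims=True)
    sims = gen @ ref.T  # (n_gen, n_ref) matrix of cosine similarities
    return float(sims.max(axis=1).mean())

rng = np.random.default_rng(0)
gen = rng.normal(size=(1000, 512))  # 1,000 generated images per concept
ref = rng.normal(size=(200, 512))   # reference images of the concept
score = imitation_score(gen, ref)
print(round(score, 3))
```

Plotting this score against a concept's frequency in the pretraining data is what lets the paper read off the imitation threshold without retraining any model.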