DIS-CO: Discovering Copyrighted Content in VLMs Training Data

Authors: André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model's training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models with logits available. We conduct experiments on two benchmarks, MovieTection (our newly introduced dataset) and VL-MIA/Flickr (Li et al., 2024b).
Researcher Affiliation | Academia | 1Carnegie Mellon University, 2INESC-ID / Instituto Superior Técnico, ULisboa, 3UC Berkeley. Correspondence to: André V. Duarte <EMAIL>, Xuandong Zhao <EMAIL>, Arlindo L. Oliveira <EMAIL>, Lei Li <EMAIL>.
Pseudocode | No | The paper describes the methodology using textual explanations and diagrams (e.g., Figure 2 illustrating the pipeline), but no explicitly labeled 'Pseudocode' or 'Algorithm' block with structured, code-like formatting is present.
Open Source Code | Yes | Our code and data are available at https://github.com/avduarte333/DIS-CO
Open Datasets | Yes | Our code and data are available at https://github.com/avduarte333/DIS-CO. We conduct experiments on two benchmarks, MovieTection (our newly introduced dataset) and VL-MIA/Flickr (Li et al., 2024b). MovieTection contains 14,000 diverse movie frames paired with descriptive captions... Member images are sourced from a subset of COCO (Lin et al., 2014)
Dataset Splits | Yes | MovieTection contains 14,000 diverse movie frames paired with descriptive captions, split chronologically based on films released before/after the models' training cutoff (October 2023). VL-MIA/Flickr, derived from COCO (Lin et al., 2014) (member data) and recent Flickr images (non-member data), serves as a proof-of-validity dataset for DIS-CO. For each movie, we extract frames categorized into two types: main frames and neutral frames. In total, 140 frames are extracted per movie, comprising 100 main frames and 40 neutral ones. VL-MIA/Flickr... comprises 600 images divided evenly into member and non-member categories.
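The counts quoted above are internally consistent; a quick sanity check (the 100-movie total is derived from the stated figures, not quoted directly from the paper):

```python
# Sanity-check the dataset sizes stated in the report.
MAIN_FRAMES_PER_MOVIE = 100
NEUTRAL_FRAMES_PER_MOVIE = 40
TOTAL_FRAMES = 14_000

frames_per_movie = MAIN_FRAMES_PER_MOVIE + NEUTRAL_FRAMES_PER_MOVIE  # 140
num_movies = TOTAL_FRAMES // frames_per_movie  # implied movie count

# VL-MIA/Flickr: 600 images split evenly into member / non-member
vl_mia_total = 600
members = non_members = vl_mia_total // 2

print(num_movies, members, non_members)  # 100 300 300
```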
Hardware Specification | Yes | Most experiments with white-box models are conducted on a computing cluster equipped with four NVIDIA A100 80GB GPUs, allowing their efficient execution without requiring model quantization.
Software Dependencies | Yes | We utilize a diverse set of models, including GPT-4o (OpenAI, 2024), Gemini-1.5 Pro (Reid et al., 2024), LLaMA-3.2 (Dubey et al., 2024), Qwen2-VL (Wang et al., 2024), LLaVA-v1.5 (Liu et al., 2023), and Pixtral (Agrawal et al., 2024). Fine-tuning is performed using the Qwen2-VL 7B model, leveraging Low-Rank Adaptation (LoRA) as implemented in the LLaMA-Factory framework (Zheng et al., 2024b).
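The paper does not publish its exact fine-tuning configuration; as an illustration only, a LoRA run of this kind in LLaMA-Factory is driven by a YAML file along these lines. Every value below (rank, template, dataset name, learning rate, paths) is an assumption, not taken from the paper:

```yaml
# Hypothetical LLaMA-Factory-style LoRA config for Qwen2-VL 7B.
# All hyperparameter values are illustrative assumptions.
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
dataset: movietection_captions   # placeholder dataset name
template: qwen2_vl
output_dir: saves/qwen2_vl-7b/lora/sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 1.0
bf16: true
```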
Experiment Setup | Yes | When generating detailed captions for the frames, our model requires a certain level of creativity while staying truthful to the image content; therefore, we set the temperature=0.1 to achieve this. For evaluation, we aim for complete determinism, so the temperature parameter is fixed at 0. The number of training epochs is adjusted proportionally to the percentage of frames used, ensuring consistent exposure to the dataset. For instance, when training with the entire dataset (100%), we perform one epoch, whereas using half the dataset (50%) involves training for two epochs, effectively maintaining equivalent frame coverage across configurations.
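The quoted epoch-scaling rule (constant total frame exposure) can be sketched as follows. The function name and the rounding choice are assumptions; the paper only states the 100%→1 and 50%→2 examples:

```python
def epochs_for_fraction(fraction: float, base_epochs: int = 1) -> int:
    """Scale epochs inversely with the fraction of frames used,
    keeping total frame exposure constant (100% -> 1 epoch, 50% -> 2)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return round(base_epochs / fraction)

print(epochs_for_fraction(1.0))   # 1
print(epochs_for_fraction(0.5))   # 2
print(epochs_for_fraction(0.25))  # 4
```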