DIS-CO: Discovering Copyrighted Content in VLMs Training Data
Authors: André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model's training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models with logits available. We conduct experiments on two benchmarks, MovieTection (our newly introduced dataset) and VL-MIA/Flickr (Li et al., 2024b). |
| Researcher Affiliation | Academia | 1Carnegie Mellon University 2INESC-ID / Instituto Superior Técnico, ULisboa 3UC Berkeley. Correspondence to: André V. Duarte <EMAIL>, Xuandong Zhao <EMAIL>, Arlindo L. Oliveira <EMAIL>, Lei Li <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using textual explanations and diagrams (e.g., Figure 2 illustrating the pipeline), but no explicitly labeled 'Pseudocode' or 'Algorithm' block with structured, code-like formatting is present. |
| Open Source Code | Yes | Our code and data are available at https://github.com/avduarte333/DIS-CO |
| Open Datasets | Yes | Our code and data are available at https://github.com/avduarte333/DIS-CO. We conduct experiments on two benchmarks, MovieTection (our newly introduced dataset) and VL-MIA/Flickr (Li et al., 2024b). MovieTection contains 14,000 diverse movie frames paired with descriptive captions... Member images are sourced from a subset of COCO (Lin et al., 2014) |
| Dataset Splits | Yes | MovieTection contains 14,000 diverse movie frames paired with descriptive captions, split chronologically based on films released before/after the model's training cutoff (October 2023). VL-MIA/Flickr, derived from COCO (Lin et al., 2014) (member data) and recent Flickr images (non-member data), serves as a proof-of-validity dataset for DIS-CO. For each movie, we extract frames categorized into two types: main frames and neutral frames. In total, 140 frames are extracted per movie, comprising 100 main frames and 40 neutral ones. VL-MIA/Flickr... comprises 600 images divided evenly into member and non-member categories. |
| Hardware Specification | Yes | Most experiments with white-box models are conducted on a computing cluster equipped with four NVIDIA A100 80GB GPUs, allowing their efficient execution without requiring model quantization. |
| Software Dependencies | Yes | We utilize a diverse set of models, including GPT-4o (OpenAI, 2024), Gemini-1.5 Pro (Reid et al., 2024), LLaMA-3.2 (Dubey et al., 2024), Qwen2-VL (Wang et al., 2024), LLaVA-v1.5 (Liu et al., 2023), and Pixtral (Agrawal et al., 2024). Fine-tuning is performed using the Qwen2-VL 7B model, leveraging Low-Rank Adaptation (LoRA) as implemented in the LLaMA-Factory framework (Zheng et al., 2024b). |
| Experiment Setup | Yes | When generating detailed captions for the frames, our model requires a certain level of creativity while staying truthful to the image content; therefore, we set temperature=0.1 to achieve this. For evaluation, we aim for complete determinism, so the temperature parameter is fixed at 0. The number of training epochs is adjusted in inverse proportion to the percentage of frames used, ensuring consistent exposure to the dataset. For instance, when training with the entire dataset (100%), we perform one epoch, whereas using half the dataset (50%) involves training for two epochs, effectively maintaining equivalent frame coverage across configurations. |
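The epoch-scaling rule quoted in the Experiment Setup row can be sketched in a few lines. This is a minimal illustration of the stated proportionality only; `epochs_for_fraction` is a hypothetical helper, not code from the DIS-CO repository:

```python
# Hypothetical sketch of the epoch-scaling rule: epochs are chosen inversely
# proportional to the fraction of MovieTection frames used for fine-tuning,
# so total frame exposure stays roughly constant across configurations.

def epochs_for_fraction(frame_fraction: float, base_epochs: int = 1) -> int:
    """Return the epoch count that keeps total frame exposure constant."""
    if not 0.0 < frame_fraction <= 1.0:
        raise ValueError("frame_fraction must be in (0, 1]")
    return round(base_epochs / frame_fraction)

# 100% of frames -> 1 epoch; 50% -> 2 epochs; 25% -> 4 epochs
print([epochs_for_fraction(f) for f in (1.0, 0.5, 0.25)])
```

This reproduces the paper's stated example (100% of the data for one epoch, 50% for two epochs) and extrapolates the same rule to smaller fractions.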