Captured by Captions: On Memorization and its Mitigation in CLIP Models
Authors: Wenhao Wang, Adam Dziedzic, Grace Kim, Michael Backes, Franziska Boenisch
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our empirical study of memorization in CLIP using CLIPMem, we uncover several key findings. |
| Researcher Affiliation | Academia | CISPA; Georgia Institute of Technology |
| Pseudocode | No | The paper describes methods and formulas (e.g., Lalign(f, x) in Equation 1, CLIPMem in Equation 4) but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'We build our experiments on Open CLIP (Cherti et al., 2023), an open-source Python version of Open-CLIP (Ilharco et al., 2021)', which refers to the codebase the authors built on, not to a release of their own code. There is no explicit statement or link indicating that the authors released the source code for their specific methodology. |
| Open Datasets | Yes | Datasets. We use COCO (Lin et al., 2014), CC3M (Sharma et al., 2018), and the YFCC100M (Thomee et al., 2016a) datasets to pre-train the Open CLIP models. |
| Dataset Splits | Yes | Concretely, for COCO and CC3M, we set |S_S| = 65000 and |S_C| = |S_I| = |S_E| = 5000. |
| Hardware Specification | Yes | All the experiments in the paper are done on a server with 4 A100 (80 GB) GPUs and a workstation with one RTX 4090 GPU (24 GB). |
| Software Dependencies | No | The paper mentions building experiments on 'Open CLIP (Cherti et al., 2023), an open-source Python version of Open-CLIP (Ilharco et al., 2021)' and using 'GPT-3.5-turbo' for caption generation. However, it does not provide specific version numbers for key software libraries or programming languages (e.g., Python, PyTorch/TensorFlow versions) that would be needed for reproducible setup. |
| Experiment Setup | Yes | Since COCO is much smaller than Open CLIP's standard training datasets, we reduce the training batch size to 128 and increase the epoch number from 32 to 100 to achieve similar performance. All other settings strictly follow Open CLIP. For training DINO, as an example of an SSL vision encoder, we follow the default setting of Caron et al. (2021). The supervised model is trained as a multi-label classifier, also based on ViT-Base (with an additional fully connected layer), using the first-level annotation captions in the COCO dataset. A full specification of our experimental setup is detailed in Appendix A.2. Additional experiments for measuring memorization on the BLIP (Li et al., 2022) model are presented in Appendix A.6. |
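The split sizes and modified training hyperparameters reported above can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the authors' released code (they released none): `make_splits` is an illustrative helper for partitioning example indices into the disjoint subsets S_S, S_C, S_I, and S_E with the sizes reported for COCO/CC3M, and `train_cmd` shows how the stated batch size (128) and epoch count (100, up from the default 32) might be passed to OpenCLIP's training script; the flag names follow OpenCLIP's `training.main` CLI, and all paths are placeholders.

```python
import random

# Split sizes reported for COCO and CC3M: |S_S| = 65000, |S_C| = |S_I| = |S_E| = 5000.
SPLIT_SIZES = {"S_S": 65_000, "S_C": 5_000, "S_I": 5_000, "S_E": 5_000}


def make_splits(n_examples: int, sizes: dict, seed: int = 0) -> dict:
    """Partition example indices into disjoint subsets of the given sizes.

    Hypothetical helper for illustration; the paper does not specify how
    the subsets were sampled.
    """
    assert sum(sizes.values()) <= n_examples, "not enough examples to split"
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = indices[start:start + size]
        start += size
    return splits


# A training invocation as it might look for OpenCLIP's training script,
# with the paper's reduced batch size and increased epoch count.
# Model name and data path are placeholder assumptions.
train_cmd = [
    "python", "-m", "training.main",
    "--model", "ViT-B-32",
    "--batch-size", "128",   # reduced from OpenCLIP defaults for the smaller COCO set
    "--epochs", "100",       # increased from the default 32 to reach similar performance
    "--train-data", "/path/to/coco_shards.tar",
]

if __name__ == "__main__":
    splits = make_splits(80_000, SPLIT_SIZES)
    print({name: len(idx) for name, idx in splits.items()})
```

Running the sketch confirms the subsets are disjoint and match the reported sizes; everything beyond those two numbers (batch size, epochs) is an assumption for illustration.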