Zero-shot CLIP Class Forgetting via Text-image Space Adaptation
Authors: Alexey Kravets, Vinay P. Namboodiri
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We do a performance comparison in Tab. 1 showing that our method both outperforms the previous methods and is more robust to different visual encoders, achieving perfect class forgetting with ViT (Dosovitskiy et al., 2020) and ResNet (He et al., 2015). We analyse through ablations the importance of the retain and forget loss components in Section 7.2, and how the placement of the forget-class projection in the image-text space affects the forgetting ability of the model in Section 7.5. We find that retaining the knowledge of non-forget classes requires the inclusion of semantically similar classes, which can be generated using a large language model (LLM). This is because projecting the forget class to a different space primarily affects the closest classes in the image-text embedding space; thus, it is important to preserve this part of the space, while non-semantically similar classes are retained without explicit inclusion. We conduct a thorough ablation analysis on how the number of semantically similar classes affects performance in Section 7.3. Additionally, in Section 7.4 we assess how including semantically different classes affects performance. |
| Researcher Affiliation | Academia | Alexey Kravets (EMAIL), Department of Computer Science, University of Bath; Vinay P. Namboodiri (EMAIL), Department of Computer Science, University of Bath |
| Pseudocode | No | The paper describes the methodology using mathematical equations and textual descriptions, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The abstract states: "Full implementation can be found here." However, a direct link or specific reference to supplementary material is not provided in the parsed text. |
| Open Datasets | Yes | Following (Kravets & Namboodiri, 2025) we evaluate CLIP's forgetting capabilities on four high-quality, fine-grained datasets: Caltech101 (Fei-Fei et al., 2007) contains images from 101 distinct categories, each representing various objects or scenes. Stanford Cars (Krause et al., 2013) contains images of cars of different makes and models. Oxford Flowers (Nilsback & Zisserman, 2008) includes images of flowers of 102 different classes. Stanford Dogs (Khosla et al., 2011) comprises 120 classes of dogs of different species. We use the Pins Faces (Burak) dataset that contains 105 celebrity faces for this purpose. Burak. Pins face recognition dataset. URL: kaggle.com/datasets/hereisburak/pins-face-recognition. ...taken from the Food101 (Bossard et al., 2014) dataset... |
| Dataset Splits | No | The paper mentions using well-known datasets for evaluation (Caltech101, Stanford Cars, Oxford Flowers, Stanford Dogs, Pins Faces, Food101) but does not explicitly state the specific train/test/validation splits used for their experiments or if standard splits from these datasets were uniformly applied. The text focuses on evaluation methodology and metrics rather than dataset partitioning. |
| Hardware Specification | No | The paper mentions: "We ran experiments using two versions of CLIP where either ResNet50 or ViT-B/16 visual encoders were used." and "Acknowledgements: We'd like to gratefully acknowledge Microsoft's compute support through Microsoft's Accelerating Foundation Models Research grant and the support from University of Bath for the studentship." However, it does not provide specific details such as GPU models, CPU types, or memory configurations. |
| Software Dependencies | No | The paper mentions using "Adam optimizer" and refers to "CLIP" as a model. However, it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or specific library versions). |
| Experiment Setup | Yes | We fix λ1 and λ3 while λ2 is determined iteratively. At each iteration, we assess the reduction in the second component of the loss to evaluate whether the change in the projection matrix P is sufficient to project the forget class to the new chosen vector. We start from a fixed λ2 and increment it in small steps until the reduction in the second loss component exceeds 0.75% of its initial value. Additional implementation details are described in the Appendix. We ran experiments using two versions of CLIP where either ResNet50 or ViT-B/16 visual encoders were used. For both models we use a λ1 of 0.3, a λ3 of 1, and a varying λ2 with initial value of 1.1, incremented by 0.05 until the reduction in the second loss component exceeds 0.75% of its initial value. We optimize the low-rank matrices A and B of rank r = 5 for 2000 iterations using the Adam optimizer with a learning rate of 0.01, saving the weights that achieve the minimum loss. |
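The λ2 schedule described in the Experiment Setup row can be sketched as follows. This is a toy reconstruction, not the authors' implementation: the loss here is a stand-in quadratic (distance of the projected forget embedding from a chosen target vector), the embeddings are random unit vectors, plain gradient descent replaces Adam to keep the sketch dependency-free, and the cap on λ2 is a safety bound not mentioned in the paper. Only the schedule itself (start at 1.1, step by 0.05, stop once the second loss component drops by more than 0.75% of its initial value) and the rank-5 low-rank update P = P0 + A @ B follow the quoted text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 5                       # toy embedding dim; rank r = 5 as in the paper
P0 = np.eye(d)                     # frozen projection; A @ B is the low-rank update
forget = rng.normal(size=d)        # toy forget-class text embedding
forget /= np.linalg.norm(forget)   # CLIP embeddings are unit-normalized
target = rng.normal(size=d)        # chosen vector the forget class is projected to
target /= np.linalg.norm(target)

def second_loss(A, B):
    """Toy stand-in for the second (forget) loss component."""
    return float(np.sum(((P0 + A @ B) @ forget - target) ** 2))

def train(lam2, iters=2000, lr=0.01):
    """Optimize A, B by gradient descent (the paper uses Adam),
    keeping the weights that achieve the minimum loss."""
    A = 0.01 * rng.normal(size=(d, r))
    B = 0.01 * rng.normal(size=(r, d))
    best_loss, best_A, best_B = np.inf, A, B
    for _ in range(iters):
        resid = (P0 + A @ B) @ forget - target
        gP = 2.0 * np.outer(resid, forget)       # dL/dP for the toy loss
        A, B = A - lr * lam2 * gP @ B.T, B - lr * lam2 * A.T @ gP
        loss = second_loss(A, B)
        if loss < best_loss:
            best_loss, best_A, best_B = loss, A.copy(), B.copy()
    return best_loss, best_A, best_B

# Outer schedule: start lambda_2 at 1.1 and step by 0.05 until the second
# loss component drops by more than 0.75% of its initial value.
init = second_loss(np.zeros((d, r)), np.zeros((r, d)))
lam2 = 1.1
while True:
    loss, A, B = train(lam2)
    if (init - loss) / init > 0.0075 or lam2 > 3.0:  # cap is a safety bound only
        break
    lam2 += 0.05
```

On this toy objective the first λ2 already clears the 0.75% threshold; in the paper the check decides whether the change in P suffices to move the forget class to the chosen vector, so several increments may be needed.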