CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
Authors: Mohammadreza Salehi, Mehrdad Farajtabar, Maxwell Horton, Fartash Faghri, Hadi Pouransari, Raviteja Vemulapalli, Oncel Tuzel, Ali Farhadi, Mohammad Rastegari, Sachin Mehta
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We probe CLIPTeX and other pre-trained models on different downstream tasks and multiple datasets using classifier or regressor probes. This helps us understand if training with hard pseudo-labels from experts can improve the effectiveness of CLIP's image representations across different vision tasks. Experiments with multiple probes on a variety of vision tasks and datasets (e.g., segmentation on PASCAL VOC and ADE20k, detection on COCO, depth estimation on NYU-v2, classification on ImageNet-1k and Places-365, and surface normal estimation on NYU-v2) demonstrate the effectiveness of CLIPTeX. |
| Researcher Affiliation | Collaboration | Mohammadreza Salehi, University of Washington; Mehrdad Farajtabar, Maxwell Horton, Fartash Faghri, Hadi Pouransari, Raviteja Vemulapalli, Oncel Tuzel, Apple; Ali Farhadi, Allen Institute for Artificial Intelligence; Mohammad Rastegari, Sachin Mehta, Apple |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. The methodology is described in natural language and mathematical formulas. |
| Open Source Code | No | The paper does not contain any explicit statement about providing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We finetune pre-trained CLIP with and without pseudo-labels on CC3M (Sharma et al., 2018) for 30 epochs on 64 A100 GPUs. Semantic segmentation: we use PASCAL VOC (Everingham et al., 2010) with 20 classes and ADE20K (Zhou et al., 2019) with 150 classes for the task of semantic segmentation. Object detection and instance segmentation: we use the COCO dataset for detection and instance segmentation. Monocular depth estimation: we use the NYU-v2 (Silberman et al., 2012) dataset as our depth estimation benchmark. Image classification: we evaluate on two standard image classification datasets, i.e., ImageNet (Russakovsky et al., 2015) and Places365 (Zhou et al., 2017). Retrieval: Flickr-30k (Young et al., 2014). |
| Dataset Splits | Yes | Following a standard convention, we report the accuracy on the validation sets of these datasets in terms of mean intersection over union (mIoU). Following standard convention, we evaluate the accuracy on COCO's validation set in terms of mean average precision (mAP). We use absolute relative error as a metric for evaluation on the validation set. We evaluate on the official test set of NYU-v2. We use top-1 accuracy on the validation set as an evaluation metric. |
| Hardware Specification | Yes | Therefore, to show the efficacy of our approach, we finetune pre-trained CLIP with and without pseudo-labels on CC3M (Sharma et al., 2018) for 30 epochs on 64 A100 GPUs. |
| Software Dependencies | No | The paper mentions various models and architectures (e.g., Mask-RCNN, DPT, NLL-AngMF, DeepLabv3, PSPNet, SSD) but does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | Hyperparameters used during training and probing CLIPTeX and other models are given in Table 8 and Table 9, respectively. For selecting λclip and λtask (where task = {depth, seg, surface normal}) in Eq. (1), a linear search over the values 0.1, 1.0, and 10 was used for each task. We found that λclip = λtask = 1.0 worked well for all tasks except segmentation, where λseg = 0.1 delivered the best or close to the best performance. So, we set these hyper-parameters in our experiments. |
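The Experiment Setup row above describes a weighted combination of the CLIP loss and per-task pseudo-label losses, with weights λclip and λtask tuned by linear search. Since Eq. (1) itself is not reproduced in this report, the exact form of the objective is an assumption; the sketch below simply illustrates how the reported weights (λclip = λtask = 1.0, except λseg = 0.1) would combine scalar losses. The function name and signature are hypothetical.

```python
def combined_loss(clip_loss, task_losses, lambda_clip=1.0, lambda_task=None):
    """Hypothetical sketch of a weighted multi-task objective.

    clip_loss:   scalar CLIP contrastive loss.
    task_losses: dict mapping task name -> scalar pseudo-label loss.
    lambda_task: per-task weights; defaults follow the values reported
                 in the paper (1.0 everywhere, 0.1 for segmentation).
    """
    if lambda_task is None:
        lambda_task = {"depth": 1.0, "seg": 0.1, "surface_normal": 1.0}
    total = lambda_clip * clip_loss
    for name, loss in task_losses.items():
        total += lambda_task[name] * loss
    return total
```

For example, with a CLIP loss of 2.0 and a segmentation loss of 10.0, the default weights yield 1.0 * 2.0 + 0.1 * 10.0 = 3.0, showing how the down-weighted segmentation term contributes less to the total.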