MobileCLIP2: Improving Multi-Modal Reinforced Training

Authors: Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, Hadi Pouransari

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with the MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2× smaller and improves on DFN ViT-L/14 at 2.5× lower latency."
Researcher Affiliation | Industry | "Fartash Faghri EMAIL Apple; Pavan Kumar Anasosalu Vasu EMAIL Apple; Cem Koc EMAIL Apple; Vaishaal Shankar (work done while at Apple); Alexander Toshev EMAIL Apple; Oncel Tuzel EMAIL Apple; Hadi Pouransari EMAIL Apple"
Pseudocode | No | "The paper describes methods and equations, such as L_Distill and L_Total, but does not present any structured pseudocode or algorithm blocks."
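Although the paper gives only equations, the distillation objective it names (L_Distill, folded into L_Total) is a KL divergence between teacher and student image-text similarity distributions over a batch. The sketch below is an illustrative reimplementation, not the authors' code; the logit scales and function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(s_img, s_txt, t_img, t_txt, s_scale=100.0, t_scale=70.0):
    """KL divergence between teacher and student batch similarity
    distributions, averaged over the image->text and text->image
    directions. Embeddings are assumed L2-normalized."""
    s_logits = s_scale * s_img @ s_txt.T
    t_logits = t_scale * t_img @ t_txt.T
    loss = 0.0
    for s, t in ((s_logits, t_logits), (s_logits.T, t_logits.T)):
        p_t = softmax(t)  # teacher's soft matching distribution
        loss += (p_t * (np.log(p_t + 1e-12)
                        - np.log(softmax(s) + 1e-12))).sum(-1).mean()
    return loss / 2.0
```

When student and teacher produce identical similarity distributions, the loss is zero; it grows as the student's matching distribution drifts from the teacher's.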
Open Source Code | Yes | "We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing."
Open Datasets | Yes | "We train a new family of models, MobileCLIP2, that establishes new state-of-the-art ImageNet-1k accuracy at a range of latencies, matching the performance of larger SigLIP (Zhai et al., 2023) and DFN (Fang et al., 2024a) models while up to 4× smaller (our MobileCLIP2-S2 compared with SigLIP2-B/32) and up to 2.5× faster (our MobileCLIP2-S4 compared with DFN ViT-L/14). DFN (Fang et al., 2024a) proposed to filter data using a filtering network trained on high-quality data. Applying their model on the DataComp-12B pool resulted in the DFN-2B dataset. They additionally collected a larger set of images from the web disjoint from DataComp-12B, which after filtering yielded another 3B samples, collectively forming the DFN-5B dataset. DataComp (Gadre et al., 2023) demonstrated that the quality of large-scale image-text datasets can be significantly improved through filtering based on scores such as compatibility of image and text. We use OpenCLIP to train the CoCa-ViT-L/14 architecture (coca_ViT-L-14). We pretrain models on DFN-2B and fine-tune on various datasets. Table 17 summarizes the hyperparameters for our CoCa pretraining and fine-tuning."
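The score-based filtering that DataComp and DFN describe (keeping only pairs whose image-text compatibility is high) can be illustrated with a minimal sketch; the function name and threshold here are hypothetical, not the published pipeline.

```python
import numpy as np

def filter_by_clip_score(image_embs, text_embs, threshold=0.3):
    """Return indices of image-text pairs whose cosine similarity under a
    filtering network exceeds a threshold (DataComp/DFN-style sketch;
    the 0.3 threshold is an illustrative assumption)."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = (image_embs * text_embs).sum(axis=1)  # per-pair cosine similarity
    return np.nonzero(scores > threshold)[0]
```

In practice the embeddings come from a filtering network trained on high-quality data, and the surviving indices define the filtered dataset.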
Dataset Splits | Yes | "We create the reinforced dataset, DFNDR-2B, which contains five synthetic captions generated from our CoCa-ViT-L/14 model pretrained on DFN-2B and fine-tuned on MSCOCO-38K. All evaluations reported in the main paper are from single-scale evaluations on the MSCOCO validation set following prior works. We evaluate their performance on 38 zero-shot classification tasks (Gadre et al., 2023). In all ablations, we train MobileCLIP-B for 30k iterations (~20 epochs) on datasets with 12.8M images. We provide a summary of datasets in this paper in Tab. 15."

Table 15: Summary of pretraining datasets.

    Dataset            Num. Samples
    DataComp-1B-12M    12.8M
    DFN-2B-12M         12.8M
    DFN-5B-12M         12.8M
    DataCompDR-12M     12.8M
    DFNDR-2B-12M       12.8M
    DFNDR-5B-12M       12.8M
    DataComp-1B        1.3B
    DFN-2B             1.9B
    DataCompDR-1B      1.3B
    DFNDR-2B           1.9B
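A record in a reinforced ("DR") dataset, as described above, pairs the original image and caption with precomputed teacher outputs. A minimal sketch of such a schema follows; the field names are assumptions for illustration, not the released on-disk format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReinforcedSample:
    """One reinforced-dataset record (hypothetical schema)."""
    # Original web-crawled pair.
    image_path: str
    gt_caption: str
    # Captions generated by the CoCa captioner (five per image in DFNDR-2B).
    synthetic_captions: List[str]
    # Precomputed embeddings from the CLIP teacher ensemble, one per teacher.
    teacher_image_embs: List[List[float]]
    teacher_text_embs: List[List[float]]
```

Storing teacher embeddings alongside the captions is what lets training consume distillation targets without re-running the teachers.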
Hardware Specification | Yes | "For training on 13B seen samples, we use either a setup with 32×8 A100-40GB GPUs or a setup with 16×8 H100-80GB GPUs. For our ablations we train for 30k iterations using 4×8 H100-80GB GPUs and a global batch size of 8192. GPU setup: 32×8 A100-40GB; 1×8 H100-80GB."
Software Dependencies | No | "All models were trained using the MMDetection library (Chen et al., 2019) on a single node with 8 NVIDIA A100 GPUs. All models were trained using the MMSegmentation library (Contributors, 2020) on a single node with 8 NVIDIA A100 GPUs. We use OpenCLIP (Ilharco et al., 2021) to train the CoCa-ViT-L/14 architecture (coca_ViT-L-14). We report vision-language evaluations using MobileCLIP2 pretrained models in the LLaVA-1.5 setup (Liu et al., 2024a)."
Experiment Setup | Yes | Table 16: Training hyperparameters for our CLIP experiments on DFNDR-2B.

    Hyperparameter               S0       S2       B        S3       S4
    Input resolution             256²     256²     224²     256²     256²
    Context length               77       77       77       77       77
    Data augmentation            RandAugment
    Random resize crop scale     [0.08, 1.0]
    Random resized crop ratio    [0.75, 1.33]
    RangeAugment target value    (40, 20)
    Train iterations             200k
    Warmup iterations            10k      10k      2k       2k       2k
    Global batch size            65536    65536    65536    114688   114688
    Optimizer                    AdamW
    AdamW beta1                  0.9
    AdamW beta2                  0.95
    Max learning rate            1e-3
    Min learning rate            1e-6     1e-6     1e-6     0        0
    LR decay schedule            cosine
    Weight decay rate            0.2
    Gradient clipping            1.0
    Mixed precision              BFloat16
    EMA decay rate               0.9995   No EMA   No EMA   No EMA   No EMA
    CLIP loss weight             0.0      0.0      0.0      0.0      0.0
    KD loss weight               1.0      1.0      1.0      1.0      1.0
    GT caption weight            1.0
    Synth. caption weight        1.0
    Synth. caption teacher       CoCa-ViT-L/14 (DFN-2B, fine-tuned on MSCOCO-38k)
    Teacher 1                    DFN2B-CLIP-ViT-L-14-s39b
    Teacher 2                    DFN2B-CLIP-ViT-L-14
    Teacher 1 logit scale        70.0
    Teacher 2 logit scale        60.0
    Teacher resolution           224 (both teachers)
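Given the weights in Table 16 (CLIP loss weight 0.0, KD loss weight 1.0, equal ground-truth and synthetic caption weights), the total training objective reduces to pure distillation over both caption types. A hedged sketch of that weighting follows; the exact combination formula in the paper may differ, and the default values simply mirror the table.

```python
def total_loss(clip_loss, kd_loss_gt, kd_loss_synth,
               clip_weight=0.0, kd_weight=1.0,
               gt_weight=1.0, synth_weight=1.0):
    """Weighted sum of the contrastive (CLIP) loss and the distillation
    (KD) losses on ground-truth and synthetic captions. With the Table 16
    defaults, the CLIP term vanishes and training is driven purely by KD."""
    norm = gt_weight + synth_weight
    kd = (gt_weight * kd_loss_gt + synth_weight * kd_loss_synth) / norm
    return clip_weight * clip_loss + kd_weight * kd
```

Setting `clip_weight` above zero would recover a mixed contrastive-plus-distillation objective, which the table shows the authors did not use for these runs.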