CR-MoE: Consistent Routed Mixture-of-Experts for Scaling Contrastive Learning
Authors: Ziyu Jiang, Guoqing Zheng, Yu Cheng, Ahmed Hassan Awadallah, Zhangyang Wang
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings validate CR-MoE as an effective and efficient image representation learner. Code is available at https://github.com/VITA-Group/CRMoE. Extensive experiments verify the effectiveness of the proposed regularization term. Compared to competitive state-of-the-art CL methods on ViT, the proposed CR-MoE achieves an improvement of 2.8 points at the same computational cost. Pre-training Our pre-training experiments are conducted on ImageNet-1K (Deng et al., 2009) following common practice (Chen et al., 2020a; He et al., 2020). |
| Researcher Affiliation | Collaboration | Ziyu Jiang EMAIL Texas A&M University Guoqing Zheng EMAIL Microsoft Research Yu Cheng EMAIL The Chinese University of Hong Kong Ahmed Hassan Awadallah EMAIL Microsoft Research Zhangyang Wang EMAIL University of Texas at Austin |
| Pseudocode | No | The paper describes the proposed method, CR-MoE, through text and a pipeline diagram (Figure 2). However, it does not contain a formally structured pseudocode or algorithm block. |
| Open Source Code | Yes | Code is available at https://github.com/VITA-Group/CRMoE. |
| Open Datasets | Yes | Our pre-training experiments are conducted on ImageNet-1K (Deng et al., 2009) following common practice (Chen et al., 2020a; He et al., 2020). For transfer few-shot learning, we consider 4-shot and 10-shot settings for three datasets: CIFAR10 (Krizhevsky et al., 2009), Pet37 (Parkhi et al., 2012) and Food101 (Bossard et al., 2014). |
| Dataset Splits | Yes | For semi-supervised learning, we consider 1% or 10% available labels (following the sampling in Chen et al. (2020b)) of ImageNet. For transfer few-shot learning, we consider 4-shot and 10-shot settings for three datasets: CIFAR10 (Krizhevsky et al., 2009), Pet37 (Parkhi et al., 2012) and Food101 (Bossard et al., 2014). |
| Hardware Specification | Yes | Models are pre-trained on 32 Nvidia V100 GPUs. For inference of a single image on one A6000 GPU, the time costs are 1.25ms and 1.07ms for VMoE-S/16 and ViT-S/16, respectively. For training a batch of 1024 images on 8 A6000 GPUs, the time costs are 1.579s and 1.425s for VMoE-S/16 and ViT-S/16, respectively. |
| Software Dependencies | No | Our implementation is based on the PyTorch (Paszke et al., 2019) and FastMoE (He et al., 2021a) libraries. The paper mentions PyTorch and FastMoE but does not specify their version numbers. |
| Experiment Setup | Yes | For the pre-training framework, we employ MoCo v3 (Chen et al., 2021b), and we follow the same settings as MoCo v3 on data augmentations and learning specification: 3-layer MLP projection head, temperature τ = 0.2, momentum m = 0.99, random patch projection, cosine decay schedule (Loshchilov & Hutter, 2016), and 40-epoch warmup. For optimization, we employ the AdamW (Loshchilov & Hutter, 2017) optimizer and a weight decay of 0.1. ... The best searched lr is 5.0e-4 × BatchSize/256. For model ablations, we employ a shorter schedule of 100 epochs with a relatively small batch size of 1024. When comparing with state-of-the-art methods, we scale up and employ 300 epochs with a batch size of 3072. For the MoE network, we by default employ 16 expert candidates (n_e = 16) and always activate 2 of them (k = 2). For the employed loss terms, we employ λ = 0.2, α = 0.3, w_lb = 0.01 and w_G = 0.001, which are searched on 100-epoch training. |
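The reported setup can be summarized as a small configuration sketch. This is illustrative only, assembled from the numbers quoted above; the names (`pretrain_config`, `scaled_lr`) are ours and not taken from the released CR-MoE code.

```python
# Hedged sketch of the reported CR-MoE pre-training hyperparameters.
# All values come from the paper's Experiment Setup; the structure is ours.

def scaled_lr(base_lr: float, batch_size: int) -> float:
    """Linear lr scaling rule reported in the paper: lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / 256

pretrain_config = {
    # MoCo v3 contrastive settings
    "temperature": 0.2,
    "momentum": 0.99,
    "warmup_epochs": 40,
    "lr_schedule": "cosine",
    # AdamW optimization
    "base_lr": 5.0e-4,       # scaled by batch_size / 256
    "weight_decay": 0.1,
    # MoE routing
    "num_experts": 16,       # n_e = 16 expert candidates
    "top_k": 2,              # k = 2 experts activated
    # loss weights (searched on 100-epoch training)
    "lambda": 0.2,
    "alpha": 0.3,
    "w_lb": 0.01,
    "w_G": 0.001,
}

# Ablation schedule: 100 epochs, batch 1024 -> lr = 5e-4 * 4 = 2e-3
print(scaled_lr(pretrain_config["base_lr"], 1024))  # → 0.002
```

Under this rule, the 300-epoch comparison runs with batch size 3072 would use lr = 5.0e-4 × 12 = 6.0e-3.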