CR-MoE: Consistent Routed Mixture-of-Experts for Scaling Contrastive Learning

Authors: Ziyu Jiang, Guoqing Zheng, Yu Cheng, Ahmed Hassan Awadallah, Zhangyang Wang

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our findings validate CR-MoE as an effective and efficient image representation learner. Code is available at https://github.com/VITA-Group/CRMoE. Extensive experiments verify the effectiveness of the proposed regularization term. Compared to competitive state-of-the-art CL methods on ViT, the proposed CR-MoE achieves an improvement of 2.8 points at the same computational cost.
Researcher Affiliation | Collaboration | Ziyu Jiang (EMAIL), Texas A&M University; Guoqing Zheng (EMAIL), Microsoft Research; Yu Cheng (EMAIL), The Chinese University of Hong Kong; Ahmed Hassan Awadallah (EMAIL), Microsoft Research; Zhangyang Wang (EMAIL), University of Texas at Austin
Pseudocode | No | The paper describes the proposed method, CR-MoE, through text and a pipeline diagram (Figure 2). However, it does not contain a formally structured pseudocode or algorithm block.
Open Source Code | Yes | Code is available at https://github.com/VITA-Group/CRMoE.
Open Datasets | Yes | Our pre-training experiments are conducted on ImageNet-1K (Deng et al., 2009) following common practice (Chen et al., 2020a; He et al., 2020). For transfer few-shot learning, we consider 4-shot and 10-shot settings for three datasets: CIFAR10 (Krizhevsky et al., 2009), Pet37 (Parkhi et al., 2012), and Food101 (Bossard et al., 2014).
Dataset Splits | Yes | For semi-supervised learning, we consider 1% or 10% available labels of ImageNet (following the sampling in Chen et al. (2020b)). For transfer few-shot learning, we consider 4-shot and 10-shot settings for three datasets: CIFAR10 (Krizhevsky et al., 2009), Pet37 (Parkhi et al., 2012), and Food101 (Bossard et al., 2014).
Hardware Specification | Yes | Models are pre-trained on 32 Nvidia V100 GPUs. For inference of a single image on one A6000 GPU, the time costs are 1.25 ms and 1.07 ms for VMoE-S/16 and ViT-S/16, respectively. For training a batch of 1024 images on 8 A6000 GPUs, the time costs are 1.579 s and 1.425 s for VMoE-S/16 and ViT-S/16, respectively.
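The timings above quantify the cost of the MoE backbone relative to the dense ViT. A quick sanity check (not from the paper; computed here from the reported numbers) of the fractional slowdown:

```python
# Relative time overhead of VMoE-S/16 over plain ViT-S/16,
# computed from the timings reported in the hardware row.

def overhead(moe_time: float, dense_time: float) -> float:
    """Fractional slowdown of the MoE model relative to the dense one."""
    return moe_time / dense_time - 1.0

inference = overhead(1.25, 1.07)    # single image, one A6000 GPU
training = overhead(1.579, 1.425)   # batch of 1024 images, 8 A6000 GPUs

print(f"inference overhead: {inference:.1%}")  # about 17%
print(f"training overhead:  {training:.1%}")   # about 11%
```

So the per-step wall-clock cost of the MoE model is modestly higher than the dense baseline, even though only 2 of the 16 experts are active per token.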
Software Dependencies | No | Our implementation is based on PyTorch (Paszke et al., 2019) and the FastMoE (He et al., 2021a) library. The paper mentions PyTorch and FastMoE but does not specify their version numbers.
Experiment Setup | Yes | For the pre-training framework, we employ MoCo v3 (Chen et al., 2021b) and follow the same settings as MoCo v3 for data augmentation and learning specification: 3-layer MLP projection head, temperature τ = 0.2, momentum m = 0.99, random patch projection, cosine decay schedule (Loshchilov & Hutter, 2016), and 40-epoch warmup. For optimization, we employ the AdamW (Loshchilov & Hutter, 2017) optimizer with a weight decay of 0.1. ... The best searched lr is 5.0e-4 × BatchSize/256. For model ablations, we employ a shorter schedule of 100 epochs with a relatively small batch size of 1024. When comparing with state-of-the-art methods, we scale up to 300 epochs with a batch size of 3072. For the MoE network, we by default employ 16 expert candidates (n_e = 16) and always activate 2 of them (k = 2). For the loss terms, we employ λ = 0.2, α = 0.3, w_lb = 0.01, and w_G = 0.001, which are searched on 100-epoch training.
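The setup row can be collected into a single configuration sketch. This is not the authors' code; the dict keys are illustrative names, and the values are the hyperparameters reported above, including the linear learning-rate scaling rule lr = 5.0e-4 × BatchSize/256:

```python
# Hedged sketch of the reported CR-MoE pre-training hyperparameters.
# Key names are illustrative; values come from the experiment-setup row.

def scaled_lr(base_lr: float, batch_size: int, base_batch: int = 256) -> float:
    """Linear LR scaling: lr = base_lr * batch_size / base_batch."""
    return base_lr * batch_size / base_batch

config = {
    "optimizer": "AdamW",
    "weight_decay": 0.1,
    "temperature": 0.2,     # contrastive temperature tau
    "momentum": 0.99,       # MoCo momentum m
    "warmup_epochs": 40,
    "num_experts": 16,      # n_e: expert candidates per MoE layer
    "top_k": 2,             # k: experts activated per input
    "lambda": 0.2,
    "alpha": 0.3,
    "w_lb": 0.01,           # load-balancing loss weight
    "w_G": 0.001,
}

# Ablation schedule: 100 epochs, batch size 1024
print(scaled_lr(5.0e-4, 1024))  # about 0.002
# State-of-the-art comparison: 300 epochs, batch size 3072
print(scaled_lr(5.0e-4, 3072))  # about 0.006
```

Note that the loss weights (λ, α, w_lb, w_G) were searched only on the shorter 100-epoch schedule and then reused for the scaled-up 300-epoch runs.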