$f$-MICL: Understanding and Generalizing InfoNCE-based Contrastive Learning
Authors: Yiwei Lu, Guojun Zhang, Sun Sun, Hongyu Guo, Yaoliang Yu
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using benchmark tasks from both vision and natural language, we empirically evaluate f-MICL with different f-divergences on various architectures (SimCLR, MoCo, and MoCo v3) and datasets. We observe that f-MICL generally outperforms the benchmarks and the best-performing f-divergence is task and dataset dependent. |
| Researcher Affiliation | Collaboration | Yiwei Lu (School of Computer Science, University of Waterloo; Vector Institute); Guojun Zhang (Huawei Noah's Ark Lab); Sun Sun (School of Computer Science, University of Waterloo; National Research Council Canada); Hongyu Guo (National Research Council Canada; University of Ottawa); Yaoliang Yu (School of Computer Science, University of Waterloo; Vector Institute) |
| Pseudocode | Yes | Algorithm 1: f-MICL. Input: batch size N, function f, weighting parameter α, constant µ (in G_σ), variance σ² |
| Open Source Code | No | In this paper, we follow the implementations in SimCLR (https://github.com/sthalles/SimCLR) and MoCo v3 (https://github.com/facebookresearch/moco-v3). For fair comparison we use the experimental settings in Table 7 for all the baseline methods, which might differ from the original settings. Table 7 gives common choices of hyperparameters for different datasets. Note that we may need to further finetune α and σ for different f-divergences. See our supplementary code for more details. |
| Open Datasets | Yes | Our vision datasets include CIFAR-10 (Krizhevsky et al., 2009), STL-10 (Coates et al., 2011), Tiny ImageNet (Chrabaszcz et al., 2017), and ImageNet (Deng et al., 2009) for image classification. To show the wide applicability of our f-MICL framework, we also conduct experiments on a natural language dataset, English Wikipedia (Gao et al., 2021). |
| Dataset Splits | Yes | Evaluation metric: for vision tasks, we use k-nearest-neighbor (k-NN) evaluation (only for small datasets) and linear evaluation to evaluate the performance, based on the learned embeddings. For each sample in a dataset we create a sample pair, a.k.a. positive pair, using two different augmentation functions. For image samples, we choose the augmentation functions to be the standard ones in contrastive learning, e.g., in Chen et al. (2020) and He et al. (2020). |
| Hardware Specification | Yes | Hardware and package: We train on a GPU cluster with NVIDIA T4 and P100. |
| Software Dependencies | No | Hardware and package: We train on a GPU cluster with NVIDIA T4 and P100. The platform we use is PyTorch. Specifically, the pairwise summation can be easily implemented using torch.nn.functional.pdist from PyTorch. |
| Experiment Setup | Yes | Batch size and embedding dimension: for experiments in CIFAR-10 we choose batch size 512; for STL-10 we choose batch size 64 to accommodate one-GPU training; for Tiny ImageNet, we choose batch size 256; for ImageNet, we choose batch size 1024. For all the vision datasets, we choose the embedding dimension to be 512. Regarding the language dataset, the batch size is 64 with the feature dimension 768. Hyperparameters: in all our experiments we fix the constant factor µ = 1. We find that in practice the weight parameter α often needs to be large (e.g., in the Wikipedia dataset), which requires moderate tuning. Optimizer and learning rate scheduler: For smaller vision tasks, we use SGD with momentum for optimization and the cosine learning rate scheduler (Loshchilov & Hutter, 2017). For the ImageNet task and natural language task, we use Adam with weight decay (Loshchilov & Hutter, 2018) and the linear decay scheduler. Table 7 gives common choices of hyperparameters for different datasets. |
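The paper's remark that "the pairwise summation can be easily implemented using torch.nn.functional.pdist" can be sketched as follows. This is an illustrative reading, not the authors' actual code: the function name `gaussian_pairwise_similarity` and the exact Gaussian form `µ · exp(−‖z_i − z_j‖² / (2σ²))` for G_σ are assumptions based on the hyperparameters named in Algorithm 1 (µ, σ²).

```python
import torch
import torch.nn.functional as F

def gaussian_pairwise_similarity(z, mu=1.0, sigma=1.0):
    """Gaussian similarity over all distinct pairs in a batch.

    Illustrative sketch only: assumes G_sigma(z_i, z_j) =
    mu * exp(-||z_i - z_j||^2 / (2 * sigma^2)), which may differ
    from the paper's exact kernel.

    z: (N, d) tensor of embeddings.
    Returns a flat tensor of length N*(N-1)/2, one value per pair,
    in the upper-triangle order used by torch.nn.functional.pdist.
    """
    # pdist returns the Euclidean distance for each of the
    # N*(N-1)/2 distinct row pairs in one call -- this is the
    # "pairwise summation" shortcut the paper refers to.
    d = F.pdist(z)                                   # shape: (N*(N-1)/2,)
    return mu * torch.exp(-d.pow(2) / (2 * sigma ** 2))

# Example with the paper's vision embedding dimension of 512.
z = F.normalize(torch.randn(8, 512), dim=1)
sims = gaussian_pairwise_similarity(z)
print(sims.shape)  # torch.Size([28]), i.e. 8*7/2 pairs
```

Summing (or averaging) `sims` then gives the pairwise term over a batch without materializing the full N×N similarity matrix.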