Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization

Authors: Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Applying these gradient harmonization techniques to pre-training VATT on the HowTo100M dataset consistently improves its performance on different downstream tasks. Moreover, VATT pre-training scales to the more complicated, non-narrative YouTube-8M dataset, further improving the state of the art. (Section 4: Experiments, and performance tables such as Table 1)
Researcher Affiliation | Collaboration | Junru Wu, Texas A&M University, EMAIL; Yi Liang, Google Research, EMAIL; Feng Han, Google Research, EMAIL; Hassan Akbari, Google Research, EMAIL; Zhangyang Wang, University of Texas at Austin, EMAIL; Cong Yu, Celonis Inc. / Celo AI, EMAIL
Pseudocode | Yes | Algorithm 1: Cross-Modality Gradient Realignment; Algorithm 2: Gradient-based Curriculum Learning
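The report names Algorithm 1 (Cross-Modality Gradient Realignment) but does not reproduce it. As a rough illustration only, a PCGrad-style projection, one common way to realign conflicting per-task or per-modality gradients, can be sketched as below; the paper's exact Algorithm 1 may differ, and the function name and structure here are assumptions:

```python
import numpy as np

def realign_gradients(grads):
    """PCGrad-style realignment sketch (hypothetical, not the paper's
    exact Algorithm 1): for each gradient, if it conflicts with another
    modality's gradient (negative dot product), project out the
    conflicting component so the directions no longer oppose each other."""
    out = [g.astype(float).copy() for g in grads]
    for i, g_i in enumerate(out):
        for j, g_j in enumerate(grads):
            if i == j:
                continue
            dot = float(np.dot(g_i, g_j))
            if dot < 0:  # gradients point in conflicting directions
                g_i -= dot / (np.dot(g_j, g_j) + 1e-12) * g_j
    return out

# Two conflicting gradients; after realignment their dot product is non-negative.
g = realign_gradients([np.array([1.0, 0.0]), np.array([-1.0, 1.0])])
```

The key property is that pairwise conflicts are removed while the non-conflicting component of each gradient is preserved.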
Open Source Code | No | The paper does not include any statement about releasing source code or provide a link to a code repository for its methodology.
Open Datasets | Yes | HowTo100M [12] is a large-scale dataset of narrated videos... AudioSet [24] is a large-scale audio-visual dataset... YouTube-8M [1] is a large-scale video classification dataset...
Dataset Splits | No | The paper mentions using subsets of datasets and sampling clips but does not provide specific numerical train/validation/test splits (e.g., percentages or exact counts) to reproduce the data partitioning.
Hardware Specification | Yes | Our framework is implemented in TensorFlow 2.8 and trained with 256 TPUv3s; training took a total of 3 days.
Software Dependencies | Yes | Our framework is implemented in TensorFlow 2.8.
Experiment Setup | Yes | Pre-training hyperparameters: We strictly follow the setting in [11], pre-training VATT from scratch with the Adam optimizer, an initial learning rate of 1e-4, 10k warmup steps, 500k total steps, a batch size of 2048, and a cosine learning rate scheduler annealing the learning rate from 1e-4 to 5e-5.
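The quoted schedule (linear warmup over 10k steps to 1e-4, then cosine annealing down to 5e-5 over 500k total steps) can be sketched as follows; the function and parameter names are illustrative, not taken from the paper:

```python
import math

def vatt_lr(step, base_lr=1e-4, final_lr=5e-5,
            warmup_steps=10_000, total_steps=500_000):
    """Learning-rate schedule matching the hyperparameters quoted from
    the paper: linear warmup to base_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        # linear ramp from 0 to base_lr over the warmup phase
        return base_lr * step / warmup_steps
    # cosine anneal from base_lr down to final_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (base_lr - final_lr) * cosine
```

The schedule peaks at 1e-4 when warmup ends (step 10k) and decays smoothly to 5e-5 at step 500k.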