Reasonable Effectiveness of Random Weighting: A Litmus Test for Multi-Task Learning

Authors: Baijiong Lin, Feiyang Ye, Yu Zhang, Ivor Tsang

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To show the effectiveness and necessity of RW methods, theoretically, we analyze the convergence of RW and reveal that RW has a higher probability to escape local minima, resulting in better generalization ability. Empirically, we extensively evaluate the proposed RW methods to compare with twelve state-of-the-art methods on five image datasets and two multilingual problems from the XTREME benchmark to show that RW methods can achieve comparable performance with state-of-the-art baselines.
Researcher Affiliation | Academia | 1) Department of Computer Science and Engineering, Southern University of Science and Technology; 2) Australian Artificial Intelligence Institute, University of Technology Sydney; 3) Centre for Frontier AI Research, A*STAR; 4) Peng Cheng Laboratory
Pseudocode | Yes | The training algorithms of both RW methods are summarized in Algorithm 1. The only difference between the RW methods and the existing works is the generation of loss/gradient weights (i.e., Line 7 in Algorithm 1).
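Per the quoted passage, only the weight-generation step differs from standard multi-task training. Below is a minimal NumPy sketch of random loss weighting, assuming the variant that draws weights from a standard normal distribution and normalizes them with a softmax each iteration; the function names and the specific distribution are illustrative, not the paper's exact implementation:

```python
import numpy as np

def random_loss_weights(num_tasks, rng=None):
    """Sample one set of loss weights per training iteration: draw from a
    standard normal and normalize with a softmax. (One choice of weight
    distribution; other sampling distributions could be substituted.)"""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(num_tasks)
    e = np.exp(z - z.max())  # numerically stable softmax
    return e / e.sum()

def weighted_loss(task_losses, weights):
    """Aggregate per-task losses into the scalar objective for backprop."""
    return float(np.dot(weights, task_losses))

# Example: three tasks, fresh weights each step
w = random_loss_weights(3, rng=np.random.default_rng(0))
total = weighted_loss([0.8, 1.2, 0.5], w)  # weights sum to 1
```

The rest of the training loop (forward pass, backward pass, optimizer step) is unchanged, which is what makes random weighting a drop-in replacement for equal weighting.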
Open Source Code | Yes | The implementations of the RW methods and the baseline methods are based on the open-source LibMTL library (Lin & Zhang, 2022).
Open Datasets | Yes | On five Computer Vision (CV) datasets and two multilingual problems from the XTREME benchmark (Hu et al., 2020), we show that RW methods can consistently outperform EW and have competitive performance with existing SOTA methods. ... We consider three image classification datasets: the Multi-MNIST (Sabour et al., 2017), Multi-Fashion, and Multi-(Fashion+MNIST) datasets (Lin et al., 2019). ... The NYUv2 dataset (Silberman et al., 2012) is an indoor scene understanding dataset... The XTREME benchmark (Hu et al., 2020) is a large-scale multilingual multi-task benchmark... The datasets used in the PI and POS tasks are the PAWS-X dataset (Yang et al., 2019) and the Universal Dependencies v2.5 treebanks (Nivre et al., 2020), respectively. ... The Cityscapes dataset (Cordts et al., 2016) is a large-scale urban street scene understanding dataset... The CelebA dataset (Liu et al., 2015) is a large-scale face attributes dataset... The Office-31 dataset (Saenko et al., 2010)... The Office-Home dataset (Venkateswara et al., 2017)
Dataset Splits | Yes | The Multi-MNIST dataset... we use 120K and 20K images for training and testing, respectively. ... The NYUv2 dataset... contains 795 and 654 images for training and testing, respectively. ... The XTREME benchmark... statistics for each language are summarized in Table 2. ... The Cityscapes dataset... contains 2,975 and 500 annotated images for training and testing, respectively. ... The CelebA dataset... is split into three parts: 162,770, 19,867, and 19,962 images for training, validation, and testing, respectively. ... The Office-31 dataset... We randomly split the whole dataset with 60% for training, 20% for validation, and the remaining 20% for testing. The Office-Home dataset... We use the same split as for the Office-31 dataset.
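The 60/20/20 random split described for Office-31 and Office-Home can be sketched as a seeded index permutation. Since the paper does not publish its exact split indices or random seed, the helper below is illustrative only:

```python
import numpy as np

def random_split(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Partition n sample indices into train/val/test subsets by the given
    fractions (sketch of a 60/20/20 random split; the actual seed and
    indices used in the paper are not published)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Example: 100 samples -> 60 train, 20 val, 20 test
train_idx, val_idx, test_idx = random_split(100)
```

Fixing the seed is what makes such a split reproducible across runs, which is precisely the detail a reproduction would need from the authors.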
Hardware Specification | Yes | All the experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU.
Software Dependencies | No | The implementations of the RW methods and the baseline methods are based on the open-source LibMTL library (Lin & Zhang, 2022). ... a pre-trained multilingual BERT (mBERT) model (Devlin et al., 2019) implemented via the open-source transformers library (Wolf et al., 2020)...
Experiment Setup | Yes | The SGD optimizer with the learning rate as 10^-3 and the momentum as 0.9 is used for training, the batch size is set to 256, and the training epoch is set to 100. The cross-entropy loss is used for each task. ... For the NYUv2 dataset... The Adam optimizer (Kingma & Ba, 2015) with the learning rate as 10^-4 and the weight decay as 10^-5 is used for training and the batch size is set to 8. We use the cross-entropy loss, L1 loss, and cosine loss as the loss function... For each multilingual problem in the XTREME benchmark... The Adam optimizer with the learning rate as 2×10^-5 and the weight decay as 10^-8 is used for training and the batch size is set to 32. The cross-entropy loss is used for the two multilingual problems. ... We use the Adam optimizer with the learning rate as 10^-4 and the weight decay as 10^-5 and set the batch size to 128 for training. The cross-entropy loss is used for all tasks in both datasets.
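For quick reference, the per-benchmark hyperparameters quoted above can be gathered into one config sketch. The dictionary keys and layout are ours; an actual LibMTL run would pass these values as command-line arguments or config fields rather than this structure:

```python
# Hyperparameters as quoted in the experiment setup (dataset keys are
# our own labels, not identifiers from the paper or LibMTL).
SETUPS = {
    "multi_mnist": {"optimizer": "SGD",  "lr": 1e-3, "momentum": 0.9,
                    "batch_size": 256, "epochs": 100,
                    "losses": ["cross-entropy"]},
    "nyuv2":       {"optimizer": "Adam", "lr": 1e-4, "weight_decay": 1e-5,
                    "batch_size": 8,
                    "losses": ["cross-entropy", "L1", "cosine"]},
    "xtreme":      {"optimizer": "Adam", "lr": 2e-5, "weight_decay": 1e-8,
                    "batch_size": 32,
                    "losses": ["cross-entropy"]},
    "office":      {"optimizer": "Adam", "lr": 1e-4, "weight_decay": 1e-5,
                    "batch_size": 128,
                    "losses": ["cross-entropy"]},
}
```

Note the settings the quote does not pin down (e.g., learning-rate schedules or epoch counts for the non-image benchmarks) are left out rather than guessed.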