Reasonable Effectiveness of Random Weighting: A Litmus Test for Multi-Task Learning

Authors: Baijiong Lin, Feiyang Ye, Yu Zhang, Ivor Tsang

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To show the effectiveness and necessity of RW methods, theoretically, we analyze the convergence of RW and reveal that RW has a higher probability to escape local minima, resulting in better generalization ability. Empirically, we extensively evaluate the proposed RW methods to compare with twelve state-of-the-art methods on five image datasets and two multilingual problems from the XTREME benchmark to show that RW methods can achieve comparable performance with state-of-the-art baselines.
Researcher Affiliation | Academia | 1) Department of Computer Science and Engineering, Southern University of Science and Technology; 2) Australian Artificial Intelligence Institute, University of Technology Sydney; 3) Centre for Frontier AI Research, A*STAR; 4) Peng Cheng Laboratory
Pseudocode | Yes | The training algorithms of both RW methods are summarized in Algorithm 1. The only difference between the RW methods and the existing works is the generation of loss/gradient weights (i.e., Line 7 in Algorithm 1).
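Per the quoted passage, only the weight-generation step differs from standard multi-task training. Below is a minimal NumPy sketch of random loss weighting, assuming the variant that draws weights from a standard normal distribution and normalizes them with a softmax each iteration; the function names and the specific distribution are illustrative, not the paper's exact implementation:

```python
import numpy as np

def random_loss_weights(num_tasks, rng=None):
    """Sample one set of loss weights per training iteration: draw from a
    standard normal and normalize with a softmax. (One choice of weight
    distribution; other sampling distributions could be substituted.)"""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(num_tasks)
    e = np.exp(z - z.max())  # numerically stable softmax
    return e / e.sum()

def weighted_loss(task_losses, weights):
    """Aggregate per-task losses into the scalar objective for backprop."""
    return float(np.dot(weights, task_losses))

# Example: three tasks, fresh weights each step
w = random_loss_weights(3, rng=np.random.default_rng(0))
total = weighted_loss([0.8, 1.2, 0.5], w)  # weights sum to 1
```

The rest of the training loop (forward pass, backward pass, optimizer step) is unchanged, which is what makes random weighting a drop-in replacement for equal weighting.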
Open Source Code | Yes | The implementations of the RW methods and the baseline methods are based on the open-source LibMTL library (Lin & Zhang, 2022).
Open Datasets | Yes | On five Computer Vision (CV) datasets and two multilingual problems from the XTREME benchmark (Hu et al., 2020), we show that RW methods can consistently outperform EW and have competitive performance with existing SOTA methods. ... We consider three image classification datasets: the Multi-MNIST (Sabour et al., 2017), Multi-Fashion, and Multi-(Fashion+MNIST) datasets (Lin et al., 2019). ... The NYUv2 dataset (Silberman et al., 2012) is an indoor scene understanding dataset... The XTREME benchmark (Hu et al., 2020) is a large-scale multilingual multi-task benchmark... The datasets used in the PI and POS tasks are the PAWS-X dataset (Yang et al., 2019) and the Universal Dependencies v2.5 treebanks (Nivre et al., 2020), respectively. ... The Cityscapes dataset (Cordts et al., 2016) is a large-scale urban street scene understanding dataset... The CelebA dataset (Liu et al., 2015) is a large-scale face attributes dataset... The Office-31 dataset (Saenko et al., 2010)... The Office-Home dataset (Venkateswara et al., 2017)
Dataset Splits | Yes | The Multi-MNIST dataset... we use 120K and 20K images for training and testing, respectively. ... The NYUv2 dataset... contains 795 and 654 images for training and testing, respectively. ... The XTREME benchmark... statistics for each language are summarized in Table 2. ... The Cityscapes dataset... contains 2,975 and 500 annotated images for training and testing, respectively. ... The CelebA dataset... is split into three parts: 162,770, 19,867, and 19,962 images for training, validation, and testing, respectively. ... The Office-31 dataset... We randomly split the whole dataset with 60% for training, 20% for validation, and the remaining 20% for testing. The Office-Home dataset... We use the same split as for the Office-31 dataset.
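The 60/20/20 random split described for Office-31 and Office-Home can be sketched as a seeded index permutation. Since the paper does not publish its exact split indices or random seed, the helper below is illustrative only:

```python
import numpy as np

def random_split(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Partition n sample indices into train/val/test subsets by the given
    fractions (sketch of a 60/20/20 random split; the actual seed and
    indices used in the paper are not published)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Example: 100 samples -> 60 train, 20 val, 20 test
train_idx, val_idx, test_idx = random_split(100)
```

Fixing the seed is what makes such a split reproducible across runs, which is precisely the detail a reproduction would need from the authors.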
Hardware Specification | Yes | All the experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU.
Software Dependencies | No | The implementations of the RW methods and the baseline methods are based on the open-source LibMTL library (Lin & Zhang, 2022). ... a pre-trained multilingual BERT (mBERT) model (Devlin et al., 2019) implemented via the open-source transformers library (Wolf et al., 2020)...
Experiment Setup | Yes | The SGD optimizer with the learning rate as 10^-3 and the momentum as 0.9 is used for training, the batch size is set to 256, and the training epoch is set to 100. The cross-entropy loss is used for each task. ... For the NYUv2 dataset... The Adam optimizer (Kingma & Ba, 2015) with the learning rate as 10^-4 and the weight decay as 10^-5 is used for training and the batch size is set to 8. We use the cross-entropy loss, L1 loss, and cosine loss as the loss function... For each multilingual problem in the XTREME benchmark... The Adam optimizer with the learning rate as 2×10^-5 and the weight decay as 10^-8 is used for training and the batch size is set to 32. The cross-entropy loss is used for the two multilingual problems. ... We use the Adam optimizer with the learning rate as 10^-4 and the weight decay as 10^-5 and set the batch size to 128 for training. The cross-entropy loss is used for all tasks in both datasets.
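For quick reference, the per-benchmark hyperparameters quoted above can be gathered into one config sketch. The dictionary keys and layout are ours; an actual LibMTL run would pass these values as command-line arguments or config fields rather than this structure:

```python
# Hyperparameters as quoted in the experiment setup (dataset keys are
# our own labels, not identifiers from the paper or LibMTL).
SETUPS = {
    "multi_mnist": {"optimizer": "SGD",  "lr": 1e-3, "momentum": 0.9,
                    "batch_size": 256, "epochs": 100,
                    "losses": ["cross-entropy"]},
    "nyuv2":       {"optimizer": "Adam", "lr": 1e-4, "weight_decay": 1e-5,
                    "batch_size": 8,
                    "losses": ["cross-entropy", "L1", "cosine"]},
    "xtreme":      {"optimizer": "Adam", "lr": 2e-5, "weight_decay": 1e-8,
                    "batch_size": 32,
                    "losses": ["cross-entropy"]},
    "office":      {"optimizer": "Adam", "lr": 1e-4, "weight_decay": 1e-5,
                    "batch_size": 128,
                    "losses": ["cross-entropy"]},
}
```

Note the settings the quote does not pin down (e.g., learning-rate schedules or epoch counts for the non-image benchmarks) are left out rather than guessed.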