Can Optimization Trajectories Explain Multi-Task Transfer?

Authors: David Mueller, Mark Dredze, Nicholas Andrews

TMLR 2025

Reproducibility checklist — each row gives a variable, its result, and the supporting LLM response:
Research Type: Experimental. In this work, we seek to improve our understanding of these failures by empirically studying how MTL impacts the optimization of tasks, and whether this impact can explain the effects of MTL on generalization. We show that MTL results in a generalization gap (a gap in generalization at comparable training loss) between single-task and multi-task trajectories early in training. However, we find that factors of the optimization trajectory previously proposed to explain generalization gaps in single-task settings cannot explain the generalization gaps between single-task and multi-task models. Moreover, we show that the amount of gradient conflict between tasks is correlated with negative effects on task optimization, but is not predictive of generalization. Our work sheds light on the underlying causes of failures in MTL and, importantly, raises questions about the role of general-purpose multi-task optimization algorithms. We release code for all of our experiments and analysis here: https://github.com/davidandym/Multi-Task-Optimization
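The gradient conflict mentioned above is commonly measured as the cosine similarity between per-task gradients, with negative values indicating opposed update directions. The sketch below illustrates that standard measure; the function name and the use of flat NumPy vectors are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def gradient_conflict(g_a, g_b):
    """Cosine similarity between two task gradients.

    Values near -1 indicate strongly conflicting update directions;
    values near +1 indicate aligned directions.
    """
    g_a = np.asarray(g_a, dtype=float)
    g_b = np.asarray(g_b, dtype=float)
    return float(g_a @ g_b / (np.linalg.norm(g_a) * np.linalg.norm(g_b)))

# Opposed gradients conflict maximally:
print(gradient_conflict([1.0, 0.0], [-1.0, 0.0]))  # -1.0
```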
Researcher Affiliation: Academia. David Mueller (EMAIL), Department of Computer Science, Johns Hopkins University; Mark Dredze (EMAIL), Department of Computer Science, Johns Hopkins University; Nicholas Andrews (EMAIL), Department of Computer Science, Johns Hopkins University.
Pseudocode: No. The paper describes methodologies and calculations in narrative text and mathematical formulas, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured, step-by-step instructions.
Open Source Code: Yes. We release code for all of our experiments and analysis here: https://github.com/davidandym/Multi-Task-Optimization
Open Datasets: Yes. Fashion MTL is a synthetic MTL setting that we construct from the Fashion MNIST task (Xiao et al., 2017)... The MNISTS multi-task setting (Hsieh & Chen, 2018) consists of 3 MNIST-like tasks: MNIST (LeCun et al., 1998)... Fashion MNIST (Xiao et al., 2017)... CIFAR-100 (Krizhevsky, 2012)... CelebA (Liu et al., 2015)... The Cityscapes (Cordts et al., 2016) dataset... The GLUE dataset (Wang et al., 2018) is a benchmark of 8 NLP tasks.
Dataset Splits: Yes. Fashion MTL: We randomly split the Fashion MNIST dataset into two splits of 25,000 samples each... For each task we have 5,000 validation and test samples. MNISTS: Each task has exactly 50,000 training samples and 5,000 validation and test samples. CIFAR-100: Each task consists of 5,000 training samples (roughly 1,000 samples per class), and 500 validation and test samples. CelebA: Each task consists of 162,700 training samples with 19,867 validation and 19,962 test samples. Cityscapes: The dataset consists of 2,975 train images, 500 validation images, and 1,525 test images.
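The Fashion MTL split described above (two disjoint 25,000-sample training splits plus 5,000 validation samples per task, drawn from Fashion MNIST's 60,000 training images) can be sketched as a simple index partition. The seed and the exact partition order below are assumptions for illustration, not the authors' procedure:

```python
import numpy as np

# Partition Fashion MNIST's 60,000 training indices into two tasks,
# each with 25,000 training and 5,000 validation samples. (Test
# samples would come from the separate 10,000-image test set.)
rng = np.random.default_rng(0)  # seed is an assumption
idx = rng.permutation(60_000)

task_a_train, task_b_train = idx[:25_000], idx[25_000:50_000]
task_a_val,   task_b_val   = idx[50_000:55_000], idx[55_000:60_000]
```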
Hardware Specification: No. The paper does not explicitly describe the hardware used to run its experiments; it does not mention specific GPU models, CPU models, or other computing resources.
Software Dependencies: No. The paper mentions using the Adam optimizer, pre-trained RoBERTa-Base, and the DeepLabV3 architecture, but it does not provide specific version numbers for any software libraries, frameworks, or languages used in the experiments.
Experiment Setup: Yes. For every training trajectory we study, we consider 3 random seeds after selecting hyper-parameters based on the best validation performance from an initial hyper-parameter sweep. To maintain comparability of individual task trajectories, single-task and multi-task models within a single MTL setting are trained for the same number of steps, with the same optimizer and C (the scaling factor) set equal to 1. For all settings, we conduct a hyper-parameter sweep over the learning rate {1e-1, 5e-1, 1e-2, 5e-2, 1e-3, 5e-3, 1e-4, 5e-4, 1e-5} and batch size {4, 16, 32, 64, 128, 256}. We use the Adam optimizer in all settings... In all settings we use a constant learning rate (no decay). We set |B| = 128 and ρ = 1e-3, and we truncate Sk to be of size 2048 [for the sharpness calculation]. In practice, we calculate the FIM trace over a mini-batch of size |B| = 16... Additionally, we truncate the size of the datasets Sk to be of size 2048.
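The mini-batch FIM-trace estimate mentioned in the setup is commonly computed as the average squared norm of per-example log-likelihood gradients (the empirical Fisher). The sketch below shows this for a toy logistic-regression model; the model, the synthetic data, and the function name are assumptions for illustration, not the paper's actual architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fim_trace_estimate(theta, X, y):
    """Empirical-Fisher trace estimate over a mini-batch:
    the mean squared norm of per-example gradients of log p(y|x; theta).
    """
    p = sigmoid(X @ theta)
    # For logistic regression, the per-example gradient of the
    # log-likelihood is (y - p) * x.
    per_example_grads = (y - p)[:, None] * X
    return float(np.mean(np.sum(per_example_grads ** 2, axis=1)))

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 5))   # |B| = 16, matching the paper's mini-batch size
y = rng.integers(0, 2, size=16).astype(float)
theta = np.zeros(5)

trace_hat = fim_trace_estimate(theta, X, y)  # non-negative scalar
```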