Multitask Online Mirror Descent

Authors: Nicolò Cesa-Bianchi, Pierre Laforgue, Andrea Paudice, Massimiliano Pontil

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce and analyze MT-OMD, a multitask generalization of Online Mirror Descent (OMD) which operates by sharing updates between tasks. We prove that the regret of MT-OMD is of order √(1 + σ²(N − 1)) √T, where σ² is the task variance according to the geometry induced by the regularizer, N is the number of tasks, and T is the time horizon. [...] Finally, we present experiments which support our theoretical findings. [...] In this section, we empirically compare the performance of Hedge-MT-OGD/EG against two natural alternatives: an independent-task approach (IT-OGD/EG) where the agents do not communicate, and a single-task approach (ST-OGD/EG) where a single model is learned and shared by all agents. [...] Online Gradient Descent. For this experiment, we use the Lenk dataset (Lenk et al., 1996; Argyriou et al., 2007). [...] Exponentiated Gradient. For our second experiment, we consider EMNIST, a classification dataset...
Researcher Affiliation | Academia | Nicolò Cesa-Bianchi (EMAIL), Department of Computer Science, University of Milan, Italy; Pierre Laforgue (EMAIL), Department of Computer Science, University of Milan, Italy; Andrea Paudice (EMAIL), Department of Computer Science, University of Milan, Italy, and Istituto Italiano di Tecnologia, Genova, Italy; Massimiliano Pontil (EMAIL), Istituto Italiano di Tecnologia, Genova, Italy, and University College London, United Kingdom
Pseudocode | Yes | Algorithm 1 (MT-OGD) input: a positive definite matrix A ∈ ℝ^{N×N} (interaction matrix), σ² > 0 (task variance), D > 0 (comparator-set diameter), η_t > 0 (learning-rate schedule); initialization: A ← A ⊗ I_d ∈ ℝ^{Nd×Nd}, x₁ = 0, y₁ = A^{1/2} x₁. Algorithm 2 (MT-EG) input: a positive definite matrix A ∈ ℝ^{N×N} (interaction matrix), η > 0 (learning rate); initialization: A ← A ⊗ I_d ∈ ℝ^{Nd×Nd}, x₁ = (1/d) 1 ∈ ℝ^{Nd}, y₁ = A^{1/2} x₁.
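To make the pseudocode above concrete, here is a minimal sketch (our own illustration, not the authors' code). It assumes the Euclidean case, where running mirror descent in the transformed space y = A^{1/2} x reduces to a preconditioned gradient step x ← x − η A⁻¹ g, so each task's update mixes the gradients of all tasks; the function names and the omission of the projection step are our simplifications.

```python
import numpy as np

def clique_matrix(N, b):
    """Parameterized clique interaction matrix A(b) = (1 + b) I_N - b 11^T / N."""
    return (1.0 + b) * np.eye(N) - b * np.ones((N, N)) / N

def mt_ogd_step(X, G, A, eta):
    """One (unprojected) multitask OGD step, sketched in the Euclidean case.

    X: (N, d) array, row n is task n's current model.
    G: (N, d) array, row n is task n's gradient.
    The step is preconditioned by A^{-1}, so updates are shared across tasks.
    """
    return X - eta * np.linalg.solve(A, G)
```

For b = 0, A(0) = I_N and the step decouples into N independent gradient steps (the IT-OGD baseline); larger b couples the tasks more strongly.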
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets | Yes | Online Gradient Descent. For this experiment, we use the Lenk dataset (Lenk et al., 1996; Argyriou et al., 2007). It consists of 2880 computer ratings in the range {1, 2, ..., 10}, made by 180 individuals (the tasks) on the basis of 14 binary features. [...] Exponentiated Gradient. For our second experiment, we consider EMNIST, a classification dataset consisting of 62 classes (images of digits, small and capital letters).
Dataset Splits | No | For the Lenk dataset, the paper describes its composition ('2880 computer ratings', '180 individuals') but does not specify how it was split into training, validation, or test sets. For the EMNIST dataset, it states: 'To each task, we assigned 10 examples (5 positive, 5 negative) randomly chosen from the set of examples for that task.' This describes the task-specific data selection but not a standard train/test/validation split for the overall experiment.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models, or memory specifications.
Software Dependencies | No | The paper does not explicitly list any software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup | Yes | Online Gradient Descent. For this experiment, we use the Lenk dataset [...]. We run Hedge-MT-OGD using the clique interaction matrix A = (1 + N)I_N − 11ᵀ and the square loss. For all algorithms, the value of η is set according to the optimal theoretical value, see Proposition 2. In Hedge-MT-OGD, the variance σ² is learned in a set of 5 experts uniformly spaced over [0, 1]. For simplicity, we use D = 1 and compute the resulting Lipschitz constant accordingly. [...] Exponentiated Gradient. For our second experiment, we consider EMNIST, a classification dataset [...]. We considered the linear logistic regression and ran Hedge-MT-EG with the parameterized clique interaction matrix A(b) = (1 + b)I_N − b 11ᵀ/N. The value of b is set according to the theoretical value (that depends on σ², see Proposition 8), while σ² is learned in a set of 5 experts uniformly spaced over [0, 1]. For all algorithms, the value of η is set according to the optimal theoretical values.
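The tuning of the unknown task variance described above (5 experts uniformly spaced over [0, 1], aggregated by a Hedge meta-algorithm) can be sketched as follows. The exponential-weights update is standard, but the learning rate and variable names here are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Candidate task variances: 5 experts uniformly spaced over [0, 1],
# each running the base algorithm (MT-OGD or MT-EG) with its own sigma^2.
sigma2_grid = np.linspace(0.0, 1.0, 5)

def hedge_update(weights, losses, eta):
    """Exponential-weights (Hedge) update over the sigma^2 experts.

    weights: current probability vector over experts.
    losses:  per-expert losses observed this round.
    eta:     Hedge learning rate (illustrative choice, not the paper's).
    """
    w = weights * np.exp(-eta * losses)
    return w / w.sum()
```

At each round the forecaster combines the experts' predictions according to the current weights, then reweights; experts whose candidate σ² better matches the true task variance accumulate less loss and come to dominate the mixture.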