Multitask Online Mirror Descent
Authors: Nicolò Cesa-Bianchi, Pierre Laforgue, Andrea Paudice, Massimiliano Pontil
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce and analyze MT-OMD, a multitask generalization of Online Mirror Descent (OMD) which operates by sharing updates between tasks. We prove that the regret of MT-OMD is of order √(1 + σ²(N − 1)) √T, where σ² is the task variance according to the geometry induced by the regularizer, N is the number of tasks, and T is the time horizon. [...] Finally, we present experiments which support our theoretical findings. [...] In this section, we empirically compare the performance of Hedge-MT-OGD/EG against two natural alternatives: an independent-task approach (IT-OGD/EG) where the agents do not communicate, and a single-task approach (ST-OGD/EG) where a single model is learned and shared by all agents. [...] Online Gradient Descent. For this experiment, we use the Lenk dataset (Lenk et al., 1996; Argyriou et al., 2007). [...] Exponentiated Gradient. For our second experiment, we consider EMNIST, a classification dataset... |
| Researcher Affiliation | Academia | Nicolò Cesa-Bianchi EMAIL (Department of Computer Science, University of Milan, Italy); Pierre Laforgue EMAIL (Department of Computer Science, University of Milan, Italy); Andrea Paudice EMAIL (Department of Computer Science, University of Milan, Italy; Istituto Italiano di Tecnologia, Genova, Italy); Massimiliano Pontil EMAIL (Istituto Italiano di Tecnologia, Genova, Italy; University College London, United Kingdom) |
| Pseudocode | Yes | Algorithm 1 MT-OGD — input: a positive definite matrix A ∈ ℝ^{N×N} (interaction matrix), σ² > 0 (task variance), D > 0 (comparator set diameter), η_t > 0 (learning-rate schedule); initialization: A = A ⊗ I_d ∈ ℝ^{Nd×Nd}, x₁ = 0, y₁ = A^{1/2}x₁. Algorithm 2 MT-EG — input: a positive definite matrix A ∈ ℝ^{N×N} (interaction matrix), η > 0 (learning rate); initialization: A = A ⊗ I_d ∈ ℝ^{Nd×Nd}, x₁ = 1/d ∈ ℝ^{Nd}, y₁ = A^{1/2}x₁. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Online Gradient Descent. For this experiment, we use the Lenk dataset (Lenk et al., 1996; Argyriou et al., 2007). It consists of 2880 computer ratings in the range {1, 2, …, 10}, made by 180 individuals (the tasks) on the basis of 14 binary features. [...] Exponentiated Gradient. For our second experiment, we consider EMNIST, a classification dataset consisting of 62 classes (images of digits, small and capital letters). |
| Dataset Splits | No | For the Lenk dataset, the paper describes its composition ('2880 computer ratings', '180 individuals') but does not specify how it was split into training, validation, or test sets. For the EMNIST dataset, it states: 'To each task, we assigned 10 examples (5 positive, 5 negative) randomly chosen from the set of examples for that task.' This describes the task-specific data selection but not a standard train/test/validation split for the overall experiment. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models, or memory specifications. |
| Software Dependencies | No | The paper does not explicitly list any software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | Online Gradient Descent. For this experiment, we use the Lenk dataset [...]. We run Hedge-MT-OGD using the clique interaction matrix A = (1 + N)I_N − 11ᵀ and the square loss. For all algorithms, the value of η is set according to the optimal theoretical value, see Proposition 2. In Hedge-MT-OGD, the variance σ² is learned in a set of 5 experts uniformly spaced over [0, 1]. For simplicity, we use D = 1 and compute the resulting Lipschitz constant accordingly. [...] Exponentiated Gradient. For our second experiment, we consider EMNIST, a classification dataset [...]. We considered linear logistic regression and ran Hedge-MT-EG with the parameterized clique interaction matrix A(b) = (1 + b)I_N − (b/N)11ᵀ. The value of b is set according to the theoretical value (that depends on σ², see Proposition 8), while σ² is learned in a set of 5 experts uniformly spaced over [0, 1]. For all algorithms, the value of η is set according to the optimal theoretical values. |
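The MT-OGD pseudocode summarized above (clique interaction matrix A, Kronecker lift A = A ⊗ I_d, updates in the transformed space y = A^{1/2}x) can be sketched as follows. This is a hypothetical minimal reconstruction from the quantities named in the report, not the authors' implementation; the function names `clique_matrix`, `matrix_sqrt`, and `mt_ogd` are ours, and the sketch omits the projection step and the Hedge layer over σ² experts.

```python
import numpy as np

def clique_matrix(N):
    """Clique interaction matrix A = (1 + N) I_N - 1 1^T (positive definite)."""
    return (1 + N) * np.eye(N) - np.ones((N, N))

def matrix_sqrt(M):
    """Symmetric square root of a positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(w)) @ V.T

def mt_ogd(grads, N, d, eta):
    """Run unprojected MT-OGD on a stream of stacked gradients g_t in R^{Nd}.

    Updates live in the transformed space y = A^{1/2} x with A = A_N (x) I_d;
    by the chain rule, the gradient with respect to y is A^{-1/2} g_t.
    """
    A = np.kron(clique_matrix(N), np.eye(d))   # A = A_N (x) I_d in R^{Nd x Nd}
    A_half_inv = np.linalg.inv(matrix_sqrt(A))
    y = np.zeros(N * d)                        # y_1 = A^{1/2} x_1 with x_1 = 0
    xs = []
    for g in grads:
        xs.append(A_half_inv @ y)              # recover x_t = A^{-1/2} y_t
        y = y - eta * A_half_inv @ g           # plain OGD step in y-space
    return np.array(xs)
```

Eliminating y, the effective iteration is x_{t+1} = x_t − η A^{−1} g_t, i.e. gradient descent preconditioned by the interaction matrix, which is how update sharing between tasks enters the method.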
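The EG experiment row mentions the parameterized clique interaction matrix A(b) = (1 + b)I_N − (b/N)11ᵀ. A small sketch (our own helper, named `parameterized_clique` for illustration) makes its role concrete: b = 0 recovers the identity, i.e. fully independent tasks, while larger b couples the tasks more strongly, and A(b) stays positive definite for any b ≥ 0.

```python
import numpy as np

def parameterized_clique(N, b):
    """Parameterized clique interaction matrix A(b) = (1 + b) I_N - (b/N) 1 1^T.

    Eigenvalues: 1 on the all-ones direction, 1 + b on its orthogonal
    complement, so A(b) is positive definite for b >= 0.
    """
    return (1 + b) * np.eye(N) - (b / N) * np.ones((N, N))
```

In the paper's setup, b is then set from the task variance σ² (Proposition 8) rather than tuned by hand, with σ² itself learned over a grid of 5 experts.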