The Benefit of Multitask Representation Learning

Authors: Andreas Maurer, Massimiliano Pontil, Bernardino Romera-Paredes

JMLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The purpose of the experiments is to compare MTL and LTL to independent task learning (ITL) in the simple setting of linear feature learning (or subspace learning). We wish to study the regime in which MTL/LTL learning is beneficial over ITL as a function of the number of tasks T and the sample size per task n. We consider noiseless linear binary classification tasks, namely halfspace learning. We generated the data in the following way. The ground-truth weight vectors u_1, ..., u_T are obtained by the equation u_t = Dc_t, where c_t ∈ R^K is sampled from the uniform distribution on the unit sphere in R^K, and the dictionary D ∈ R^{d×K} is created by first sampling a d-dimensional orthonormal matrix from the Haar measure and then selecting its first K columns (atoms). We create all input marginals by sampling from the uniform distribution on the radius-√d sphere in R^d. For each task we sample n instances to build the training set and 1000 instances for the test set. We train the methods with the hinge loss function h(z) := max{0, 1 − z/c}, where c is the margin. We choose c = 2/ϵ, so that the true error relative to the best hypothesis is of order ϵ. We fixed the value of ϵ to be (K/n)^{1/2}. For ITL we optimize that loss function constraining the ℓ2-norm of the weights; for MTL and LTL we constrain D to have a Frobenius norm less than or equal to 1, and each c_t is constrained to have an ℓ2-norm less than or equal to 1. During testing we use the 0-1 loss. For example, the task-average error is evaluated as (1/T) Σ_{t=1}^{T} (1/m) Σ_{i=1}^{m} 1{sign(⟨u_t, x_i⟩) ≠ sign(⟨û_t, x_i⟩)} (11), where û_t are the weight vectors learned by the assessed method and m = 1000 is the number of test instances per task.
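The generation process quoted above can be sketched in NumPy. This is a minimal illustration, not the authors' code: the function name `generate_tasks` and the use of QR decomposition of a Gaussian matrix to obtain a Haar-distributed orthonormal matrix are assumptions, as is interpreting the input distribution as uniform on the sphere of radius √d.

```python
import numpy as np

def generate_tasks(d=50, K=2, T=10, n=20, n_test=1000, rng=None):
    """Sketch of the synthetic halfspace-learning tasks described above.

    Assumption: QR of a Gaussian matrix yields a Haar-distributed
    orthonormal matrix; sphere sampling is done via normalized Gaussians.
    """
    rng = np.random.default_rng(rng)
    # Haar orthonormal matrix; keep the first K columns as the dictionary.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    D = Q[:, :K]
    # Task coefficients c_t uniform on the unit sphere in R^K.
    C = rng.standard_normal((T, K))
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    # Ground-truth weights u_t = D c_t (unit norm, since D has orthonormal columns).
    U = C @ D.T

    def sample_inputs(m):
        # Uniform on the radius-sqrt(d) sphere in R^d (interpretation of the text).
        X = rng.standard_normal((T, m, d))
        X *= np.sqrt(d) / np.linalg.norm(X, axis=2, keepdims=True)
        return X

    X_train, X_test = sample_inputs(n), sample_inputs(n_test)
    # Noiseless labels: sign of the inner product with the task's weight vector.
    y_train = np.sign(np.einsum('tmd,td->tm', X_train, U))
    y_test = np.sign(np.einsum('tmd,td->tm', X_test, U))
    return D, U, X_train, y_train, X_test, y_test
```

The per-task structure (arrays indexed by task t first) mirrors the paper's setup of T tasks sharing one dictionary D.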
Researcher Affiliation | Academia | Andreas Maurer (EMAIL), Adalbertstrasse 55, D-80799 München, Germany; Massimiliano Pontil (EMAIL), Istituto Italiano di Tecnologia, 16163 Genoa, Italy, and Department of Computer Science, University College London, WC1E 6BT, UK; Bernardino Romera-Paredes (EMAIL), Department of Engineering Science, University of Oxford, OX1 3PJ, UK
Pseudocode | No | The paper describes methods and algorithms conceptually (e.g., "Multitask representation learning (MTRL) solves the optimization problem"), but it does not provide any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code used for the experiments presented in this section is available at http://romera-paredes.com/multitask-representation.
Open Datasets | No | The paper describes a process for generating synthetic data for its experiments: "We generated the data in the following way. The ground-truth weight vectors u_1, ..., u_T are obtained by the equation u_t = Dc_t, where c_t ∈ R^K is sampled from the uniform distribution on the unit sphere in R^K, and the dictionary D ∈ R^{d×K} is created by first sampling a d-dimensional orthonormal matrix from the Haar measure and then selecting its first K columns (atoms). We create all input marginals by sampling from the uniform distribution on the radius-√d sphere in R^d." It does not mention using any pre-existing publicly available dataset, nor does it provide a link to the generated data.
Dataset Splits | Yes | For each task we sample n instances to build the training set and 1000 instances for the test set. We let d = 50 and vary T ∈ {5, 10, ..., 150} and n ∈ {5, 10, ..., 150}, considering the cases K = 2 and K = 5.
Hardware Specification | No | The paper does not specify any particular hardware used for running the numerical experiments. It discusses experimental settings and results but omits details about the computing resources.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. It mentions general machine learning approaches like neural networks, kernel methods, and convex optimization, but no specific libraries or tools with versions.
Experiment Setup | Yes | We train the methods with the hinge loss function h(z) := max{0, 1 − z/c}, where c is the margin. We choose c = 2/ϵ, so that the true error relative to the best hypothesis is of order ϵ. We fixed the value of ϵ to be (K/n)^{1/2}. For ITL we optimize that loss function constraining the ℓ2-norm of the weights; for MTL and LTL we constrain D to have a Frobenius norm less than or equal to 1, and each c_t is constrained to have an ℓ2-norm less than or equal to 1. During testing we use the 0-1 loss.
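The ITL baseline in the quoted setup (hinge loss with an ℓ2-norm constraint on the weights) can be sketched with projected subgradient descent. This is an assumed optimizer, not the one used in the paper, and the function name `itl_train` and its step-size/epoch parameters are illustrative choices.

```python
import numpy as np

def itl_train(X, y, radius=1.0, lr=0.1, epochs=300, margin=1.0):
    """Minimize the average hinge loss h(z) = max{0, 1 - z/margin}
    over one task's data, subject to ||w||_2 <= radius.

    Projected subgradient descent sketch; the paper's exact solver
    is not specified.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        z = y * (X @ w)
        active = z < margin  # examples with nonzero hinge loss
        # Subgradient of the mean hinge loss with respect to w.
        g = -(y[active, None] * X[active]).sum(axis=0) / (margin * n)
        w -= lr * g
        # Project back onto the l2 ball of the given radius.
        norm = np.linalg.norm(w)
        if norm > radius:
            w *= radius / norm
    return w
```

At test time the 0-1 error of a learned `w` on held-out pairs `(X_test, y_test)` is simply `np.mean(np.sign(X_test @ w) != y_test)`, matching the evaluation described in the quote.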