A Reliable Effective Terascale Linear Learning System
Authors: Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, John Langford
JMLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we will present the empirical evaluation of the system described so far. We begin by describing the datasets used, before evaluating the various properties of our framework both from a systems as well as a machine learning perspective. |
| Researcher Affiliation | Industry | Alekh Agarwal (EMAIL), Microsoft Research, 641 Avenue of the Americas, New York, NY 10011; Olivier Chapelle (EMAIL), Criteo, 411 High Street, Palo Alto, CA 94301; Miroslav Dudík (EMAIL) and John Langford (EMAIL), Microsoft Research, 641 Avenue of the Americas, New York, NY 10011 |
| Pseudocode | Yes | Algorithm 1 (Stochastic gradient descent on a single node using adaptive gradient updates; Duchi et al., 2010; McMahan and Streeter, 2010). Require: invariance update function s (see Karampatziakis and Langford, 2011). Initialize w = 0, G = I. For all (x, y) in the training set: g ← ∇_w ℓ(w⊤x; y); w ← w − s(w, x, y) G^(−1/2) g; G_jj ← G_jj + g_j² for all j = 1, ..., d. Algorithm 2 (Sketch of the proposed learning architecture). Require: data split across nodes. For all nodes k: w_k = result of stochastic gradient descent on the data of node k using Algorithm 1. Compute the weighted average w̄ as in (2) using AllReduce. Start a preconditioned L-BFGS optimization from w̄. For t = 1, ..., T, on all nodes k: compute g_k, the (local batch) gradient of examples on node k; compute g = Σ_{k=1}^{m} g_k using AllReduce; add the regularization part of the gradient; take an L-BFGS step. |
| Open Source Code | Yes | Our implementation is available as part of the open source project Vowpal Wabbit (Langford et al., 2007) and is summarized in Algorithm 2. |
| Open Datasets | Yes | The second dataset comes from the task of recognizing a human acceptor splice site (Sonnenburg and Franc, 2010). We consider this learning task because this is, as far as we know, the largest public data set for which subsampling is not an effective learning strategy. |
| Dataset Splits | Yes | We follow the same experimental protocol as in Sonnenburg and Franc (2010): we use the same training and test sets of respectively 50M and 4.6M samples. |
| Hardware Specification | No | We present a system and a set of techniques for learning linear predictors with convex losses on terascale data sets, with trillions of features, billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. |
| Software Dependencies | No | MapReduce (Dean and Ghemawat, 2008) and its open source implementation Hadoop have become the overwhelmingly favorite platforms for distributed data processing. |
| Experiment Setup | Yes | We ran 5 iterations of L-BFGS on the splice site data with 1000 nodes. On each node, we recorded, for every iteration, the time spent in AllReduce and the computing time, defined as the time not spent in AllReduce. The regularization coefficient was chosen from a small set to obtain the best test performance. We used stochastic gradient descent with the learning rate in the t-th iteration set to η_t = 1/(L + γ√t); we tuned γ and L on a small subset of the data set. They suggest global minibatch sizes of no more than b ∝ √n. On m nodes, each node accumulates gradients from b/m examples, then an AllReduce operation is carried out, yielding the mini-batch gradient, and each node performs a stochastic gradient update with the learning rate of the form... |
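The per-node training step (Algorithm 1 in the table above) can be sketched as a short NumPy routine. This is a minimal single-machine illustration, not the Vowpal Wabbit implementation: the importance-invariant step-size function `s` of Karampatziakis and Langford (2011) is stubbed as a constant, and `loss_grad` (the derivative of the loss with respect to the prediction) is an assumed helper.

```python
import numpy as np

def adagrad_sgd(examples, d, loss_grad, s=lambda w, x, y: 0.5):
    """Sketch of Algorithm 1: per-coordinate adaptive-gradient SGD.

    `loss_grad(p, y)` is dℓ/dp at the prediction p = w·x.
    `s(w, x, y)` stands in for the invariance update function of
    Karampatziakis and Langford (2011); a constant here (an assumption).
    """
    w = np.zeros(d)   # w = 0
    G = np.ones(d)    # G = I, kept as its diagonal
    for x, y in examples:
        p = w @ x
        g = loss_grad(p, y) * x             # g = grad_w l(w.x; y)
        w -= s(w, x, y) * g / np.sqrt(G)    # w <- w - s(w,x,y) G^{-1/2} g
        G += g ** 2                         # G_jj <- G_jj + g_j^2
    return w
```

For squared loss, `loss_grad = lambda p, y: p - y`; per-coordinate step sizes then shrink automatically on frequently active features, which is the point of the adaptive-gradient update.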
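The distributed mini-batch scheme described in the setup row (each of m nodes accumulates gradients over b/m examples, an AllReduce sums them into the global mini-batch gradient, and every node takes the identical step with η_t = 1/(L + γ√t)) can be simulated in a single process. The shard loop below stands in for the m nodes and `np.sum` over local gradients stands in for AllReduce; squared loss and all default constants are assumptions of this sketch.

```python
import numpy as np

def minibatch_sgd_simulated(X, y, m=4, b=64, L=1.0, gamma=1.0, epochs=30):
    """Single-process simulation of distributed mini-batch SGD
    (squared loss). Each "node" averages gradients over its b/m
    examples; summing the local gradients plays the role of AllReduce;
    all nodes then apply the same update with eta_t = 1/(L + gamma*sqrt(t))."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for start in range(0, n - b + 1, b):
            t += 1
            shards = np.array_split(np.arange(start, start + b), m)
            # local gradient on each node, normalized so the sum is the
            # mean gradient over the full mini-batch of b examples
            local = [X[idx].T @ (X[idx] @ w - y[idx]) / b for idx in shards]
            g = np.sum(local, axis=0)                 # AllReduce: sum over nodes
            eta = 1.0 / (L + gamma * np.sqrt(t))      # eta_t = 1/(L + gamma*sqrt(t))
            w -= eta * g                              # identical step on every node
    return w
```

Because every node ends each round with the same summed gradient and applies the same learning rate, the weight vectors stay synchronized across nodes without any further communication, which is what makes the AllReduce primitive sufficient here.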