Second-Order Stochastic Optimization for Machine Learning in Linear Time
Authors: Naman Agarwal, Brian Bullins, Elad Hazan
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we present experimental evaluation for our theoretical results. We perform the experiments for a classification task over two labels using the logistic regression (LR) objective function with the ℓ2 regularizer. For all of the classification tasks we choose two values of λ: 1/m and 10/m, where m is the number of training examples. We perform the above classification tasks over four data sets: MNIST, CoverType, Mushrooms, and RealSIM. Figure 2 displays the log-error achieved by LiSSA as compared to two standard first-order algorithms, SVRG and SAGA (Johnson and Zhang, 2013; Defazio et al., 2014), in terms of the number of passes over the data. Figure 3 presents the performance of LiSSA as compared to NewSamp (Erdogdu and Montanari, 2015) and standard Newton's method with respect to both time and iterations. |
| Researcher Affiliation | Academia | Naman Agarwal, Brian Bullins, and Elad Hazan: Computer Science Department, Princeton University, Princeton, NJ 08540, USA |
| Pseudocode | Yes | Algorithm 1: LiSSA (Linear (time) Stochastic Second-Order Algorithm); Algorithm 2: LiSSA-Quad; Algorithm 3: Repeated Halving; Algorithm 4: Fast Quadratic Solver (FQS); Algorithm 5: Ellipsoid Cover; Algorithm 6: N-SVRG |
| Open Source Code | Yes | The paper's footnote 7 states: 'Our code for LiSSA can be found here: https://github.com/brianbullins/lissa_code.' |
| Open Datasets | Yes | We perform the above classification tasks over four data sets: MNIST, CoverType, Mushrooms, and RealSIM. Table 2 provides details regarding the data sets chosen for the experiments: MNIST4-9 (m = 11791, d = 784; LeCun and Cortes, 1998), Mushrooms (m = 8124, d = 112; Lichman, 2013), CoverType (m = 100000, d = 54; Blackard and Dean, 1999), and RealSIM (m = 72309, d = 20958; McCallum, 1997). |
| Dataset Splits | No | The paper mentions scaling data points to unit norm in Section 7.1 ('To make sure our functions are scaled such that the norm of the Hessian is bounded, we scale the above data set points to unit norm.') but does not provide specific details on how the datasets (MNIST, CoverType, Mushrooms, RealSIM) were split into training, validation, or test sets. |
| Hardware Specification | No | The paper describes experiments and compares performance across various algorithms and datasets, including plotting 'log(Error) as a function of the time elapsed' in Figure 4. However, it does not specify any particular hardware components such as CPU models, GPU models, memory specifications, or details of the computing environment used for these experiments. |
| Software Dependencies | No | The paper discusses various algorithms such as SVRG, SAGA, AdaGrad, BFGS, gradient descent, and SGD, implying that they were implemented. However, it does not name any specific software with version numbers (e.g., a programming language such as Python with its version, or libraries/frameworks such as PyTorch or TensorFlow with versions). |
| Experiment Setup | Yes | In Section 7.2, under 'Experiment Details' and 'Choice of Parameters for LiSSA', the paper states: 'For all of the classification tasks we choose two values of λ: 1/m and 10/m, where m is the number of training examples.'; 'We set the number of inner iterations of SVRG to 2m for the case when λ = 1/m, and we parameter tune the number of inner iterations when λ = 10/m. The stepsizes for all of the algorithms are parameter tuned by an exhaustive search over the parameters.'; 'For stochastic gradient descent, we use the variable step size ηt = γ/t which is usually the prescribed step size, and we hand tune the parameter γ.'; 'To pick the parameters for our algorithm, we observe that it exhibits smooth behavior even in the case of S1 = 1, so this is used for the experiments. However, we observe that increasing S2 has a positive effect on the convergence of the algorithm up to a certain point... As the theory predicts S2 to be of the order κ ln(κ), for our experiments we determine an estimate for κ and set S2 to around κ ln(κ).' |
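The experimental objective described above, ℓ2-regularized logistic regression over unit-norm examples with λ set to 1/m, can be sketched as follows. This is an illustrative reconstruction, not the authors' released code; the function names are our own:

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    """l2-regularized logistic regression loss over labels y in {-1, +1}.

    lam is the regularization weight; the paper uses 1/m and 10/m,
    where m is the number of training examples.
    """
    margins = y * (X @ w)
    # np.logaddexp(0, -z) computes log(1 + exp(-z)) stably
    loss = np.mean(np.logaddexp(0.0, -margins))
    return loss + 0.5 * lam * np.dot(w, w)

def logistic_gradient(w, X, y, lam):
    """Gradient of logistic_objective with respect to w."""
    m = X.shape[0]
    margins = y * (X @ w)
    # d/dz log(1 + exp(-z)) = -1 / (1 + exp(z))
    coeffs = -y / (1.0 + np.exp(margins))
    return X.T @ coeffs / m + lam * w
```

As the paper notes, the data points are first scaled to unit norm (e.g. `X /= np.linalg.norm(X, axis=1, keepdims=True)`) so that the Hessian's norm is bounded.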
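The LiSSA step that the S1/S2 parameters above control can be sketched as a truncated Neumann-series estimate of the Newton direction: S1 independent estimators are averaged, each built from S2 single-sample Hessian-vector products via the recursion u_j = g + (I − H_i) u_{j−1}. The sketch below, a hypothetical implementation for the ℓ2-regularized logistic objective (unit-norm rows keep the per-sample Hessian norm below 1), is ours, not the authors' code:

```python
import numpy as np

def lissa_step(w, X, y, lam, S1=1, S2=100, seed=None):
    """Estimate the Newton step H^{-1} g for l2-regularized logistic
    regression (labels in {-1, +1}) via LiSSA-style sampling.

    Averages S1 estimators; each runs S2 steps of
    u_j = g + (I - H_i) u_{j-1} with a single-sample Hessian H_i.
    Assumes unit-norm rows, so ||H_i|| <= 1/4 + lam < 1 for small lam.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    margins = y * (X @ w)
    g = X.T @ (-y / (1.0 + np.exp(margins))) / m + lam * w  # full gradient
    estimates = []
    for _ in range(S1):
        u = g.copy()  # u_0 = g
        for _ in range(S2):
            xi = X[rng.integers(m)]
            s = 1.0 / (1.0 + np.exp(-xi @ w))  # sigmoid(x_i . w)
            # single-sample Hessian-vector product: (s(1-s) x x^T + lam I) u
            hv = s * (1.0 - s) * xi * (xi @ u) + lam * u
            u = g + u - hv  # u_j = g + (I - H_i) u_{j-1}
        estimates.append(u)
    return np.mean(estimates, axis=0)  # approximates H^{-1} g
```

The update is then `w -= lissa_step(w, X, y, lam, S1, S2)`; the paper's guidance of S1 = 1 and S2 on the order of κ ln(κ) fits directly into these parameters.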