Maximum Principle Based Algorithms for Deep Learning

Authors: Qianxiao Li, Long Chen, Cheng Tai, Weinan E

JMLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we investigate the performance of E-MSA compared with the usual gradient-based approaches, namely stochastic gradient descent and its variants: Adagrad (Duchi et al., 2011) and Adam (Kingma and Ba, 2014). To illustrate key properties of E-MSA, we shall begin by investigating some synthetic examples. First, we consider a simple one-dimensional function approximation problem where we want to approximate F(x) = sin(x) for x ∈ [−π, π] using a continuous-time dynamical system. [...] Next, we consider a familiar supervised learning test problem: the MNIST data set (Le Cun, 1998) for handwritten digit recognition, with 55000 training samples and 10000 test samples. We employ a continuous dynamical system that resembles a (residual) convolution neural network (Le Cun and Bengio, 1995) when discretized. [...]
Researcher Affiliation | Academia | Qianxiao Li, Institute of High Performance Computing, Agency for Science, Technology and Research, 1 Fusionopolis Way, Connexis North, Singapore 138632; Long Chen, Peking University, Beijing, China, 100080; Cheng Tai, Beijing Institute of Big Data Research and Peking University, Beijing, China, 100080; Weinan E, Princeton University, Princeton, NJ 08544, USA, and Beijing Institute of Big Data Research and Peking University, Beijing, China, 100080
Pseudocode | Yes | Algorithm 1: Basic MSA
  1. Initialize: θ^0 ∈ U
  2. for k = 0 to #Iterations do
  3.   Solve Ẋ_t^{θ^k} = f(t, X_t^{θ^k}, θ_t^k), with X_0^{θ^k} = x
  4.   Solve Ṗ_t^{θ^k} = −∇_x H(t, X_t^{θ^k}, P_t^{θ^k}, θ_t^k), with P_T^{θ^k} = −∇Φ(X_T^{θ^k})
  5.   Set θ_t^{k+1} = arg max_{θ∈Θ} H(t, X_t^{θ^k}, P_t^{θ^k}, θ) for each t ∈ [0, T]
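The basic MSA loop above (forward state solve, backward costate solve, pointwise Hamiltonian maximization) can be sketched on a toy problem. This is a hypothetical scalar control task chosen for illustration, not one of the paper's experiments: dynamics ẋ = θ_t with Θ = [−1, 1], terminal loss Φ(x_T) = (x_T − target)², so H(t, x, p, θ) = p·θ and the maximization step has the closed form θ* = sign(p).

```python
import numpy as np

def basic_msa(x0=0.0, target=1.0, T=1.0, n_steps=20, n_iters=5):
    """Basic MSA sketch for a toy scalar control problem (illustrative only).

    Dynamics:    x' = theta_t, with theta_t in Theta = [-1, 1]
    Loss:        Phi(x_T) = (x_T - target)^2
    Hamiltonian: H(t, x, p, theta) = p * theta  (no running cost)
    """
    dt = T / n_steps
    theta = np.zeros(n_steps)              # step 1: initialize controls
    for _ in range(n_iters):
        # step 2: forward pass -- solve the state ODE with explicit Euler
        x = np.empty(n_steps + 1)
        x[0] = x0
        for t in range(n_steps):
            x[t + 1] = x[t] + dt * theta[t]
        # step 3: backward pass -- costate ODE p' = -dH/dx (= 0 here, since
        # f does not depend on x), terminal condition p_T = -Phi'(x_T)
        p = np.empty(n_steps + 1)
        p[-1] = -2.0 * (x[-1] - target)
        for t in reversed(range(n_steps)):
            p[t] = p[t + 1]
        # step 4: maximize H over Theta = [-1, 1]; H is linear in theta,
        # so the maximizer is sign(p_t); keep theta when p_t = 0
        theta = np.where(p[:-1] != 0, np.sign(p[:-1]), theta)
    x_T = x0 + dt * theta.sum()
    return theta, (x_T - target) ** 2

theta, loss = basic_msa()
```

With the target reachable, the bang-bang maximizer drives the terminal loss to (numerically) zero; the paper's point is that this unregularized maximization step can also diverge, which motivates the augmented Hamiltonian in E-MSA.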
Open Source Code | No | The paper does not explicitly state that source code for the described methodology is made available, nor does it provide a link to a code repository.
Open Datasets | Yes | Next, we consider a familiar supervised learning test problem: the MNIST data set (Le Cun, 1998) for handwritten digit recognition, with 55000 training samples and 10000 test samples. [...] As a further test, we train the same model on a different data set, the fashion MNIST data set (Xiao et al., 2017), where we again observe similar phenomena (see Figure 4).
Dataset Splits | Yes | First, we consider a simple one-dimensional function approximation problem where we want to approximate F(x) = sin(x) for x ∈ [−π, π] using a continuous-time dynamical system. [...] A training and test set of 1000 samples each are used. (a) Loss function vs iterations for a good initialization, where weights are initialized with truncated random normal variables with standard deviation 0.1 and biases are initialized as constants equal to 0.1. We see that E-MSA has a good convergence rate per iteration. (b) We use a poor initialization by setting all weights and biases to 0. We observe that gradient descent based methods tend to become stuck whereas E-MSA is better at escaping these slow manifolds, provided that ρ is well chosen (=1.0 in this case). Next, we consider a familiar supervised learning test problem: the MNIST data set (Le Cun, 1998) for handwritten digit recognition, with 55000 training samples and 10000 test samples.
Hardware Specification | No | Note that wall-clock times are compared on the CPU for fairness since we did not use a GPU implementation for the L-BFGS algorithm used to maximize the augmented Hamiltonian, hence the wall-clock time for E-MSA is expected to be improved. Nevertheless, we expect that more efficient Hamiltonian maximization algorithms must be developed for E-MSA to out-perform gradient-based methods in terms of wall-clock efficiency.
Software Dependencies | No | The paper mentions algorithms and methods like stochastic gradient descent, Adagrad, Adam, and the L-BFGS method, but does not provide specific version numbers for any software libraries, frameworks, or dependencies used in their implementation or experiments.
Experiment Setup | Yes | We apply E-MSA with discretization size δ = 0.25 (giving 20 layers) and compute the Hamiltonian maximization step using 10 iterations of the limited memory BFGS method (L-BFGS) (Liu and Nocedal, 1989). [...] We use a total of 10 layers (2 projections, 1 fully-connected and 7 residual layers with δ = 0.5, i.e. T = 3.5). The model is trained with mini-batch sizes of 100 using E-MSA and gradient-descent based methods, namely SGD, Adagrad, and Adam. For E-MSA, we approximately solve the Hamiltonian maximization step using 10 iterations of L-BFGS.
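The quoted setup replaces the exact arg-max in basic MSA with a few L-BFGS iterations on a penalized (augmented) Hamiltonian. A minimal sketch of one such parameter update, using SciPy's L-BFGS-B as a stand-in for the paper's solver: the dynamics `f`, the penalty form, and all variable names here are illustrative assumptions, and the simple proximal penalty weighted by ρ below is a simplified stand-in for the paper's actual augmented Hamiltonian.

```python
import numpy as np
from scipy.optimize import minimize

def f(theta, x):
    # Hypothetical one-parameter dynamics f(t, x, theta) = tanh(theta * x);
    # the paper's residual-network layers are more elaborate.
    return np.tanh(theta * x)

def hamiltonian(theta, x_t, p_t):
    # H(t, x, p, theta) = p . f(t, x, theta)  (no running cost assumed)
    return float(p_t @ f(theta, x_t))

def emsa_parameter_update(x_t, p_t, theta_old, rho=1.0, maxiter=10):
    """One E-MSA-style parameter update: approximately maximize a penalized
    Hamiltonian with a small, fixed number of L-BFGS iterations."""
    def neg_objective(theta):
        # Proximal penalty keeping theta near theta_old, weighted by rho;
        # NOTE: a simplified stand-in for the paper's augmented Hamiltonian.
        penalty = 0.5 * rho * float(np.sum((theta - theta_old) ** 2))
        return -(hamiltonian(theta, x_t, p_t) - penalty)

    res = minimize(neg_objective, theta_old, method="L-BFGS-B",
                   options={"maxiter": maxiter})
    return res.x

# Usage on random data (one layer/time slice, state dimension 5):
rng = np.random.default_rng(0)
x_t, p_t = rng.normal(size=5), rng.normal(size=5)
theta_old = np.zeros(1)
theta_new = emsa_parameter_update(x_t, p_t, theta_old, rho=1.0)
```

Since L-BFGS starts from `theta_old` and only improves the objective, the penalized Hamiltonian at `theta_new` is never worse than at `theta_old`, mirroring the per-step monotonicity the penalty term ρ is meant to enforce.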