An Inertial Newton Algorithm for Deep Learning

Authors: Camille Castera, Jérôme Bolte, Cédric Févotte, Edouard Pauwels

JMLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "From an empirical viewpoint, we show that INNA returns competitive results with respect to state of the art (stochastic gradient descent, ADAGRAD, ADAM) on popular deep learning benchmark problems. ... Section 5 describes experimental DL results on synthetic and real data sets (MNIST, CIFAR-10, CIFAR-100)."
Researcher Affiliation | Academia | Camille Castera (EMAIL), IRIT, Université de Toulouse, CNRS, Toulouse, France; Jérôme Bolte (EMAIL), Toulouse School of Economics, Université de Toulouse, Toulouse, France; Cédric Févotte (EMAIL), IRIT, Université de Toulouse, CNRS, Toulouse, France; Edouard Pauwels (EMAIL), IRIT, Université de Toulouse, CNRS, and DEEL, IRT Saint Exupéry, Toulouse, France.
Pseudocode | Yes | "INNA in its general and practical form is summarized in Table 1."
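For readers without the paper at hand, the two-variable iteration that the pseudocode summarizes can be sketched as follows. This is our reading of the discretized DIN dynamics, not the authors' released code; `inna_step` is a hypothetical helper name, and `grad` stands for the (possibly mini-batch) gradient estimate at `theta`:

```python
import numpy as np

def inna_step(theta, psi, grad, alpha, beta, gamma):
    """One INNA iteration (sketch).

    theta, psi : current iterates (arrays of equal shape)
    grad       : gradient estimate of J at theta
    alpha, beta: INNA hyperparameters (beta > 0)
    gamma      : step size for this iteration
    """
    # Both variables share the same geometric drift term ...
    common = (1.0 / beta - alpha) * theta - psi / beta
    psi_next = psi + gamma * common
    # ... while theta additionally moves along -beta * gradient.
    theta_next = theta + gamma * (common - beta * grad)
    return theta_next, psi_next
```

On a toy quadratic loss J(θ) = ½‖θ‖² (whose gradient is θ itself), iterating this update with a small constant step drives θ toward the minimizer, which is a quick sanity check that the sketch is well-posed.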
Open Source Code | Yes | "The INNA algorithm is available in Pytorch, Keras and Tensorflow: https://github.com/camcastera/Inna-for-DeepLearning/ (Castera, 2019)."
Open Datasets | Yes | "We train a DNN for classification using the three most common image data sets (MNIST, CIFAR-10, CIFAR-100) (LeCun et al., 1998; Krizhevsky, 2009)."
Dataset Splits | Yes | "We split the data sets into 50,000 images for training and 10,000 for testing."
Hardware Specification | No | The paper mentions that "Part of the numerical experiments were done using the OSIRIM platform of IRIT". However, it does not provide specific hardware details such as GPU/CPU models, memory, or other machine specifications.
Software Dependencies | Yes | "For these experiments, we used Keras 2.2.4 (Chollet, 2015) with Tensorflow 1.13.1 (Abadi et al., 2016) as backend."
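To reproduce this environment, the two stated versions could be pinned in a `requirements.txt` fragment such as the one below (our suggestion, not provided by the paper; note that TensorFlow 1.13 predates Python 3.8, so an older interpreter is required):

```
keras==2.2.4
tensorflow==1.13.1
```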
Experiment Setup | Yes | "At each iteration k, we compute the approximation of ∇J(θ) on a subset B_k ⊂ {1, ..., 50,000} of size 32. ... For the other two algorithms (INNA and SGD), we use the classical step-size schedule γ_k = γ_0 / √(k + 1). ... We choose this γ_0 using a grid search: for each algorithm we select the initial step-size that most decreases the training error J after fifteen epochs (one epoch consisting of a complete pass over the data). ... Given an initialization of the weights θ_0, we initialize ψ_0 such that the initial velocity is in the direction of −∇J(θ_0). More precisely, we use ψ_0 = (1 − αβ)θ_0 − (β² − β)∇J(θ_0)."
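The quoted schedule and initialization translate directly into code. The sketch below is ours, not the authors'; the signs in ψ_0 follow our reconstruction of the garbled formula, and `initial_velocity` encodes the drift term of the INNA dynamics so that the claimed property (the initial velocity is exactly −∇J(θ_0)) can be checked numerically:

```python
import numpy as np

def step_size(gamma0, k):
    # Classical schedule: gamma_k = gamma_0 / sqrt(k + 1)
    return gamma0 / np.sqrt(k + 1)

def init_psi(theta0, grad0, alpha, beta):
    # psi_0 = (1 - alpha*beta) * theta_0 - (beta^2 - beta) * grad J(theta_0)
    # (signs reconstructed from the dynamics; see the velocity check below)
    return (1.0 - alpha * beta) * theta0 - (beta**2 - beta) * grad0

def initial_velocity(theta0, psi0, grad0, alpha, beta):
    # Drift of theta in the underlying (DIN) dynamics:
    #   (1/beta - alpha) * theta - psi / beta - beta * grad J(theta)
    return (1.0 / beta - alpha) * theta0 - psi0 / beta - beta * grad0
```

Substituting ψ_0 into the drift makes the θ_0-terms cancel and the gradient coefficients sum to −1, so the initial velocity reduces to −∇J(θ_0) for any α and β > 0, matching the quoted intent.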