Stochastic Gradient Descent as Approximate Bayesian Inference

Authors: Stephan Mandt, Matthew D. Hoffman, David M. Blei

JMLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "7. Experiments. We test our theoretical assumptions from Section 3 and find good experimental evidence that they are reasonable in some settings. We also investigate iterate averaging and show that the assumptions outlined in 6.2 result in samples from a close approximation to the posterior. We also compare against other approximate inference algorithms, including SGLD (Welling and Teh, 2011), NUTS (Hoffman and Gelman, 2014), and black-box variational inference (BBVI) using Gaussian reparametrization gradients (Kucukelbir et al., 2015). In Section 7.3 we show that constant SGD lets us optimize hyperparameters in a Bayesian model."
Researcher Affiliation | Collaboration | Stephan Mandt (EMAIL), Data Science Institute and Department of Computer Science, Columbia University, New York, NY 10025, USA; Matthew D. Hoffman (EMAIL), Adobe Research, Adobe Systems Incorporated, 601 Townsend Street, San Francisco, CA 94103, USA; David M. Blei (EMAIL), Department of Statistics and Department of Computer Science, Columbia University, New York, NY 10025, USA.
Pseudocode | Yes | Algorithm 1: The Iterate Averaging Stochastic Gradient sampler (IASG).
input: averaging window T = N/S, number of samples M, input for SGD.
for t = 1 to M·T do
    θ_t = θ_{t−1} − ϵ ĝ_S(θ_{t−1})  // perform an SGD step
    if t mod T = 0 then
        µ_{t/T} = (1/T) Σ_{t′=0}^{T−1} θ_{t−t′}  // average the T most recent iterates
    end
end
output: return samples {µ_1, . . . , µ_M}.
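The IASG procedure quoted in this row can be sketched as a short Python routine. This is an illustrative sketch, not the authors' code: the function name `iasg`, the toy quadratic objective, and the hyperparameter values below are assumptions, chosen only to show constant-rate SGD with iterate averaging in action.

```python
import numpy as np

def iasg(grad_minibatch, theta0, eps, T, M, rng):
    """Iterate Averaging Stochastic Gradient sampler (sketch).

    grad_minibatch(theta, rng): stochastic minibatch gradient estimate g_S(theta).
    eps: constant learning rate; T: averaging window (= N/S); M: number of samples.
    Returns M averaged iterates, each treated as one approximate posterior sample.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    window, samples = [], []
    for t in range(1, M * T + 1):
        theta = theta - eps * grad_minibatch(theta, rng)  # constant-rate SGD step
        window.append(theta.copy())
        if t % T == 0:  # every T steps, emit one sample
            samples.append(np.mean(window, axis=0))  # average the T most recent iterates
            window = []
    return samples

# Toy check (assumed setup): quadratic loss 0.5*||theta||^2 with Gaussian
# gradient noise; the averaged iterates should concentrate near the optimum at 0.
rng = np.random.default_rng(0)
noisy_grad = lambda th, rng: th + 0.1 * rng.standard_normal(th.shape)
samples = iasg(noisy_grad, np.ones(2), eps=0.1, T=200, M=5, rng=rng)
```

Averaging over a window of T = N/S iterates is what turns the stationary SGD iterates into (approximate) posterior samples; the constant learning rate keeps the chain stationary rather than converging to a point.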
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository.
Open Datasets | Yes | "Real-world data. We first considered the following data sets. The Wine Quality Data Set, containing N = 4,898 instances, 11 features, and one integer output variable (the wine rating). A data set of Protein Tertiary Structure, containing N = 45,730 instances, 8 features, and one output variable. The Skin Segmentation Data Set, containing N = 245,057 instances, 3 features, and one binary output variable. ... To this end, we experimented with a Bayesian multinomial logistic (a.k.a. softmax) regression model with normal priors. ... Real-world data. In all experiments, we applied this model to the MNIST dataset (60,000 training examples, 10,000 test examples, 784 features) and the cover type dataset (500,000 training examples, 81,012 testing examples, 54 features)."
Dataset Splits | Yes | "Real-world data. In all experiments, we applied this model to the MNIST dataset (60,000 training examples, 10,000 test examples, 784 features) and the cover type dataset (500,000 training examples, 81,012 testing examples, 54 features)."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper does not list any specific software libraries or tools with their version numbers.
Experiment Setup | Yes | "We rescaled the features to unit length and used mini-batch sizes of S = 100, S = 100, and S = 10,000 for the three data sets, respectively. The quadratic regularizer was 1. The constant learning rate was adjusted according to Eq. 15. ... For IASG and SGLD we used a minibatch size of S = 10 and an averaging window of N/S = 1000. The constant learning rate of IASG was ϵ = 0.003, and for SGLD we decreased the learning rate according to the Robbins-Monro schedule ϵ_t = ϵ_0/(1000 + t), where we found ϵ_0 = 10⁻³ to be optimal."
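The two step-size policies quoted in this row can be written out as a minimal sketch. The fractional form ϵ_t = ϵ_0/(1000 + t) is a reconstruction of the garbled Robbins-Monro schedule in the extraction, and the function names are invented for illustration:

```python
def iasg_rate(t):
    """Constant step size quoted for IASG (same at every step t)."""
    return 0.003

def sgld_rate(t, eps0=1e-3):
    """Robbins-Monro decay quoted for SGLD: eps_t = eps0 / (1000 + t)."""
    return eps0 / (1000 + t)

# IASG keeps its step size fixed, so the iterates remain a stationary
# process to be averaged; SGLD's step size decays toward zero over time.
sgld_rates = [sgld_rate(t) for t in (0, 1000, 10000)]
```

A decaying schedule like this satisfies the usual Robbins-Monro conditions (the steps sum to infinity while their squares do not), whereas the constant IASG rate deliberately does not converge, which is what the iterate-averaging sampler exploits.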