Stochastic Gradient Descent as Approximate Bayesian Inference
Authors: Stephan Mandt, Matthew D. Hoffman, David M. Blei
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 7 (Experiments): "We test our theoretical assumptions from Section 3 and find good experimental evidence that they are reasonable in some settings. We also investigate iterate averaging and show that the assumptions outlined in Section 6.2 result in samples from a close approximation to the posterior. We also compare against other approximate inference algorithms, including SGLD (Welling and Teh, 2011), NUTS (Hoffman and Gelman, 2014), and black-box variational inference (BBVI) using Gaussian reparametrization gradients (Kucukelbir et al., 2015). In Section 7.3 we show that constant SGD lets us optimize hyperparameters in a Bayesian model." |
| Researcher Affiliation | Collaboration | Stephan Mandt (EMAIL), Data Science Institute, Department of Computer Science, Columbia University, New York, NY 10025, USA; Matthew D. Hoffman (EMAIL), Adobe Research, Adobe Systems Incorporated, 601 Townsend Street, San Francisco, CA 94103, USA; David M. Blei (EMAIL), Department of Statistics and Department of Computer Science, Columbia University, New York, NY 10025, USA |
| Pseudocode | Yes | Algorithm 1: The Iterate Averaging Stochastic Gradient sampler (IASG). Input: averaging window T = N/S, number of samples M, inputs for SGD. For t = 1 to MT: θ_t = θ_{t−1} − ϵ ĝ_S(θ_{t−1}) (perform an SGD step); if t mod T = 0 then µ_{t/T} = (1/T) Σ_{t′=0}^{T−1} θ_{t−t′} (average the T most recent iterates). Output: return samples {µ_1, ..., µ_M}. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide a link to a code repository. |
| Open Datasets | Yes | Real-world data. We first considered the following data sets. The Wine Quality Data Set, containing N = 4,898 instances, 11 features, and one integer output variable (the wine rating). A data set of Protein Tertiary Structure, containing N = 45,730 instances, 8 features, and one output variable. The Skin Segmentation Data Set, containing N = 245,057 instances, 3 features, and one binary output variable. ...To this end, we experimented with a Bayesian multinomial logistic (a.k.a. softmax) regression model with normal priors. ...Real-world data. In all experiments, we applied this model to the MNIST dataset (60,000 training examples, 10,000 test examples, 784 features) and the cover type dataset (500,000 training examples, 81,012 testing examples, 54 features). |
| Dataset Splits | Yes | Real-world data. In all experiments, we applied this model to the MNIST dataset (60,000 training examples, 10,000 test examples, 784 features) and the cover type dataset (500,000 training examples, 81,012 testing examples, 54 features). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper does not list any specific software libraries or tools with their version numbers. |
| Experiment Setup | Yes | We rescaled the features to unit length and used mini-batches of size S = 100, S = 100, and S = 10,000 for the three data sets, respectively. The quadratic regularizer was 1. The constant learning rate was adjusted according to Eq. 15. ...For IASG and SGLD we used a minibatch size of S = 10 and an averaging window of N/S = 1000. The constant learning rate of IASG was ϵ = 0.003, and for SGLD we decreased the learning rate according to the Robbins-Monro schedule ϵ_t = ϵ_0/(1000 + t), where we found ϵ_0 = 10^3 to be optimal. |
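The IASG pseudocode quoted above (Algorithm 1) runs constant-rate SGD and emits the average of every window of T = N/S consecutive iterates as one approximate posterior sample. A minimal sketch, assuming a user-supplied stochastic gradient estimator `grad_estimator` (the paper's ĝ_S; the function name and test objective below are illustrative, not from the paper):

```python
import numpy as np

def iasg_samples(grad_estimator, theta0, eps, T, M):
    """Sketch of the Iterate Averaging Stochastic Gradient sampler (Algorithm 1).

    grad_estimator(theta): stochastic gradient hat{g}_S(theta) on a minibatch of size S.
    eps: constant SGD learning rate; T = N/S: averaging window; M: number of samples.
    """
    theta = np.asarray(theta0, dtype=float)
    samples, window = [], []
    for t in range(1, M * T + 1):
        theta = theta - eps * grad_estimator(theta)  # constant-rate SGD step
        window.append(theta)
        if t % T == 0:  # every T steps, average the T most recent iterates
            samples.append(np.mean(window, axis=0))
            window = []
    return samples  # approximate posterior samples {mu_1, ..., mu_M}
```

For a quadratic (Gaussian) target, the iterate averages concentrate near the posterior mode, which is the regime in which the paper shows the averaged iterates approximate posterior samples.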