How Deep Are Deep Gaussian Processes?

Authors: Matthew M. Dunlop, Mark A. Girolami, Andrew M. Stuart, Aretha L. Teckentrup

JMLR 2018

Reproducibility Variable Result LLM Response
Research Type: Experimental
"We also describe numerical experiments which illustrate the theory, and which demonstrate some of the limitations of the framework in the inference context, suggesting the need for further algorithmic innovation and theoretical understanding."
Researcher Affiliation: Academia
"Matthew M. Dunlop (EMAIL), Computing and Mathematical Sciences, Caltech, Pasadena, CA 91125, USA
Mark A. Girolami (EMAIL), Department of Mathematics, Imperial College London, London SW7 2AZ, UK, and The Alan Turing Institute, 96 Euston Road, London NW1 2DB, UK
Andrew M. Stuart (EMAIL), Computing and Mathematical Sciences, Caltech, Pasadena, CA 91125, USA
Aretha L. Teckentrup (EMAIL), School of Mathematics, University of Edinburgh, Edinburgh EH9 3FD, UK, and The Alan Turing Institute, 96 Euston Road, London NW1 2DB, UK"
Pseudocode: Yes
"Algorithm 1 (Non-Centred)
1. Fix β_0, ..., β_{N-1} ∈ (0, 1] and define B = diag(β_j). Choose an initial state ξ^(0) ∈ X, set u^(0) = T(ξ^(0)) ∈ X, and set k = 0.
2. Propose ξ̂^(k) = (I − B^2)^{1/2} ξ^(k) + B ζ^(k), where ζ^(k) ~ N(0, I).
3. Set ξ^(k+1) = ξ̂^(k) with probability α_k = min{1, exp(Φ(T(ξ^(k)); y) − Φ(T(ξ̂^(k)); y))}; otherwise set ξ^(k+1) = ξ^(k).
4. Set k ↦ k + 1 and go to 2."
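The non-centred algorithm above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, signature, and the choice of passing T (the transformation from white-noise coordinates to the field) and Φ (the negative log-likelihood) as callables are all assumptions made for the sketch.

```python
import numpy as np

def pcn_noncentred(T, Phi, y, xi0, beta, n_samples, rng=None):
    """Sketch of the non-centred pCN sampler (Algorithm 1).

    T    : map from white-noise coordinates xi to the field u = T(xi)
    Phi  : negative log-likelihood Phi(u; y)
    beta : per-coordinate jump sizes in (0, 1], the diagonal of B
    """
    rng = rng or np.random.default_rng()
    xi = np.asarray(xi0, dtype=float)
    beta = np.asarray(beta, dtype=float)
    sqrt_term = np.sqrt(1.0 - beta**2)        # (I - B^2)^{1/2} for diagonal B
    phi_cur = Phi(T(xi), y)
    samples = []
    for _ in range(n_samples):
        zeta = rng.standard_normal(xi.shape)   # zeta ~ N(0, I)
        xi_hat = sqrt_term * xi + beta * zeta  # pCN proposal (step 2)
        phi_hat = Phi(T(xi_hat), y)
        # Accept with probability min{1, exp(Phi_cur - Phi_hat)} (step 3)
        if np.log(rng.uniform()) < phi_cur - phi_hat:
            xi, phi_cur = xi_hat, phi_hat
        samples.append(T(xi))
    return np.array(samples)
```

Note that the acceptance probability involves only the likelihood potential Φ, not the prior: the pCN proposal is prior-reversible, which is what keeps the accept/reject step this simple.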
Open Source Code: No
The paper does not provide an explicit statement about open-source code availability or a link to a code repository.
Open Datasets: No
"We consider first the case D = (0, 1), where the forward map is given by a number of point evaluations: G_j(u) = u(x_j) for some sequence {x_j}_{j=1}^J ⊂ D. We compare the quality of reconstruction versus both the number of point evaluations and the number of levels in the deep Gaussian prior. We use the same parameters for the family of covariance operators as in subsection 4.2. The base layer u_0 is taken to be Gaussian with covariance of the form (15), with Γ(u) ≡ 20^2. The true unknown field u is given by the indicator function u = 1_{(0.3,0.7)}, shown in Figure 6. It is generated on a mesh of 400 points, and three data sets are created wherein it is observed on uniform grids of J = 25, 50 and 100 points, and corrupted by white noise with standard deviation γ = 0.02. Sampling is performed on a mesh of 200 points to avoid an inverse crime (Kaipio and Somersalo, 2006)."
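The data-generation recipe quoted above (an indicator-function truth on a fine mesh, observed at J uniformly spaced points and corrupted by white noise) can be sketched directly. The random seed and the use of interpolation for the point evaluations are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is illustrative, not from the paper
gamma = 0.02                    # observational noise standard deviation

# True field on a fine mesh of 400 points over D = (0, 1):
# the indicator function of the interval (0.3, 0.7)
x_fine = np.linspace(0, 1, 400)
u_true = ((x_fine > 0.3) & (x_fine < 0.7)).astype(float)

def make_dataset(J):
    """Observe u at J uniform points and corrupt with white noise."""
    x_obs = np.linspace(0, 1, J)
    G_u = np.interp(x_obs, x_fine, u_true)  # point evaluations G_j(u) = u(x_j)
    return x_obs, G_u + gamma * rng.standard_normal(J)

# Three data sets with increasing observation density, as in the paper
datasets = {J: make_dataset(J) for J in (25, 50, 100)}
```

Generating the truth on a finer mesh than the one used for sampling is what the quoted passage calls avoiding an "inverse crime": the inference never sees data produced by its own discretisation.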
Dataset Splits: No
The paper generates custom data on specified grids and observation points, but does not define training/validation/test splits. The "data sets" it creates differ only in observation density; they are not splits for ML model training and evaluation.
Hardware Specification: No
The paper does not describe the hardware used for its experiments; it mentions only that sampling was performed in MATLAB.
Software Dependencies: No
"In Figure 1, we show four independent realizations of the first seven layers u_0, ..., u_6, where u_0 is taken as a sample of the stationary Gaussian process with correlation kernel ρ_S. The domain D is here chosen as the interval (0, 1), and the sampling points are given by the uniform grid x_i = (i − 1)/256, for i = 1, ..., 257. Each column in Figure 1 corresponds to one realization, and each row corresponds to a given layer u_n, the first row showing u_0. We can clearly see the non-stationary behaviour in the samples when progressing through the levels. We note that the ergodicity of the chain is also reflected in the samples, with the distribution of the samples u_n looking similar for larger values of n. Figure 2 shows the same information as Figure 1, in the case where the domain D is (0, 1)^2 and the sampling points are the tensor product of the one-dimensional points x_i^1 = (i − 1)/64, for i = 1, ..., 65. To generate the samples, we use the command mvnrnd in MATLAB, and when plotting the samples, we use linear interpolation."
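The layer-by-layer sampling described above can be sketched in Python, with NumPy's `multivariate_normal` playing the role of MATLAB's `mvnrnd`. The covariance family of the paper is not reproduced here; a Paciorek–Schervish (Gibbs) non-stationary kernel with a hypothetical length-scale map `ell(u)` stands in for it, purely to illustrate how each layer is drawn as a Gaussian conditioned on the previous one.

```python
import numpy as np

def sample_layers(n_layers, grid, ell, rng):
    """Draw one realization of layers u_0, ..., u_{n_layers-1} on a 1-D grid.

    Each layer is Gaussian given the previous one. A Gibbs (Paciorek-
    Schervish) kernel with pointwise length-scales ell(u) is used as a
    hypothetical stand-in for the paper's covariance family.
    """
    diff = grid[:, None] - grid[None, :]
    n = len(grid)
    u = np.zeros(n)  # conditioning field for the base layer
    layers = []
    for _ in range(n_layers):
        L = ell(u)                     # length-scales from the previous layer
        Li, Lj = L[:, None], L[None, :]
        denom = Li**2 + Lj**2
        # Gibbs kernel: positive semi-definite for any positive L
        C = np.sqrt(2 * Li * Lj / denom) * np.exp(-diff**2 / denom)
        C += 1e-8 * np.eye(n)          # jitter for numerical stability
        u = rng.multivariate_normal(np.zeros(n), C, check_valid="ignore")
        layers.append(u)
    return layers

# Uniform grid x_i = (i - 1)/256, i = 1, ..., 257, as in the quoted passage;
# the length-scale map below is an arbitrary illustrative choice.
grid = np.array([(i - 1) / 256 for i in range(1, 258)])
layers = sample_layers(7, grid, lambda u: 0.05 + 0.1 / (1 + u**2), np.random.default_rng(1))
```

Because each layer's covariance is built from the realized values of the layer below, the draws become non-stationary from the second layer onward, which is the qualitative behaviour the quoted passage describes in Figures 1 and 2.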
Experiment Setup: Yes
"For numerical experiments, we take F(u) = min{F_− + a e^{b u^2}, F_+} for some F_+, F_−, a, b > 0. In particular, in one spatial dimension we take F_+ = 150^2, F_− = 200, a = 100 and b = 2. In two dimensions, we take F_+ = 150^2, F_− = 50, a = 25 and b = 0.3. We take α = 4 in both cases, and choose σ such that E[u(x)^2] ≈ 1. Sampling is performed on a mesh of 200 points to avoid an inverse crime (Kaipio and Somersalo, 2006). 10^6 samples are generated per chain, with the first 2 × 10^5 discarded as burn-in when calculating means. The jump parameters β_j are adaptively tuned to keep acceptance rates close to 30%. It is generated on a uniform square mesh of 2^14 points, and two data sets are created wherein it is observed on uniform square grids of J = 2^10 and 2^8 points, and corrupted by white noise with standard deviation γ = 0.02. Sampling is performed on a mesh of 2^12 points to again avoid an inverse crime. 4 × 10^5 samples are generated per chain, with the first 2 × 10^5 discarded as burn-in when calculating means. Again the jump parameters β_j are adaptively tuned to keep acceptance rates close to 30%."
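The bounded function F from the experiment setup is simple to write down. The exponent sign in F(u) = min{F_− + a e^{b u^2}, F_+} is reconstructed from the garbled source (a growing exponential capped at F_+ is the reading consistent with the stated parameters); the values below are the one-dimensional ones quoted above.

```python
import numpy as np

# One-dimensional parameters from the experiment setup:
# F_+ = 150^2, F_- = 200, a = 100, b = 2
F_plus, F_minus, a, b = 150.0**2, 200.0, 100.0, 2.0

def F(u):
    """F(u) = min{F_- + a * exp(b * u^2), F_+}: grows with |u| until capped."""
    return np.minimum(F_minus + a * np.exp(b * np.asarray(u) ** 2), F_plus)
```

At u = 0 this gives F_− + a = 300, and for |u| large the exponential saturates the cap, so F stays within [300, 22500] here.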