Integrated Variational Fourier Features for Fast Spatial Modelling with Gaussian Processes

Authors: Talay M Cheema, Carl Edward Rasmussen

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Finally, in Section 5 we evaluate our method experimentally, showing a significant speedup relative to SGPR in low dimensions, and competitive performance compared to other fast methods, with broader applicability. 5 Experiments We seek to show that IFF gives a significant speedup for large datasets in low dimensions, with a particular focus on spatial modelling. Amongst other fast sparse methods, we compare against VFF and B-spline features. For spherical harmonics, learning independent lengthscales for each dimension is incompatible with precomputation. In any case, we found that we were unable to learn reasonable hyperparameters with that method in our setting unless the number of features was very small. For a conventional (no precompute) sparse baseline, we use inducing points sampled according to the scheme of Burt et al. (2020a). For our synthetic experiments, we also used inducing points initialised using k-means and kept fixed. For the real-world spatial datasets, we also tested SKI, due to its reputation for fast performance and its fairly robust implementation. 5.1 Synthetic datasets First we consider a synthetic setting where the assumptions of Theorem 4.2 hold. We sample from a GP with a Gaussian covariance function and compare the speed of variational methods in 1 and 2 dimensions. We use a small (N = 10 000) dataset so that we can easily evaluate the log marginal likelihood at the learnt hyperparameters. Where possible, we use the same (squared exponential) model for learning; for VFF, we use a Matérn-5/2 kernel in 1D and a tensor product of Matérn-5/2 covariance functions in 2D, since this is the best supported approximation to a Gaussian kernel. Further details are in Appendix D. Additionally, in the 2D setting, we use both the naive set of features (a regular, rectangular grid) and the refined set of features described in Section 3.
IFF generally has a slightly smaller gap to the marginal likelihood at the learnt optimum for any M than the other fast variational methods (Figure 3, bottom row), but because the O(NM²) work is done only once, it and the other fast sparse methods are much faster to run than inducing points (Figure 3, top two rows). Note the logarithmic time scale on the plots: for a specified threshold on the gap to the marginal likelihood, IFF is often around 30 times faster than inducing points. 5.2 Real World Datasets We now compare training objective and test performance on three real-world spatial modelling datasets of increasing size and practical interest. We plot the root mean squared error (RMSE) and negative log predictive density (NLPD) on the test set, along with the training objective and run time, in Figures 5 and 6, using five uniformly random 80/20 train/test splits. For inducing points, we always use the method of Burt et al. (2020b). The time plotted is normalised per split against inducing points. Further training and dataset details are in Appendix D.
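The synthetic data generation quoted above (drawing N points from a GP with a Gaussian, i.e. squared exponential, covariance function and adding observation noise) can be sketched in plain NumPy. This is an illustrative sketch, not the authors' code: the function names are invented, and a much smaller N is used than the paper's 10 000 points.

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared exponential (Gaussian) covariance:
    # k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def sample_gp(X, lengthscale=1.0, variance=1.0, noise_var=0.1, seed=0):
    # Draw one noisy sample f(X) + eps from the GP prior
    # via a jittered Cholesky factor of the kernel matrix.
    rng = np.random.default_rng(seed)
    K = se_kernel(X, X, lengthscale, variance) + 1e-8 * np.eye(len(X))
    f = np.linalg.cholesky(K) @ rng.standard_normal(len(X))
    return f + np.sqrt(noise_var) * rng.standard_normal(len(X))

# 2D inputs, small N for illustration (the paper uses N = 10 000)
X = np.random.default_rng(1).uniform(-3, 3, size=(500, 2))
y = sample_gp(X)
```

The Cholesky-based sampler costs O(N³), which is exactly why evaluating the true log marginal likelihood is only feasible for the "small" synthetic datasets the excerpt describes.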
Researcher Affiliation Academia Talay M Cheema EMAIL Department of Engineering University of Cambridge Carl Edward Rasmussen Department of Engineering University of Cambridge
Pseudocode No The paper describes methods and mathematical formulations but does not include any explicitly labeled pseudocode or algorithm blocks. Procedural descriptions are in paragraph form.
Open Source Code No The paper states: "We include the code for the experiments and figures which can be referred to for full details." However, this statement is ambiguous: it provides no repository link and does not confirm that the code is included in publicly accessible supplementary materials. The paper mentions third-party tools (Tensorflow 2, gpflow, gpytorch), but these are not a release of the authors' own code.
Open Datasets Yes The precipitation dataset consists of regularly gridded (in latitude and longitude) modelled precipitation normals in mm for the contiguous United States for 1 January 2021 (publicly available with further documentation at https://water.weather.gov/precip/download.php; note the data at the source is in inches). The temperature dataset is the change in mean land surface temperature (°C) over the year ending February 2021 relative to the base year ending February 1961 (publicly available from https://data.giss.nasa.gov/gistemp/maps). The house price dataset is a snapshot of house prices in England and Wales, which is not regularly gridded. We use a random 20% of the full dataset and target the log price to compress the dynamic range. It is based on the publicly available UK house price index (https://landregistry.data.gov.uk/app/ukhpi), and we enclose the exact dataset we use.
Dataset Splits Yes We plot the root mean squared error (RMSE) and negative log predictive density (NLPD) on the test set along with the training objective and run time in Figures 5 and 6 using five uniformly random 80/20 train/test splits.
Hardware Specification No The paper states: "The experiments were generally run on CPU to avoid memory-related distortion of the results, with the exception of SKI, which was run on GPU since it depends on GPU execution for faster MVMs." This only refers to generic CPU and GPU hardware without providing specific model numbers or specifications.
Software Dependencies Yes We use the publicly available Tensorflow 2 implementation for B-splines.
Experiment Setup Yes For the synthetic experiment, we generated N = 10 000 data points in 1 and 2 dimensions by sampling from a GP with a Gaussian or Matérn-5/2 covariance function, with unit (or identity) lengthscale, unit variance, and set the SNR to 0.774... We then fit each model plotted, training using L-BFGS and using the same initialisation in each case... The initial values were lengthscales of 0.2, and unit signal and noise variances. In all cases, we normalise both the inputs and targets to unit mean and standard deviation in each dimension. Guided by the synthetic results, we use the full rectangular grid of frequencies for VFF, but use a spherical mask for IFF. We set ε as described in the main text. Similarly, for B-splines and VFF, we set the interval [a, b] to be 0.1 wider than the data (that is, a_d = min_{n,d} x_{n,d} − 0.1, b_d = max_{n,d} x_{n,d} + 0.1); we use fourth-order B-splines.
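The fitting procedure in this excerpt, normalising the data and then maximising the log marginal likelihood with L-BFGS from the stated initial values (lengthscale 0.2, unit signal and noise variances), can be sketched with NumPy and SciPy. This is a hedged sketch, not the authors' implementation: the paper uses TensorFlow 2/gpflow with sparse approximations, whereas this example fits an exact GP on invented toy data.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(X1, X2, lengthscale, variance):
    # Squared exponential covariance matrix between two sets of inputs.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def neg_log_marginal_likelihood(log_params, X, y):
    # Unconstrained optimisation over log(lengthscale, signal var, noise var).
    ls, sv, nv = np.exp(log_params)
    K = se_kernel(X, X, ls, sv) + (nv + 1e-6) * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -log p(y) = 0.5 y^T K^{-1} y + 0.5 log|K| + 0.5 N log(2 pi)
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

# Invented toy data standing in for the paper's datasets.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(100)

# Normalise inputs and targets per dimension, as described in the excerpt.
X = (X - X.mean(0)) / X.std(0)
y = (y - y.mean()) / y.std()

# Initialise at lengthscale 0.2 and unit signal/noise variances, then run L-BFGS.
x0 = np.log([0.2, 1.0, 1.0])
res = minimize(neg_log_marginal_likelihood, x0, args=(X, y), method="L-BFGS-B")
ls, sv, nv = np.exp(res.x)
```

Working in log-parameters keeps the lengthscale and variances positive without explicit constraints; SciPy's L-BFGS-B approximates gradients by finite differences here, whereas the gpflow implementation would use automatic differentiation.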