Learning Scalable Deep Kernels with Recurrent Structure
Authors: Maruan Al-Shedivat, Andrew Gordon Wilson, Yunus Saatchi, Zhiting Hu, Eric P. Xing
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate state-of-the-art performance on several benchmarks, and thoroughly investigate a consequential autonomous driving application, where the predictive uncertainties provided by GP-LSTM are uniquely valuable. [...] and present an extensive empirical evaluation of our model. Specifically, we apply our model to a number of tasks, including system identification, energy forecasting, and self-driving car applications. Quantitatively, the model is assessed on the data ranging in size from hundreds of points to almost a million with various signal-to-noise ratios demonstrating state-of-the-art performance and linear scaling of our approach. |
| Researcher Affiliation | Academia | Maruan Al-Shedivat EMAIL Carnegie Mellon University Andrew Gordon Wilson EMAIL Cornell University Yunus Saatchi EMAIL Zhiting Hu EMAIL Carnegie Mellon University Eric P. Xing EMAIL Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1 Semi-stochastic alternating gradient descent. [...] Algorithm 2 Semi-stochastic asynchronous gradient descent. |
| Open Source Code | Yes | We release our code as a library at: http://github.com/alshedivat/keras-gp. This library implements the ideas in this paper as well as deep kernel learning (Wilson et al., 2016a) via a Gaussian process layer that can be added to arbitrary deep architectures and deep learning frameworks, following the Keras API specification. |
| Open Datasets | Yes | In the first set of experiments, we used publicly available nonlinear system identification datasets: Actuator6 (Sjöberg et al., 1995) and Drives7 (Wigren, 2010). [...] The smart grid data were taken from Global Energy Forecasting Kaggle competitions organized in 2012. [...] The dataset is proprietary. It was released in part for public use under the Creative Commons Attribution 3.0 license: http://archive.org/details/comma-dataset. |
| Dataset Splits | Yes | For the smart grid prediction tasks we used LSTM and GP-LSTM models with 48 hour time lags and were predicting the target values one hour ahead. LSTM and GP-LSTM were trained with one or two layers and 32 to 256 hidden units. The best models were selected on 25% of the training data used for validation. For autonomous driving prediction tasks, we used the same architectures but with 128 time steps of lag (1.28 s). [...] We considered the data from the first trip for training and from the second trip for validation and testing. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) are mentioned for the experiments. The paper discusses scalability in terms of time per epoch and time per test point, but not the underlying hardware. |
| Software Dependencies | No | Recurrent parts of each model were implemented using the Keras11 library. We extended Keras with the GP layer and developed a backend engine based on the GPML library12. Our approach allows us to take full advantage of the functionality available in Keras and GPML, e.g., use automatic differentiation for the recurrent part of the model. Our code is available at http://github.com/alshedivat/keras-gp/. 11. http://www.keras.io 12. http://www.gaussianprocess.org/gpml/code/matlab/doc/ While Keras and GPML are mentioned, specific version numbers are not provided; these would be needed for a fully reproducible description of the software dependencies. |
| Experiment Setup | Yes | For both smart grid prediction tasks we used LSTM and GP-LSTM models with 48 hour time lags and were predicting the target values one hour ahead. LSTM and GP-LSTM were trained with one or two layers and 32 to 256 hidden units. The best models were selected on 25% of the training data used for validation. For autonomous driving prediction tasks, we used the same architectures but with 128 time steps of lag (1.28 s). All models were regularized with dropout (Srivastava et al., 2014; Gal and Ghahramani, 2016b). [...] The LSTM architecture was the same as described in the previous section: it was transforming multi-dimensional sequences of inputs to a two-dimensional representation. We trained the model for 10 epochs on 10%, 20%, 40%, and 80% of the training set with 100, 200, and 400 inducing points per dimension and measured the average training time per epoch and the average prediction time per testing point. Table 5: Summary of the feedforward and recurrent neural architectures and the corresponding hyperparameters used in the experiments. GP-based models used the same architectures as their non-GP counterparts. Activations are given for the hidden units; vanilla neural nets used linear output activations. |
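The Pseudocode row cites "Algorithm 1: Semi-stochastic alternating gradient descent", which alternates between stochastic minibatch updates of the network weights and full-data updates of the GP hyperparameters. The following toy numpy sketch illustrates that alternation only; the linear feature map, RBF marginal-likelihood objective, and finite-difference gradients here are stand-ins (the paper's implementation uses an LSTM, structured-kernel inference, and automatic differentiation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h_dim = 40, 3, 2
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def nlml(W, log_ell, Xb, yb):
    """GP negative log marginal likelihood with an RBF kernel on
    features produced by a toy linear 'network' h = X @ W."""
    H = Xb @ W
    sq = ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * sq / np.exp(2 * log_ell)) + 0.1 * np.eye(len(yb))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, yb))
    return 0.5 * yb @ alpha + np.log(np.diag(L)).sum()

def fd_grad(f, theta, eps=1e-5):
    """Finite-difference gradient (a sketch; the paper uses autodiff)."""
    g = np.zeros_like(theta)
    for i in np.ndindex(theta.shape):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

W = rng.normal(size=(d, h_dim)) * 0.1
log_ell = np.array(0.0)
lr, losses = 0.05, []
for step in range(30):
    # (1) stochastic step on the network weights, using a minibatch
    idx = rng.choice(n, size=10, replace=False)
    W -= lr * fd_grad(lambda w: nlml(w, log_ell, X[idx], y[idx]), W)
    # (2) full-data step on the GP hyperparameter (log lengthscale)
    g_ell = fd_grad(lambda le: nlml(W, le, X, y), np.atleast_1d(log_ell))
    log_ell = log_ell - lr * g_ell[0]
    losses.append(nlml(W, log_ell, X, y))

print(losses[0], losses[-1])
```

The two-phase loop mirrors the algorithm's structure: the expensive full-dataset computation is confined to the (few) kernel hyperparameters, while the many network weights are updated cheaply on minibatches.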
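The Dataset Splits and Experiment Setup rows describe 48-hour input lags with a one-hour-ahead target and a validation set carved from 25% of the training data. A minimal sketch of that windowing and split, assuming a 1-D hourly series (the `make_lagged_windows` helper and the toy signal are hypothetical; only the 48/1 lag/horizon and 25% figures come from the paper):

```python
import numpy as np

def make_lagged_windows(series, lag, horizon):
    """Slice a 1-D series into (window, target) pairs: each input is
    `lag` consecutive steps; the target lies `horizon` steps after
    the window ends."""
    X, y = [], []
    for t in range(len(series) - lag - horizon + 1):
        X.append(series[t:t + lag])
        y.append(series[t + lag + horizon - 1])
    return np.array(X), np.array(y)

# toy hourly signal standing in for the smart-grid load data
hourly_load = np.sin(np.arange(500) * 2 * np.pi / 24)
X, y = make_lagged_windows(hourly_load, lag=48, horizon=1)
print(X.shape, y.shape)

# hold out 25% of the training windows for model selection
split = int(0.75 * len(X))
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]
```

For the autonomous-driving tasks the same construction would apply with `lag=128` (1.28 s at 100 Hz) and per-trip splits rather than a fractional holdout.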