Deep Learning for Bayesian Optimization of Scientific Problems with High-Dimensional Structure

Authors: Samuel Kim, Peter Y. Lu, Charlotte Loh, Jamie Smith, Jasper Snoek, Marin Soljačić

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type Experimental We demonstrate BO on a number of realistic problems in physics and chemistry, including topology optimization of photonic crystal materials using convolutional neural networks, and chemical property optimization of molecules using graph neural networks. On these complex tasks, we show that neural networks often outperform GPs as surrogate models for BO in terms of both sampling efficiency and computational cost.
Researcher Affiliation Collaboration Samuel Kim (EMAIL), Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology; Peter Y. Lu, Department of Physics, Massachusetts Institute of Technology; Charlotte Loh, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology; Jamie Smith, Google Research; Jasper Snoek, Google Research; Marin Soljačić (EMAIL), Department of Physics, Massachusetts Institute of Technology
Pseudocode Yes Algorithm 1: Bayesian optimization with auxiliary information
1: Input: labelled dataset D_train = {(x_n, z_n, y_n)}_{n=1}^{N_start=5}
2: for N = 5 to 1000 do
3:   Train M: X → Z on D_train
4:   Form an unlabelled dataset X_pool
5:   Find x_{N+1} = arg max_{x ∈ X_pool} α(x; M, D_train)
6:   Label the data z_{N+1} = g(x_{N+1}), y_{N+1} = h(z_{N+1})
7:   D_train = D_train ∪ {(x_{N+1}, z_{N+1}, y_{N+1})}
8: end for
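To make the loop in Algorithm 1 concrete, the following is a minimal runnable sketch of BO with auxiliary information. The 1-nearest-neighbour surrogate and the distance-based exploration bonus are illustrative placeholders only, not the paper's actual neural-network surrogates or acquisition functions; `bo_with_aux` and its arguments are hypothetical names.

```python
import numpy as np

def bo_with_aux(g, h, x_pool, n_start=5, n_iter=20, rng=None):
    """Toy sketch of Algorithm 1: BO with auxiliary information z = g(x),
    objective y = h(z). A 1-nearest-neighbour predictor stands in for the
    paper's model M: X -> Z, and the acquisition is predicted objective
    plus a distance-based exploration bonus (placeholders, not the
    paper's choices)."""
    rng = rng or np.random.default_rng(0)
    x_pool = np.asarray(x_pool, dtype=float)
    # Initial labelled dataset D_train of size n_start
    idx = list(rng.choice(len(x_pool), n_start, replace=False))
    Z = [g(x_pool[i]) for i in idx]
    Y = [h(z) for z in Z]
    for _ in range(n_iter):
        X_train = x_pool[idx]

        def predict_y(x):  # surrogate M followed by the cheap map h
            j = int(np.argmin(np.abs(X_train - x)))
            return h(Z[j])

        def acq(x):  # exploit predicted y, explore far from labelled data
            return predict_y(x) + np.min(np.abs(X_train - x))

        # Maximize the acquisition over the unlabelled pool
        scores = [acq(x) if i not in idx else -np.inf
                  for i, x in enumerate(x_pool)]
        i_next = int(np.argmax(scores))
        idx.append(i_next)
        Z.append(g(x_pool[i_next]))  # label z_{N+1} = g(x_{N+1})
        Y.append(h(Z[-1]))           # label y_{N+1} = h(z_{N+1})
    return x_pool[idx], np.array(Y)
```

For example, with g the identity and h(z) = -(z - 0.7)^2 over a grid on [0, 1], the loop concentrates its samples near the optimum at 0.7.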
Open Source Code Yes We make our datasets and code publicly available at https://github.com/samuelkim314/DeepBO
Open Datasets Yes Here we focus on the QM9 dataset (Ruddigkeit et al., 2012; Ramakrishnan et al., 2014), which consists of 133,885 small organic molecules along with their geometric, electronic, and thermodynamic quantities calculated with DFT.
Dataset Splits No The paper uses an iterative Bayesian optimization process in which the training set is grown incrementally: D_train = D_train ∪ {(x_{N+1}, z_{N+1}, y_{N+1})}. For validation, Appendix A.5.3 states: "we track various metrics of the model during BO on a validation dataset with 1000 randomly sampled data points." However, the paper does not describe the specific splits for the main training datasets (nanoparticle, photonic crystal, or the initial QM9 pool) or how they are partitioned for reproducibility.
Hardware Specification Yes All experiments were carried out on systems with NVIDIA Volta V100 GPUs and Intel Xeon Gold 6248 CPUs.
Software Dependencies No The paper mentions several software components, including "TensorFlow v1", the "Adam optimizer", the "GPyOpt library", the "dlib library", the "NLopt library", the "pycma library", and the "Neural Tangents library". However, specific version numbers are not consistently provided for these tools or for the programming language (e.g., Python) itself.
Experiment Setup Yes Unless otherwise stated, we set N_MC = 30. All BNNs other than the infinitely-wide networks are implemented in TensorFlow v1. Models are trained using the Adam optimizer with a cosine annealing learning rate schedule and a base learning rate of 10^-3 (Loshchilov & Hutter, 2016). All hidden layers use ReLU as the activation function, and no activation function is applied to the output layer. In particular, we re-train the BNN using 10 epochs.
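The cosine annealing schedule cited above (Loshchilov & Hutter, 2016) can be sketched as a simple function of the training step. This is a minimal sketch of the schedule shape only, assuming a single decay cycle without warm restarts; the function name and `min_lr` parameter are illustrative, not from the paper.

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine-annealed learning rate: decays smoothly from base_lr
    (the paper's 10^-3) at step 0 to min_lr at total_steps."""
    frac = min(step / total_steps, 1.0)  # progress through the schedule
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * frac))
```

At the halfway point the schedule is exactly midway between the two rates, e.g. `cosine_annealing_lr(50, 100)` gives 5e-4 with the defaults.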