Variational Search Distributions

Authors: Dan Steinberg, Rafael Oliveira, Cheng Soon Ong, Edwin Bonilla

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We derive asymptotic convergence rates for learning the true conditional generative distribution of designs with certain configurations of our method. After illustrating the generative model on images, we empirically demonstrate that VSD can outperform existing baseline methods on a set of real sequence-design problems in various protein and DNA/RNA engineering tasks.
Researcher Affiliation | Academia | Daniel M. Steinberg, Rafael Oliveira, Cheng Soon Ong & Edwin V. Bonilla, Data61, CSIRO, Australia (EMAIL)
Pseudocode | Yes | The complete VSD algorithm is given in Algorithm 1 and depicted in Figure B.1.
Open Source Code | Yes | For the code implementing the models and experiments in this paper, please see https://github.com/csiro-funml/variationalsearch.
Open Datasets | Yes | The first of these tasks (Sec. 4.2) is to generate as many unique, fit sequences as possible using the datasets DHFR (Papkou et al., 2023), TrpB (Johnston et al., 2024) and TFBIND8 (Barrera et al., 2016). These datasets contain near-complete evaluations of X, and to our knowledge DHFR and TrpB are novel in the machine learning literature. The second (Sec. 4.3) is a more traditional black-box optimization task of finding the maximum of an unknown function, using the datasets AAV (Bryant et al., 2021), GFP (Sarkisyan et al., 2016) and the biologically inspired Ehrlich functions (Stanton et al., 2024).
Dataset Splits | No | The paper reports initial training-set sizes for CPEs (e.g., "randomly selected Ntrain = 2000") and notes the use of "combinatorially (near) complete datasets," but it does not specify explicit train/validation/test splits (e.g., percentages or exact counts per split) for the datasets used in the main experiments, nor how the data are partitioned when evaluating the reported metrics such as precision and recall.
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models, memory specifications, or cloud computing instance types.
Software Dependencies | No | The paper mentions using Adam for optimization, PyTorch syntax for CPE architectures, and the POLI and POLI-BASELINES software packages. However, no version numbers are provided for PyTorch or any other critical software libraries, which is essential for a reproducible description of ancillary software.
Experiment Setup | Yes | For the biological sequence experiments we run a predetermined number of experimental rounds, T = 10, or 32 for the Ehrlich functions. We set the batch size to B = 128, and use five different seeds for random initialization. ... We set τ to be that of the wild-type sequences in the DHFR and TrpB datasets, and use τ = 0.75 for TFBIND8. ... AAV: p0 = 0.8, η = 0.7, ymax = 5; GFP: p0 = 0.8, η = 0.7, ymax = 1.9. We aim for pT = 0.99. ... For the Ehrlich function experiment... B = 128, T = 32... with p0 = 0.5 and η = 0.87 (so pT = 0.99). ... We optimize VSD, CbAS, DbAS and BORE for a minimum of 3000 iterations each round (5000 for all experiments but the Ehrlich functions) using Adam (Kingma & Ba, 2014).
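The setup row above reports endpoints of a threshold-quantile annealing schedule (p0, η, and a target pT ≈ 0.99) without stating the formula. A geometric schedule of the form p_t = 1 − η^t (1 − p0) is one plausible reconstruction, since it reproduces both reported settings: p0 = 0.5, η = 0.87 gives p_32 ≈ 0.99, and p0 = 0.8, η = 0.7 gives p_10 ≈ 0.99. This sketch is an assumption for illustration, not the paper's confirmed schedule:

```python
def fitness_quantile_schedule(p0: float, eta: float, t: int) -> float:
    """Hypothetical geometric annealing of the fitness-threshold quantile.

    Assumes p_t = 1 - eta**t * (1 - p0), i.e. the gap to 1 shrinks by a
    factor of eta each round. Chosen only because it matches the reported
    endpoints; the paper's exact schedule may differ.
    """
    return 1.0 - (eta ** t) * (1.0 - p0)

# Ehrlich-function settings from the table: p0 = 0.5, eta = 0.87, T = 32
print(round(fitness_quantile_schedule(0.5, 0.87, 32), 3))  # -> 0.994

# AAV/GFP settings: p0 = 0.8, eta = 0.7, T = 10
print(round(fitness_quantile_schedule(0.8, 0.7, 10), 3))  # -> 0.994
```

Both configurations land just above the stated target of pT = 0.99, which is what makes this particular parameterization a consistent guess.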