Variational Search Distributions

Authors: Dan Steinberg, Rafael Oliveira, Cheng Soon Ong, Edwin Bonilla

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We derive asymptotic convergence rates for learning the true conditional generative distribution of designs with certain configurations of our method. After illustrating the generative model on images, we empirically demonstrate that VSD can outperform existing baseline methods on a set of real sequence-design problems in various protein and DNA/RNA engineering tasks.
Researcher Affiliation | Academia | Daniel M. Steinberg, Rafael Oliveira, Cheng Soon Ong & Edwin V. Bonilla, Data61, CSIRO, Australia (EMAIL)
Pseudocode | Yes | The complete VSD algorithm is given in Algorithm 1 and depicted in Figure B.1.
Open Source Code | Yes | For the code implementing the models and experiments in this paper, please see https://github.com/csiro-funml/variationalsearch.
Open Datasets | Yes | The first of these tasks (Sec. 4.2) is to generate as many unique, fit sequences as possible using the datasets DHFR (Papkou et al., 2023), TrpB (Johnston et al., 2024) and TFBIND8 (Barrera et al., 2016). These datasets contain near-complete evaluations of X, and to our knowledge DHFR and TrpB are novel in the machine learning literature. The second (Sec. 4.3) is a more traditional black-box optimization task of finding the maximum of an unknown function, using the datasets AAV (Bryant et al., 2021), GFP (Sarkisyan et al., 2016) and the biologically inspired Ehrlich functions (Stanton et al., 2024).
Dataset Splits | No | The paper reports initial training-set sizes for CPEs (e.g., "randomly selected Ntrain = 2000") and notes the use of "combinatorially (near) complete datasets," but it does not specify explicit train/validation/test splits (e.g., percentages or exact counts per split) for the datasets used in the main experiments, nor how the data are partitioned when evaluating the reported metrics such as precision and recall.
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models, memory specifications, or cloud computing instance types.
Software Dependencies | No | The paper mentions using Adam for optimization, PyTorch syntax for CPE architectures, and the POLI and POLI-BASELINES software packages. However, no version numbers are provided for PyTorch or any other critical software libraries, which is essential for a reproducible description of ancillary software.
Experiment Setup | Yes | For the biological sequence experiments we run a predetermined number of experimental rounds, T = 10, or 32 for the Ehrlich functions. We set the batch size to B = 128, and use five different seeds for random initialization. ... We set τ to be that of the wild-type sequences in the DHFR and TrpB datasets, and use τ = 0.75 for TFBIND8. ... AAV: p0 = 0.8, η = 0.7, ymax = 5; GFP: p0 = 0.8, η = 0.7, ymax = 1.9. We aim for pT = 0.99. ... For the Ehrlich function experiment... B = 128, T = 32... with p0 = 0.5 and η = 0.87 (so pT = 0.99). ... We optimize VSD, CbAS, DbAS and BORE for a minimum of 3000 iterations each round (5000 for all experiments but the Ehrlich functions) using Adam (Kingma & Ba, 2014).
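The setup row above reports endpoints of a threshold-quantile annealing schedule (p0, η, and a target pT ≈ 0.99) without stating the formula. A geometric schedule of the form p_t = 1 − η^t (1 − p0) is one plausible reconstruction, since it reproduces both reported settings: p0 = 0.5, η = 0.87 gives p_32 ≈ 0.99, and p0 = 0.8, η = 0.7 gives p_10 ≈ 0.99. This sketch is an assumption for illustration, not the paper's confirmed schedule:

```python
def fitness_quantile_schedule(p0: float, eta: float, t: int) -> float:
    """Hypothetical geometric annealing of the fitness-threshold quantile.

    Assumes p_t = 1 - eta**t * (1 - p0), i.e. the gap to 1 shrinks by a
    factor of eta each round. Chosen only because it matches the reported
    endpoints; the paper's exact schedule may differ.
    """
    return 1.0 - (eta ** t) * (1.0 - p0)

# Ehrlich-function settings from the table: p0 = 0.5, eta = 0.87, T = 32
print(round(fitness_quantile_schedule(0.5, 0.87, 32), 3))  # -> 0.994

# AAV/GFP settings: p0 = 0.8, eta = 0.7, T = 10
print(round(fitness_quantile_schedule(0.8, 0.7, 10), 3))  # -> 0.994
```

Both configurations land just above the stated target of pT = 0.99, which is what makes this particular parameterization a consistent guess.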