MITIGATING OVER-EXPLORATION IN LATENT SPACE OPTIMIZATION USING LES
Authors: Omer Ronen, Ahmed Imtiaz Humayun, Richard Baraniuk, Randall Balestriero, Bin Yu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation of LES-constrained LSO across thirty optimization tasks, spanning twenty-two VAE models and five benchmark problems, demonstrates that LES consistently enhances solution quality while maintaining high objective values. Specifically, in 19 of the 30 LSO experiments, our method either finds the best solution on average or achieves a solution within 1 standard deviation of the best across 10 independent runs, outperforming the six alternative methods we considered by 19% (Tables 20 and 21). |
| Researcher Affiliation | Academia | Omer Ronen 1, Ahmed Imtiaz Humayun 2, Richard Baraniuk 2, Randall Balestriero 3, Bin Yu 1. 1UC Berkeley, 2Rice University, 3Brown University. Correspondence to: Omer Ronen <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (Latent Space Optimization): for t = 1 to T do: (1) fit a surrogate model f̂ to the encoded dataset D_z; (2) generate a new batch of query points by optimizing a chosen acquisition function A: z(new) = arg max_z A_f̂(z) (Eq. 3); (3) decode x(new) = G_θ(z(new)), evaluate the corresponding true objective values y(new) = M(x(new)), and update D_z with (z(new), y(new)). |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the methodology described. It mentions third-party tools like 'rd_filters' and 'BoTorch' that were used. |
| Open Datasets | Yes | The VAEs for the Expressions dataset, sourced from Kusner et al. (2017)... The SMILES VAEs were trained on the ZINC250k dataset... For the SELFIES VAEs, we used a subset of approximately 200k molecules from the ZINC250k dataset... all of which are part of the Guacamol benchmarks (Brown et al., 2019). |
| Dataset Splits | No | The paper mentions training VAEs on 80k data points for Expressions, 250k for SMILES, and 200k for SELFIES, and using 500 or 1500 data points for initializing the Gaussian Process in LSO. However, it does not specify explicit training/validation/test splits for these datasets. |
| Hardware Specification | Yes | Table 1. Wall clock times in seconds (lower is better) for calculating LES, the Bayesian uncertainty and the Likelihood scores for a sample of 20 latent vectors on a single A100 GPU. |
| Software Dependencies | No | The paper mentions using the Adam optimizer (Kingma, 2014), the BoTorch package (Balandat et al., 2020), and PyTorch for automatic differentiation (Paszke et al., 2017). However, it does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | All models use a convolutional encoder based on the architecture proposed by Kusner et al. (2017), and were trained for 300 epochs using the Adam optimizer (Kingma, 2014) with a learning rate of 1e-3 and batch size of 256. We set λ = 0.05 for Expressions, λ = 0.1 for our SELFIES models, and λ = 0.5 for SMILES and the pre-trained SELFIES-VAE. For Ranolazine MPO with pre-trained SELFIES-VAE, we use λ = 0.1... The same step size is applied to all models within the same dataset: Expressions = 0.8, SMILES = 0.003, SELFIES = 0.03, and SELFIES pre-trained = 0.3. The LSO (L-BFGS) method has a single hyperparameter, the facet length, which is set to 5. For TuRBO, there are three primary hyperparameters: the initial length, which we set to 0.8, along with the success and failure tolerances, determining when to expand or shrink the trust region, set at 10 and 2, respectively. |
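The Algorithm 1 loop quoted in the Pseudocode row can be sketched as follows. This is a minimal toy implementation: the decoder G_θ, objective M, surrogate f̂, and acquisition step are all illustrative stand-ins (the paper fits a Gaussian-process surrogate and optimizes real acquisition functions via BoTorch; here a nearest-neighbour surrogate and random-candidate search keep the sketch dependency-free).

```python
# Minimal sketch of Algorithm 1 (Latent Space Optimization).
# All functions below are toy stand-ins, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def decoder(z):
    """Stand-in for the generator G_theta mapping latents to inputs."""
    return np.tanh(z)

def objective(x):
    """Stand-in for the true (expensive) objective M; maximized at x = 0.5."""
    return -np.sum((x - 0.5) ** 2, axis=-1)

def fit_surrogate(Z, y):
    """Step 1: fit a surrogate f_hat to the encoded dataset D_z.
    Here: a 1-nearest-neighbour predictor instead of a GP."""
    def f_hat(z):
        dists = np.linalg.norm(Z - z, axis=1)
        return y[np.argmin(dists)]
    return f_hat

def propose(f_hat, dim, n_candidates=256):
    """Step 2: acquisition stand-in for Eq. 3 — score random
    candidates with f_hat and return the arg max."""
    cands = rng.normal(size=(n_candidates, dim))
    scores = np.array([f_hat(c) for c in cands])
    return cands[np.argmax(scores)]

def lso(dim=2, n_init=20, T=10):
    Z = rng.normal(size=(n_init, dim))      # encoded dataset D_z
    y = objective(decoder(Z))
    for _ in range(T):
        f_hat = fit_surrogate(Z, y)         # step 1
        z_new = propose(f_hat, dim)         # step 2
        y_new = objective(decoder(z_new))   # step 3: decode, evaluate M
        Z = np.vstack([Z, z_new])           # update D_z
        y = np.append(y, y_new)
    return float(y.max())

best = lso()
print(best)
```

In the paper's setting the surrogate is a GP initialized with 500 or 1500 encoded data points, and LES enters as a constraint or penalty (weight λ) on the acquisition step; this sketch omits both to show only the unconstrained loop structure.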