Reliable Algorithm Selection for Machine Learning-Guided Design

Authors: Clara Fannjiang, Ji Won Park

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first demonstrate that our method selects successful configurations with high probability, as guaranteed by theory, when the design and labeled densities are known. Next, we show that it still selects successful configurations more effectively than alternative methods when these densities are estimated. Code for these experiments is at https://github.com/clarafy/design-algorithm-selection. Two metrics are of interest: error rate and selection rate. Error rate is the empirical frequency at which a method selects a configuration that fails the success criterion (Eq. 1), over multiple trials of sampling designs from each configuration as well as held-out labeled data (for methods that require it). Selection rate is the empirical frequency, over those same trials, at which a method selects anything at all. A good method achieves a low error rate while maintaining a high selection rate, which may be challenging for ambitious success criteria.
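As a concrete illustration of the two metrics, the following sketch computes them from per-trial outcomes. The helper name and array conventions are our own, not from the paper:

```python
import numpy as np

def error_and_selection_rates(selected, succeeded):
    """Empirical error and selection rates over T trials.

    selected:  boolean array, True if the method selected any
               configuration in that trial.
    succeeded: boolean array, True if the selected configuration met
               the success criterion (ignored when nothing was selected).
    """
    selected = np.asarray(selected, dtype=bool)
    succeeded = np.asarray(succeeded, dtype=bool)
    # Error rate: fraction of trials where a failing configuration was selected.
    error_rate = np.mean(selected & ~succeeded)
    # Selection rate: fraction of trials where anything was selected at all.
    selection_rate = np.mean(selected)
    return error_rate, selection_rate
```

Note that trials where nothing is selected never count as errors, which is why an overly conservative method can achieve a zero error rate at the cost of a near-zero selection rate.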
Researcher Affiliation | Industry | Clara Fannjiang (1), Ji Won Park (1). (1) Prescient Design, Genentech. Correspondence to: Clara Fannjiang <EMAIL>.
Pseudocode | Yes |
Algorithm 1: Design algorithm selection by multiple hypothesis testing.
Algorithm 2: Prediction-powered p-value for testing H_λ: θ_λ := E_{Y∼P_{Y;λ}}[g(Y)] < τ.
Algorithm 3: Prediction-powered p-value for testing H_λ: θ_λ := E_{Y∼P_{Y;λ}}[g(Y)] < τ (finite-sample-valid).
Algorithm 4: PPMEANLB: prediction-powered confidence lower bound on θ_λ := E_{Y∼P_{Y;λ}}[g(Y)] (finite-sample-valid).
Algorithm 5: MEANLB: confidence lower bound on a mean (finite-sample-valid; Waudby-Smith & Ramdas, 2023).
Algorithm 6: Prediction-only p-value.
Algorithm 7: Conformal prediction-based method for design algorithm selection.
Algorithm 8: SPLITCONFORMALLB: split conformal lower bound for a design label.
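For intuition about the last of these, a generic one-sided split conformal lower bound can be sketched as below. This is a textbook construction in the spirit of Algorithm 8, not the paper's exact SPLITCONFORMALLB, and the function name and score choice are our assumptions:

```python
import numpy as np

def split_conformal_lower_bound(preds_cal, labels_cal, pred_test, alpha=0.1):
    # One-sided conformity scores: how far each calibration label
    # falls below its prediction.
    scores = np.asarray(preds_cal, dtype=float) - np.asarray(labels_cal, dtype=float)
    n = scores.size
    # Finite-sample-corrected (1 - alpha) empirical quantile of the scores.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.sort(scores)[min(k, n) - 1]
    # With probability >= 1 - alpha over the calibration draw,
    # the test label is at least this value.
    return pred_test - q
```

The (n + 1)-based quantile index is what makes the coverage guarantee hold exactly in finite samples, rather than only asymptotically.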
Open Source Code | Yes | Code for these experiments is at https://github.com/clarafy/design-algorithm-selection.
Open Datasets | Yes | For the labels, we used a data set that contains experimentally measured binding affinities for every sequence in X (Wu et al., 2016); that is, all 20^4 variants of protein GB1 at 4 specific sites.
Dataset Splits | Yes | For the prediction-only method and GMM Forecasts, which do not need held-out labeled data, we trained the binding affinity predictive model on 10k labeled sequences. [...] For our method and Calibrated Forecasts, which use held-out labeled data, we trained the predictive model on 5k labeled sequences and solved for q_λ, λ ∈ Λ. For each of T = 500 trials, we sampled n = 5k additional labeled sequences, which were used to run both methods, along with N = 1M designs from each q_λ.
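The held-out-data regime of this splitting protocol can be sketched as follows. The pool size comes from the 20^4 GB1 variants; keeping the training indices fixed across trials, drawing the held-out sample without replacement, and keeping it disjoint from the training set are our assumptions about details the quote leaves open:

```python
import numpy as np

rng = np.random.default_rng(0)

N_POOL = 20**4     # labeled pool: all GB1 variants at 4 sites
N_TRAIN = 5_000    # labels for training the predictive model
N_HELDOUT = 5_000  # fresh held-out labels drawn in each trial
T = 5              # trials (the paper uses T = 500; kept small here)

pool = rng.permutation(N_POOL)
train_idx = pool[:N_TRAIN]   # fixed across all trials
rest = pool[N_TRAIN:]        # remaining labeled sequences

# In each trial, draw a fresh held-out labeled sample from the remainder.
heldout_per_trial = [rng.choice(rest, size=N_HELDOUT, replace=False)
                     for _ in range(T)]
```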
Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments. It mentions performing simulations and training models, but gives no details about GPUs, CPUs, or other computing resources.
Software Dependencies | No | The paper mentions various methods and algorithms, such as Adam for optimization, isotonic regression, multinomial logistic regression-based density ratio estimation (MDRE), ridge regression, Gaussian mixture models (GMMs), and variational autoencoders (VAEs), but it does not give version numbers for any of the software libraries or frameworks used to implement them (e.g., Python, TensorFlow, PyTorch, or scikit-learn versions).
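For context on one of these techniques, density-ratio estimation by probabilistic classification, the idea underlying MDRE, can be sketched with a plain two-class logistic regression. This two-distribution simplification and all names below are ours, not the paper's multinomial implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_density_ratio(samples_p, samples_q):
    """Estimate x -> p(x)/q(x) by training a classifier to
    distinguish samples from p (class 1) and q (class 0)."""
    X = np.vstack([samples_p, samples_q])
    y = np.concatenate([np.ones(len(samples_p)), np.zeros(len(samples_q))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    def ratio(x):
        prob_p = clf.predict_proba(np.atleast_2d(x))[:, 1]
        # p(x)/q(x) ~= [P(p | x) / P(q | x)] * (n_q / n_p),
        # correcting for unequal sample sizes.
        return prob_p / (1.0 - prob_p) * (len(samples_q) / len(samples_p))

    return ratio
```

For example, with p = N(0, 1) and q = N(1, 1), the estimated ratio should exceed 1 near x = 0 and fall below 1 near x = 1, matching the true ratio exp(0.5 - x).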
Experiment Setup | Yes | The three predictive models were a ridge regression model, whose ridge regularization hyperparameter was set by leave-one-out cross-validation; an ensemble of three fully connected neural networks, each with two 100-unit hidden layers; and an ensemble of three convolutional neural networks, each with three convolutional layers with 32 filters, followed by two 100-unit hidden layers. Each model in both ensembles was trained for five epochs using Adam with a learning rate of 10^-3. For AdaLead (Sinai et al., 2020), the values of the threshold hyperparameter on the menu were κ ∈ {0.2, 0.15, 0.1, 0.05, 0.01}.
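The first of the three models, ridge regression tuned by leave-one-out cross-validation, maps directly onto scikit-learn's RidgeCV, which performs efficient LOO-CV by default (cv=None). The synthetic data and alpha grid below are our own illustration, not the paper's:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
w = rng.normal(size=8)
y = X @ w + 0.1 * rng.normal(size=200)

# RidgeCV's default cv=None uses an efficient closed-form leave-one-out
# cross-validation to choose the regularization strength from the grid.
model = RidgeCV(alphas=np.logspace(-4, 2, 13)).fit(X, y)
```

After fitting, `model.alpha_` holds the selected regularization strength.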