Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Step-DAD: Semi-Amortized Policy-Based Bayesian Experimental Design

Authors: Marcel Hedman, Desi R. Ivanova, Cong Guan, Tom Rainforth

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, Step-DAD consistently demonstrates superior decision-making and robustness compared with current state-of-the-art BED methods. ... We empirically evaluate Step-DAD on a range of design problems, comparing its performance against DAD to determine the additional EIG achieved by the step policy πs over the fully amortized policy π0. We further consider several other baselines for comparison.
Researcher Affiliation | Academia | Department of Statistics, University of Oxford. Correspondence to: Marcel Hedman <EMAIL>, Desi R. Ivanova <EMAIL>.
Pseudocode | Yes | Algorithm 1: Overview of Step-DAD
Input: generative model p(θ)p(y | θ, ξ); experimental budget T; refinement schedule T = {τ_0, τ_1, ..., τ_{K+1}} with τ_0 = 0, τ_{K+1} = T; training budgets {N_{τ_k}}_{k=1:K}
Output: dataset h_T = {(ξ_t, y_t)}_{t=1:T}
Offline stage (before the live experiment):
  Set h_0 = ∅
  while computational budget does not exceed N_0 do
    Train the fully amortized policy π_0 as in Foster et al. (2021)
  end
Online stage (during the live experiment):
  for k = 1, ..., K+1 do
    for τ_{k−1} < t ≤ τ_k do
      Compute design ξ_t = π_{τ_{k−1}}(h_{t−1})
      Run experiment ξ_t and observe outcome y_t
      Update the dataset: h_t = h_{t−1} ∪ {(ξ_t, y_t)}
    end
    if k = K+1 then return h_T end
    while computational budget does not exceed N_k do
      Fit a posterior p(θ | h_{τ_k})
      Fine-tune the policy π_{τ_k} by optimizing (6)
    end
  end
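The offline/online structure of Algorithm 1 can be sketched as a plain Python loop. This is hypothetical scaffolding, not the authors' implementation: `train_policy`, `run_experiment`, `finetune_policy`, and the toy design rule are stand-ins for the paper's components, chosen only to make the control flow concrete.

```python
import random

def train_policy(budget):
    """Offline stage: stand-in for training the fully amortized policy pi_0."""
    return lambda history: float(len(history))  # toy design rule, not the paper's

def run_experiment(design, rng):
    """Stand-in for deploying design xi_t and observing outcome y_t."""
    return design + rng.random()

def finetune_policy(policy, posterior, budget):
    """Stand-in for fine-tuning pi_{tau_k} against the fitted posterior."""
    return policy  # no-op in this sketch

def step_dad(T, schedule, budgets, seed=0):
    """Collect h_T = [(xi_t, y_t)] following a refinement schedule.

    schedule = [tau_0=0, tau_1, ..., tau_{K+1}=T]; budgets[k] plays N_k.
    """
    rng = random.Random(seed)
    policy = train_policy(budgets[0])            # offline stage
    history = []                                 # h_0 is empty
    for k in range(1, len(schedule)):            # online stage, k = 1..K+1
        for t in range(schedule[k - 1] + 1, schedule[k] + 1):
            design = policy(history)             # xi_t = pi_{tau_{k-1}}(h_{t-1})
            outcome = run_experiment(design, rng)
            history.append((design, outcome))    # h_t = h_{t-1} + {(xi_t, y_t)}
        if k == len(schedule) - 1:
            return history                       # return h_T after the last block
        posterior = history                      # stand-in for p(theta | h_{tau_k})
        policy = finetune_policy(policy, posterior, budgets[k])
    return history

h_T = step_dad(T=6, schedule=[0, 3, 6], budgets=[100, 50])
```

The point of the sketch is the schedule: designs inside block k are produced by the policy frozen at refinement point τ_{k−1}, and fine-tuning happens only between blocks.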
Open Source Code | No | The experiments were conducted using Python and open-source tools. PyTorch (Paszke et al., 2019) and Pyro (Bingham et al., 2018) were employed to implement all estimators and models. Additionally, MLflow (Zaharia et al., 2018) was utilized for experiment tracking and management. No explicit statement of code release for the methodology described in this paper is provided.
Open Datasets | Yes | We first consider the source location finding experiment from Foster et al. (2021)... Using the hyperbolic discounting model introduced in Mazur (1987) and as implemented by Vincent (2016)... We conclude our evaluation with the Constant Elasticity of Substitution (CES) model, a framework from behavioral economics to analyse the relative utility of two baskets of goods (Arrow et al., 1961).
Dataset Splits | No | The paper describes a generative modeling approach where experimental histories are simulated from a model, rather than using fixed pre-split datasets. For example, it mentions 'experimental histories simulated from the model p(θ)p(h_T | θ, π)'. Thus, it does not provide conventional train/test/validation splits for a static dataset.
Hardware Specification | Yes | Experiments were performed on two separate GPU servers: one with 4× GeForce RTX 3090 cards and 40 CPU cores; the other with 10× A40 cards and 52 CPU cores.
Software Dependencies | No | The experiments were conducted using Python and open-source tools. PyTorch (Paszke et al., 2019) and Pyro (Bingham et al., 2018) were employed to implement all estimators and models. Additionally, MLflow (Zaharia et al., 2018) was utilized for experiment tracking and management. Specific version numbers for these software components are not provided, only citations to the papers introducing them.
Experiment Setup | Yes | DAD was trained for 50K steps, Step-DAD for 2.5K. ... Table 9 (source location finding; DAD/Step-DAD pre-training): batch size 1024, number of negative samples 1023, number of gradient steps (default) 50K, learning rate (LR) 0.0001. Table 10 (source location finding; Step-DAD fine-tuning): number of theta rollouts 16, number of posterior samples 20K, fine-tuning learning rate (LR) 0.0001.
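The reported hyperparameters from Tables 9 and 10 can be collected into plain config dicts. The key names below are our own labels for the quoted values, not identifiers from the authors' code.

```python
# Source location finding hyperparameters, transcribed from the paper.

PRETRAIN_CONFIG = {  # Table 9: DAD / Step-DAD pre-training
    "batch_size": 1024,
    "num_negative_samples": 1023,
    "num_gradient_steps": 50_000,  # Step-DAD fine-tunes for only 2.5K steps
    "learning_rate": 1e-4,
}

FINETUNE_CONFIG = {  # Table 10: Step-DAD fine-tuning
    "num_theta_rollouts": 16,
    "num_posterior_samples": 20_000,
    "finetune_learning_rate": 1e-4,
}
```

Note the asymmetry the row highlights: the offline pre-training budget (50K gradient steps) is 20× the online fine-tuning budget (2.5K steps).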