G-Sim: Generative Simulations with Large Language Models and Gradient-Free Calibration

Authors: Samuel Holt, Max Ruiz Luyten, Antonin Berthon, Mihaela van der Schaar

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we evaluate G-Sim to verify that it can generate simulators with higher fidelity than existing discovery or data-driven world models. Our experiments use both GFO and SBI for calibration. Benchmark Environments. We evaluate G-Sim on three real-world-inspired simulation tasks that together capture (1) stochastic transitions, (2) rich, discrete state updates, and (3) partially observed states. Each task provides a dataset of state-action trajectories and a textual description of the environment, sampled from a carefully hand-designed simulator. [...] We evaluated all benchmark methods across the three environments, with results tabulated in Table 2. G-Sim consistently achieves the lowest Wasserstein distance on the held-out test data, indicating that its generated simulators model the ground-truth system dynamics with the highest fidelity. The performance gap is particularly pronounced in the complex Hospital Bed Scheduling task, where data-driven methods struggle significantly."
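The quoted evaluation scores simulators by the Wasserstein distance between simulated and held-out trajectories. For two equal-size 1-D empirical samples, the W1 distance reduces to the mean absolute difference of the sorted samples. A minimal standard-library sketch (function name and toy data are illustrative, not from the paper):

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size 1-D empirical samples.

    With the same number of points on each side, the optimal transport
    plan matches sorted order, so W1 is the mean absolute difference
    of the order statistics.
    """
    assert len(xs) == len(ys), "sketch assumes equal sample sizes"
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Toy example: simulated vs. held-out daily inventory levels.
sim = [20, 18, 15, 13, 10]
real = [20, 17, 16, 12, 11]
print(wasserstein_1d(sim, real))  # 0.8
```

In practice a library routine (e.g. `scipy.stats.wasserstein_distance`) handles unequal sample sizes and weights, but the sorted-sample form above is the same quantity for matched-size samples.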
Researcher Affiliation | Academia | "Samuel Holt*1, Max Ruiz Luyten*1, Antonin Berthon1, Mihaela van der Schaar1. 1University of Cambridge. Correspondence to: Samuel Holt <EMAIL>."
Pseudocode | Yes | "Algorithm 1 G-Sim: High-Level Pseudocode. Require: Domain knowledge K (text descriptions, constraints), training data D = {D(1), . . . , D(L)}, LLM with a prompt function PromptLLM(), calibration engine CalibrateParams() (either GFO or SBI), diagnostics function Diag(λ, ω; D), maximum iterations G, patience for early stopping. Ensure: A fully calibrated simulator (λ*, ω*) minimizing the diagnostic score. [...] In the following we detail the full methodology for G-Sim, including pseudocode, training procedures, prompt templates, and diagnostics-driven refinement. Our approach builds on the framework described in Section 3 of the main paper."
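The outer loop of the quoted Algorithm 1 can be sketched as follows. The three helper callables stand in for the LLM proposal step, the calibration engine (GFO or SBI), and the diagnostics function named in the pseudocode; all names and the toy score sequence are illustrative, not the authors' code:

```python
def g_sim(prompt_llm, calibrate_params, diag, data, G=5, patience=3):
    """Hypothetical sketch of Algorithm 1: propose a simulator structure,
    calibrate its parameters, score it, and iterate with early stopping."""
    best = None          # best (score, structure, params) so far
    stale = 0            # refinements without improvement
    feedback = None      # diagnostics fed back into the next prompt
    for _ in range(G):
        structure = prompt_llm(feedback)            # LLM proposes simulator code
        params = calibrate_params(structure, data)  # GFO or SBI calibration
        score = diag(structure, params, data)       # diagnostic score (lower = better)
        if best is None or score < best[0]:
            best, stale = (score, structure, params), 0
        else:
            stale += 1
            if stale >= patience:                   # early stopping
                break
        feedback = score
    return best

# Toy run with stubs: scores improve then plateau, triggering early stopping.
_scores = iter([3.0, 2.0, 2.5, 2.5, 2.5])
best = g_sim(lambda fb: "sim-v2",
             lambda s, d: {"rate": 1.0},
             lambda s, p, d: next(_scores),
             data=None)
```

With the hyperparameters reported later in this review (G = 5 refinement loops, patience 3), the toy run above stops after the fourth non-improving score and returns the iteration that scored 2.0.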
Open Source Code | Yes | "Code is available at https://github.com/samholt/generative-simulations and we provide a broader research group code base at https://github.com/vanderschaarlab/generative-simulations"
Open Datasets | No | "Benchmark Environments. We evaluate G-Sim on three real-world-inspired simulation tasks that together capture (1) stochastic transitions, (2) rich, discrete state updates, and (3) partially observed states. Each task provides a dataset of state-action trajectories and a textual description of the environment, sampled from a carefully hand-designed simulator. [...] We generate state-action trajectories by simulating over a fixed horizon T. Initial state: inventory = 20, pipeline empty, backlog = 0, and t = 0. Policy: for simplicity, an agent might follow a simple reorder policy (e.g., an (s, S) policy or a constant order) or an ε-greedy approach; alternatively, actions can be random to promote exploration. Stochastic demand: each day, demand is drawn from Poisson(λ_demand). After N = 100 simulated rollouts of length T = 60, we collect (state_t, action_t, state_{t+1}) tuples to form a dataset."
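The sampling procedure described in the quote can be sketched end to end. Poisson demand is drawn with Knuth's algorithm so the example stays standard-library only; the constant-order policy and the demand rate λ_demand = 4.0 are illustrative choices, not values from the paper:

```python
import math
import random

def poisson(lam, rng):
    """Knuth's algorithm for Poisson sampling (fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def rollout(T=60, lam_demand=4.0, order=4, seed=0):
    """One trajectory: start with inventory = 20 and backlog = 0, then
    apply a constant-order policy under Poisson daily demand."""
    rng = random.Random(seed)
    inventory, backlog = 20, 0
    traj = []
    for t in range(T):
        state = (inventory, backlog)
        demand = poisson(lam_demand, rng)
        inventory += order                      # constant reorder arrives
        served = min(inventory, backlog + demand)
        backlog = backlog + demand - served     # unmet demand carries over
        inventory -= served
        traj.append((state, order, (inventory, backlog)))
    return traj

# N = 100 rollouts of length T = 60, as in the quoted setup.
dataset = [rollout(seed=i) for i in range(100)]
```

Each element of `dataset` is a list of (state_t, action_t, state_{t+1}) tuples, matching the transition format the quote describes.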
Dataset Splits | Yes | "Sampling procedure for dataset generation. To create training and evaluation datasets: [...] We repeat this process for N initial seeds, thereby obtaining N state-action trajectories of length T. We then split these trajectories into training, validation, and test sets (e.g., N_train = 100, N_val = 100, N_test = 100). With each trajectory, we store the transitions (s_t, a_t, s_{t+1}) for subsequent fitting and analysis. [...] We collect the resulting day-by-day trajectories of the state and produce train/validation/test splits for model calibration and evaluation."
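A split along the lines described, with the quoted 100/100/100 sizes, might look like the following; partitioning whole trajectories (never individual transitions) keeps test transitions out of training. Function and variable names are illustrative:

```python
def split_trajectories(trajectories, n_train=100, n_val=100, n_test=100):
    """Partition whole trajectories into disjoint train/val/test sets."""
    assert len(trajectories) >= n_train + n_val + n_test
    train = trajectories[:n_train]
    val = trajectories[n_train:n_train + n_val]
    test = trajectories[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# 300 toy "trajectories" (here just integer ids for brevity).
train, val, test = split_trajectories(list(range(300)))
```

Splitting by trajectory rather than by transition matters here because consecutive (s_t, a_t, s_{t+1}) tuples within one rollout are strongly correlated.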
Hardware Specification | Yes | "All experiments and training were performed using a single Intel Core i9-12900K CPU @ 3.20 GHz, 64 GB RAM, with an Nvidia RTX 3090 GPU (24 GB)."
Software Dependencies | No | "Implementation with EvoTorch. We implement the GFO step using the GeneticAlgorithm class from EvoTorch. [...] We use the Neural Posterior Estimation (NPE) algorithm from the sbi library. [...] The code generated should include the complete step function body in NumPy, fully functional, no placeholders."
Experiment Setup | Yes | "Our key EvoTorch settings are: population size: 200; number of generations: 10; search operators: Simulated Binary Crossover (SBX) with tournament size 4, crossover rate 1.0, and η = 8, plus Gaussian mutation with standard deviation stdev = 0.03. [...] Simulation Budget. Based on our implementation, we use a simulation budget of 1,000 simulations to train the SBI posterior estimator. [...] Hyperparameters for G-Sim. In our experiments, we typically use a maximum of 5 refinement loops, a patience of 3 for early stopping, a population size of 200 in evolutionary search, 10 generations, and a mutation rate of 0.03 for parameter changes."
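The evolutionary search with the reported settings (population 200, 10 generations, tournament size 4, Gaussian mutation with stdev 0.03) can be illustrated with a pure-Python stand-in. For brevity, uniform crossover replaces EvoTorch's SBX operator here, so this is a sketch of the settings on a toy calibration target, not the EvoTorch API or the authors' fitness function:

```python
import random

def evolve(fitness, dim, popsize=200, generations=10, stdev=0.03,
           tournament=4, seed=0):
    """Minimize `fitness` over R^dim with a simple genetic algorithm:
    tournament selection, uniform crossover, Gaussian mutation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(popsize)]

    def pick():  # tournament selection of size 4, as in the reported settings
        return min(rng.sample(pop, tournament), key=fitness)

    for _ in range(generations):
        children = []
        for _ in range(popsize):
            a, b = pick(), pick()
            child = [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]
            child = [x + rng.gauss(0.0, stdev) for x in child]  # stdev = 0.03
            children.append(child)
        pop = children
    return min(pop, key=fitness)

# Toy calibration target: recover simulator parameters near (0.3, 0.7).
best = evolve(lambda p: (p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2, dim=2)
```

In G-Sim itself the fitness would be a simulation-based diagnostic (e.g. the Wasserstein distance between simulated and real trajectories), which is exactly why a gradient-free optimizer is used: that objective is not differentiable through the simulator.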