Continuously Updating Digital Twins using Large Language Models
Authors: Harry Amad, Nicolás Astorga, Mihaela van der Schaar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now demonstrate the empirical performance of CALM-DT. Firstly, we examine simulations in fixed modelling environments, demonstrating state-of-the-art performance (6.1). We also conduct ablation studies to assess the contribution of different components of CALM-DT (6.2). We then showcase CALM-DT's unique ability to adapt to changes in modelling environment without re-design or re-training, demonstrating adaptation to a novel action (6.3), and incorporation of new data (6.4). |
| Researcher Affiliation | Academia | Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, United Kingdom. Correspondence to: Harry Amad <EMAIL>. |
| Pseudocode | No | The paper describes the methodology of CALM-DT in Section 4, detailing an iterative three-stage process: information retrieval, prompt formulation, and generation. Figure 2 provides a visual overview. However, there are no explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured code-like steps. |
| Open Source Code | No | The paper does not provide concrete access to source code for the CALM-DT methodology. It mentions the GitHub repository for a benchmark method (HDTwin) from Holt et al. (2024), but not for the authors' own work. |
| Open Datasets | Yes | For the CF setting, we use 1000 trajectories from the 2008-2013 UK CF registry for training... Since the CF data is not publicly accessible... We split the Hare-Lynx dataset... using the datasets from Bonnaffé & Coulson (2023). We split the Algae-Flagellate-Rotifer dataset... using the datasets from Bonnaffé & Coulson (2023). |
| Dataset Splits | Yes | For the CF setting, we use 1000 trajectories from the 2008-2013 UK CF registry for training, and assess three-year simulation performance. For the NSCLC setting, we generate 500 training samples... We generate validation and testing sets of 100 patients each. We split the Hare-Lynx dataset into nine samples of 10 years each; we set the first six samples as the training set and use the last three samples as the testing set... We split the Algae-Flagellate-Rotifer dataset into 10 samples of 10 days each; we set the first six samples as the training set and use the last four samples as the testing set. |
| Hardware Specification | No | The paper mentions using GPT-4o, GPT-4o Mini, or GPT-3.5 Turbo via the Azure OpenAI Service. While these are specific models/services, they do not specify the underlying hardware (e.g., GPU models, CPU types) used to run the experiments. |
| Software Dependencies | Yes | For CALM-DT, we use GPT-4o, accessed via the Azure OpenAI Service with version 2024-02-01... GPT-4o mini (version 2024-10-01-preview), or GPT-3.5 Turbo (version 2024-10-01-preview), all accessed via the Azure OpenAI Service. |
| Experiment Setup | Yes | For CALM-DT, we use GPT-4o... with the temperature τ = 0, and we set Kτ as:... We also set r = 1, l = 3, F = 3, c = 5... We conduct training for 8 epochs with a batch size of 16, learning rate of 5 × 10⁻⁵, and temperature of τ = 0.07, using the AdamW optimizer (Kingma & Ba, 2015) as implemented in PyTorch (Paszke et al., 2019)... DyNODE with a 3-layer MLP, with a hidden dimension of 128, with tanh activation functions, and Xavier weight initialisation... learning rate of 0.01, batch size of 1,000 and early stopping with a patience of 20 for 2,000 epochs. |
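To make the quoted DyNODE architecture details concrete, the following is a minimal dependency-free sketch of a 3-layer MLP with tanh activations and Xavier (Glorot) uniform weight initialisation, matching the hidden dimension of 128 from the Experiment Setup row. The state dimension is a hypothetical placeholder (the excerpt does not state it), and this is an illustrative reconstruction, not the authors' implementation.

```python
import math
import random

random.seed(0)

def xavier_uniform(fan_in, fan_out):
    # Xavier (Glorot) uniform init: weights drawn from U(-limit, limit),
    # where limit = sqrt(6 / (fan_in + fan_out)).
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[random.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

def linear(W, b, x):
    # Affine transform: one output per row of W.
    return [sum(w * v for w, v in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def mlp_forward(x, layers):
    # tanh between layers, linear output, as in the quoted DyNODE setup.
    for i, (W, b) in enumerate(layers):
        x = linear(W, b, x)
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x

STATE_DIM = 4   # hypothetical state dimension, not given in the excerpt
HIDDEN = 128    # hidden dimension from the quoted setup

shapes = [(STATE_DIM, HIDDEN), (HIDDEN, HIDDEN), (HIDDEN, STATE_DIM)]
layers = [(xavier_uniform(fi, fo), [0.0] * fo) for fi, fo in shapes]

out = mlp_forward([0.1] * STATE_DIM, layers)
print(len(out))  # 4: the network maps a state back to a same-sized derivative
```

In a neural-ODE model such as DyNODE, an MLP like this parameterises the state derivative, so input and output dimensions match; the quoted training settings (lr 0.01, batch size 1,000, early stopping with patience 20) would then drive an ODE-solver-in-the-loop fit.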