SysCaps: Language Interfaces for Simulation Surrogates of Complex Systems

Authors: Patrick Emami, Zhaonan Li, Saumya Sinha, Truc Nguyen

ICLR 2025

Reproducibility assessment — each item below gives the variable, the result, and the supporting quote or rationale drawn from the paper:
Research Type: Experimental. Evidence: "Our experiments on two real-world simulators of buildings and wind farms show that our SysCaps-augmented surrogates have better accuracy on held-out systems than traditional methods while enjoying new generalization abilities, such as handling semantically related descriptions of the same test system. Additional experiments also highlight the potential of SysCaps to unlock language-driven design space exploration and to regularize training through prompt augmentation."
Researcher Affiliation: Academia. Patrick Emami, Saumya Sinha, and Truc Nguyen (National Renewable Energy Laboratory); Zhaonan Li (Arizona State University).
Pseudocode: No. The paper describes the methods and model architecture using narrative text, figures (Figure 1, Figure 2), and mathematical formulations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. Evidence: "As there are no standard benchmarks for comparing surrogate modeling performance for CES, we open-source all code and data at https://github.com/NREL/SysCaps to facilitate future work."
Open Datasets: Yes. Evidence: "Building stock simulation data: For the main experiments in Section 6.1-6.4 we train building stock surrogate models for the building energy simulator EnergyPlus (Crawley et al., 2001). ... We use commercial buildings from the Buildings-900K dataset (Emami et al., 2023b). ... This experiment uses the Wind Farm Wake Modeling Dataset (Ramos et al., 2023), made with the FLORIS simulator..."
Dataset Splits: Yes. Evidence: "Our training set is comprised of 330K buildings, and we use 100 buildings for validation and 6K held-out buildings for testing. We also reserved a held-out set of 10K buildings for RFE. ... In this dataset, there are only 500 unique system configurations (split 3:1:1 for train, val, test), although each configuration is simulated under 500 distinct atmospheric conditions."
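The 3:1:1 split of the 500 wind-farm configurations can be sketched as below. This is an illustrative reconstruction, not the authors' released code; the function name, seed, and use of configuration IDs are assumptions.

```python
import random

def split_configs(config_ids, seed=0):
    """Split system configuration IDs 3:1:1 into train/val/test.

    Splitting at the configuration level (rather than per simulation)
    keeps all 500 atmospheric conditions of a configuration in one split.
    """
    ids = list(config_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n = len(ids)
    n_train = 3 * n // 5  # 3 parts out of 5
    n_val = n // 5        # 1 part out of 5
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

train, val, test = split_configs(range(500))  # 300 / 100 / 100 configurations
```

Because each configuration appears in exactly one split, held-out test systems are genuinely unseen at training time.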
Hardware Specification: Yes. Evidence: "Generating these datasets with llama-2-7b-chat used 1.5K GPU hours on a cluster with 16 NVIDIA A100-40GB GPUs. ... All models are trained with a single NVIDIA A100-40GB GPU."
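As a sanity check on the reported compute, 1.5K GPU hours spread across 16 GPUs corresponds to just under four days of wall-clock time, assuming all 16 GPUs run in parallel at full utilization (an assumption; the paper does not state the scheduling):

```python
gpu_hours = 1500   # reported llama-2-7b-chat caption-generation cost
num_gpus = 16      # reported cluster size (A100-40GB)

wall_clock_hours = gpu_hours / num_gpus   # 93.75 hours
wall_clock_days = wall_clock_hours / 24   # about 3.9 days
print(wall_clock_hours, round(wall_clock_days, 1))
```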
Software Dependencies: No. The paper mentions several software components, frameworks, and models such as llama-2-7b-chat (Touvron et al., 2023), BERT (Devlin et al., 2018), DistilBERT (Sanh et al., 2019), LightGBM (Ke et al., 2017), and Optuna (Akiba et al., 2019). However, it does not specify exact version numbers for these or other crucial software dependencies (e.g., Python, PyTorch/TensorFlow) required for reproducibility.
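The missing version information could be captured with a short environment report. The helper below is a generic sketch (the function name and package list are illustrative, not from the paper); it looks up installed versions rather than inventing the versions the paper omitted.

```python
import sys
from importlib import metadata

def environment_report(packages):
    """Record the Python and installed package versions for a reproducibility appendix."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report

# e.g. environment_report(["torch", "transformers", "lightgbm", "optuna"])
```

Emitting such a report alongside released code would resolve this reproducibility gap without any manual bookkeeping.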
Experiment Setup: Yes. Evidence: "We carefully tune the hyperparameters of all models (details in Appendix A.2). See Table 6 for hyperparameter sweep details for the buildings experiments and Table 7 for hyperparameter sweep details for the wind farm experiments."
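The paper tunes with Optuna; as a dependency-free illustration of the same sweep pattern, here is a minimal random search. The sweep space and ranges below are invented for the example and are not the values from Table 6 or Table 7.

```python
import random

# Hypothetical sweep space, not the paper's actual ranges.
SWEEP = {
    "lr": [1e-4, 3e-4, 1e-3],
    "hidden_dim": [128, 256, 512],
    "dropout": [0.0, 0.1, 0.3],
}

def random_search(objective, n_trials=20, seed=0):
    """Sample configs from SWEEP and return (best_loss, best_config)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in SWEEP.items()}
        loss = objective(cfg)  # e.g. validation loss of a trained surrogate
        if best is None or loss < best[0]:
            best = (loss, cfg)
    return best

# Toy objective standing in for "train a surrogate, return validation loss".
best_loss, best_cfg = random_search(lambda c: c["lr"] + c["dropout"])
```

A framework like Optuna adds pruning and smarter samplers on top of this basic loop, but the trial structure is the same.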