LLM-SR: Scientific Equation Discovery via Programming with Large Language Models

Authors: Parshin Shojaee, Kazem Meidani, Shashank Gupta, Amir Barati Farimani, Chandan Reddy

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate LLM-SR on four benchmark problems across diverse scientific domains (e.g., physics, biology), which we carefully designed to simulate the discovery process and prevent LLM recitation. Our results demonstrate that LLM-SR discovers physically accurate equations that significantly outperform state-of-the-art symbolic regression baselines, particularly in out-of-domain test settings.
Researcher Affiliation | Collaboration | Parshin Shojaee (1), Kazem Meidani (2), Shashank Gupta (3), Amir Barati Farimani (2), Chandan K. Reddy (1). (1) Virginia Tech, (2) Carnegie Mellon University, (3) Allen Institute for AI
Pseudocode | Yes | Algorithm 1: LLM-SR

  Input: LLM πθ, dataset D, problem T, T iterations, k in-context examples, b samples per prompt
  # Initialize population
  P0 ← InitPop()
  f*, s* ← null
  for t ← 1 to T - 1 do
      # Sample k examples from the experience buffer
      E ← {e_j}_{j=1..k}, e_j = SampleExp(P_{t-1})
      # Build a few-shot prompt with the sampled examples
      p ← MakeFewShotPrompt(E)
      # Sample b equation skeletons from the LLM
      F_t ← {f_j}_{j=1..b}, f_j ~ πθ(·|p)
      # Evaluate candidates and update the population
      for f ∈ F_t do
          s ← Score_T(f, D)
          if s > s* then f*, s* ← f, s
          P_t ← P_{t-1} ∪ {(f, s)}
      end
  end
  Output: f*, s*
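A minimal executable sketch of Algorithm 1, with the LLM backbone and the skeleton evaluator passed in as caller-supplied stand-ins (`sample_llm`, `score`, and `make_few_shot_prompt` are illustrative names, not from the paper's codebase):

```python
import random

def make_few_shot_prompt(examples):
    # Hypothetical prompt builder: list prior programs with their scores.
    return "\n".join(f"# score={s:.3f}\n{f}" for f, s in examples)

def llm_sr(sample_llm, score, dataset, T=100, k=2, b=4):
    """Sketch of the LLM-SR loop. `sample_llm(prompt)` returns one candidate
    equation program; `score(f, dataset)` fits its parameters and returns a
    fitness (higher is better)."""
    population = []                        # experience buffer P of (f, s) pairs
    best_f, best_s = None, float("-inf")
    for _ in range(T):
        # Sample up to k in-context examples from the experience buffer
        examples = random.sample(population, min(k, len(population)))
        prompt = make_few_shot_prompt(examples)
        # Draw b candidate skeletons, evaluate, and update the population
        for f in [sample_llm(prompt) for _ in range(b)]:
            s = score(f, dataset)
            if s > best_s:
                best_f, best_s = f, s
            population.append((f, s))
    return best_f, best_s

# Toy usage: "programs" are just integers; fitness peaks at 3.
cands = iter(range(100))
best_f, best_s = llm_sr(
    sample_llm=lambda p: next(cands) % 10,
    score=lambda f, d: -abs(f - 3),
    dataset=None, T=5)
```

The experience buffer doubles as both the population archive and the source of few-shot refinement examples, matching the structure of the pseudocode above.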
Open Source Code | Yes | Code and data are available: https://github.com/deep-symbolic-mathematics/LLM-SR
Open Datasets | Yes | The datasets used in this study include both publicly available and newly generated data. The material stress behavior analysis dataset (stress-strain) is publicly available under the CC BY 4.0 license and can be accessed at https://data.mendeley.com/datasets/rd6jm9tyb6/1. The remaining datasets (Oscillation 1, Oscillation 2, and E. coli Growth) were generated for this work and are released under the MIT License as part of the LLM-SR GitHub repository: https://github.com/deep-symbolic-mathematics/LLM-SR
Dataset Splits | Yes | To effectively evaluate the generalization capability of predicted equations, we employ a strategic data partitioning scheme. The simulation data is divided into three sets based on the trajectory time: (1) Training set, (2) In-domain validation set, and (3) Out-of-domain validation set. Specifically, we utilize the time interval T = [0, 20) to evaluate the out-of-domain generalization of the discovered equations.
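The time-based partitioning scheme can be sketched as below; the cut points (15 and 20) are illustrative placeholders, not the paper's exact boundaries:

```python
import numpy as np

# Hypothetical time-based split in the spirit of the paper's scheme:
# earlier trajectory times for training and in-domain validation,
# later times held out as the out-of-domain set.
t = np.linspace(0, 30, 3000)      # trajectory time stamps
x = np.sin(t)                     # toy simulated observable

train   = x[(t >= 0)  & (t < 15)]   # training set
val_in  = x[(t >= 15) & (t < 20)]   # in-domain validation
val_ood = x[t >= 20]                # out-of-domain: unseen time interval
```

Splitting by time rather than randomly ensures the out-of-domain set probes extrapolation beyond the trajectory region seen during equation discovery.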
Hardware Specification | Yes | Our experiments employ either Mixtral-8x7B (using 4 NVIDIA RTX 8000 GPUs with 48 GB memory each) or GPT-3.5-turbo (via the OpenAI API) as the language model backbone.
Software Dependencies | No | The paper mentions using Python, the `scipy` library for `numpy+BFGS` optimization, and `PyTorch` for `torch+Adam` optimization, as well as `Mixtral-8x7B` and `GPT-3.5-turbo` as LLM backbones. However, it does not provide specific version numbers for Python, `scipy`, or `PyTorch`.
Experiment Setup | Yes | In LLM-SR experiments, each iteration samples b = 4 equation skeletons per prompt at temperature τ = 0.8, optimizes parameters via numpy+BFGS or torch+Adam (with a 30-second timeout), and uses k = 2 in-context examples from the experience buffer for refinement. To control the length and complexity of the generated equations and prevent overparameterization, we cap the number of parameters (the length of the params vector) at 10 in all experiments. Evaluation is constrained by time and memory limits of T = 30 seconds and M = 2 GB, respectively.
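The numpy+BFGS parameter-optimization step can be sketched as follows: given an LLM-proposed equation skeleton with a free `params` vector (capped at length 10 in the paper), fit the constants by minimizing mean squared error on the training data. The skeleton below is an illustrative example, not one discovered by LLM-SR:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical equation skeleton with two free parameters.
def skeleton(x, params):
    a, b = params[:2]
    return a * x + b * x**2

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 1.5 * x - 0.7 * x**2          # toy ground-truth data

def mse(params):
    # Objective: mean squared error of the skeleton against the data.
    return np.mean((skeleton(x, params) - y) ** 2)

# BFGS recovers the constants of the skeleton from data.
res = minimize(mse, x0=np.zeros(2), method="BFGS")
```

In practice a per-candidate timeout (30 seconds in the paper) bounds how long this inner optimization may run before the candidate is scored or discarded.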