Controlling Equational Reasoning in Large Language Models with Prompt Interventions
Authors: Jordan Meadows, Marco Valentino, André Freitas
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments suggest that T5-Large can outperform the few-shot performance of GPT-4 on various evaluation sets generated via the framework. However, an extensive evaluation based on human analysis, template-based error detection, and text generation metrics reveals model weaknesses beyond what the reference-based metrics singularly describe. |
| Researcher Affiliation | Academia | Jordan Meadows (1), Marco Valentino (2), André Freitas (1, 2, 3); (1) Department of Computer Science, University of Manchester; (2) Idiap Research Institute; (3) National Biomarker Centre, CRUK-MI |
| Pseudocode | Yes | Algorithm 1: Derivation Generation. Input: vocabulary of symbols V, set of operations R. Output: ordered list of derivation steps D. Step 1: Initialize derivation D with a premise step s1 = (premise equation, annotation). |
| Open Source Code | Yes | 1Code, data, hyperparameters, and further experimental details available at: https://github.com/jmeadows17/deriving-equations-with-LLMs |
| Open Datasets | Yes | 1. We construct and release a dataset of 30k mathematically fine-grained prompt-derivation pairs spanning 18 operators, 155 wildcard (LaTeX) symbols, 4 targeted distribution shifts, up to 10 equations per derivation, and 160k steps, all developed using a symbolic data generation framework. ... 1Code, data, hyperparameters, and further experimental details available at: https://github.com/jmeadows17/deriving-equations-with-LLMs |
| Dataset Splits | Yes | Training: 15.3k; Static Test Set (In-distribution): 3.1k; Variable Renaming (VR): 2.9k; Expression Exchange (EE): 3.1k; Alternative Goal (AG): 3.1k; Step Removal (SR): 1.0k. (Table 1: Sizes for the various Derivation Generation datasets.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments. It mentions the models used (T5, GPT, LLaMA) and fine-tuning, but not the underlying hardware. |
| Software Dependencies | No | The paper mentions the use of Large Language Models (LLMs) and a symbolic engine for data generation but does not provide specific version numbers for any software libraries, frameworks, or tools used (e.g., Python, PyTorch, TensorFlow, or specific symbolic algebra packages with versions). |
| Experiment Setup | Yes | We adopt the same set of hyperparameters described in Meadows et al. (2024) using the following values: p_history=10, p_arity_0=5, p_renaming=1, p_arity_1=50, p_evaluate=50, p_arity_2=100, p_int_or_diff=1, p_subs=5. |
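The quoted Algorithm 1 only shows the initialization step. As a rough illustration of how such a generator could iterate from a premise step, here is a minimal Python sketch; the vocabulary, operation names, textual "rewriting", and annotation strings are all illustrative assumptions, not the authors' symbolic-engine implementation.

```python
import random

def generate_derivation(vocab, operations, max_steps=10, seed=0):
    """Sketch of Algorithm 1: build an ordered list D of derivation steps.

    Each step is an (equation, annotation) pair. The premise equation and the
    operation semantics below are placeholders -- the paper's framework applies
    operations via a symbolic engine rather than string manipulation.
    """
    rng = random.Random(seed)
    # Step 1 of Algorithm 1: initialize D with a premise step s1.
    premise = f"{rng.choice(vocab)} = {rng.choice(vocab)}"
    derivation = [(premise, "premise")]
    for _ in range(max_steps - 1):
        op = rng.choice(operations)
        prev_eq, _ = derivation[-1]
        lhs, rhs = prev_eq.split(" = ")
        # Apply the sampled operation to both sides (textual stand-in).
        derivation.append((f"{op}({lhs}) = {op}({rhs})", f"apply {op} to both sides"))
    return derivation

steps = generate_derivation(vocab=["x", "y", "z"],
                            operations=["diff", "exp"],
                            max_steps=3, seed=1)
for eq, note in steps:
    print(f"{note}: {eq}")
```

Each call yields an ordered list of steps starting from a premise, mirroring the Input/Output contract (V, R → D) stated in the pseudocode cell.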
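The hyperparameter names in the Experiment Setup row read like relative sampling weights for the symbolic generator. A hedged sketch of how they might be collected and normalised follows; the dictionary name, the normalisation step, and the weight interpretation are assumptions for illustration only, while the keys and values are taken verbatim from the paper.

```python
# Values quoted from the Experiment Setup row (Meadows et al. 2024 settings).
# Treating them as relative sampling weights is an assumption.
GENERATION_WEIGHTS = {
    "p_history": 10, "p_arity_0": 5, "p_renaming": 1, "p_arity_1": 50,
    "p_evaluate": 50, "p_arity_2": 100, "p_int_or_diff": 1, "p_subs": 5,
}

def normalise(weights):
    """Convert raw weights to probabilities summing to 1."""
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}

probs = normalise(GENERATION_WEIGHTS)
```

Under this reading, arity-2 operations would be sampled most often and renaming or integration/differentiation choices least often.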