Controlling Equational Reasoning in Large Language Models with Prompt Interventions
Authors: Jordan Meadows, Marco Valentino, André Freitas
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments suggest that T5-Large can outperform the few-shot performance of GPT-4 on various evaluation sets generated via the framework. However, an extensive evaluation based on human analysis, template-based error detection, and text generation metrics reveals model weaknesses beyond what the reference-based metrics singularly describe. |
| Researcher Affiliation | Academia | Jordan Meadows (1), Marco Valentino (2), André Freitas (1, 2, 3); (1) Department of Computer Science, University of Manchester; (2) Idiap Research Institute; (3) National Biomarker Centre, CRUK-MI |
| Pseudocode | Yes | Algorithm 1: Derivation Generation. Input: vocabulary of symbols V, set of operations R. Output: ordered list of derivation steps D. Step 1: Initialize derivation D with a premise step s1 = (premise equation, annotation). |
| Open Source Code | Yes | 1Code, data, hyperparameters, and further experimental details available at: https://github.com/jmeadows17/deriving-equations-with-LLMs |
| Open Datasets | Yes | 1. We construct and release a dataset of 30k mathematically fine-grained prompt-derivation pairs spanning 18 operators, 155 wildcard (LaTeX) symbols, 4 targeted distribution shifts, up to 10 equations per derivation, and 160k steps, all developed using a symbolic data generation framework. ... 1Code, data, hyperparameters, and further experimental details available at: https://github.com/jmeadows17/deriving-equations-with-LLMs |
| Dataset Splits | Yes | Training: 15.3k; Static Test Set (In-distribution): 3.1k; Variable Renaming (VR): 2.9k; Expression Exchange (EE): 3.1k; Alternative Goal (AG): 3.1k; Step Removal (SR): 1.0k. (Table 1: Sizes for the various Derivation Generation datasets.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments. It mentions the models used (T5, GPT, LLaMA) and fine-tuning, but not the underlying hardware. |
| Software Dependencies | No | The paper mentions the use of Large Language Models (LLMs) and a symbolic engine for data generation but does not provide specific version numbers for any software libraries, frameworks, or tools used (e.g., Python, PyTorch, TensorFlow, or specific symbolic algebra packages with versions). |
| Experiment Setup | Yes | We adopt the same set of hyperparameters described in Meadows et al. (2024) using the following values: p_history=10, p_arity_0=5, p_renaming=1, p_arity_1=50, p_evaluate=50, p_arity_2=100, p_int_or_diff=1, p_subs=5. |
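The quoted Algorithm 1 only shows the initialization step. As a rough illustration of how such a generator could iterate from a premise step, here is a minimal Python sketch; the vocabulary, operation names, textual "rewriting", and annotation strings are all illustrative assumptions, not the authors' symbolic-engine implementation.

```python
import random

def generate_derivation(vocab, operations, max_steps=10, seed=0):
    """Sketch of Algorithm 1: build an ordered list D of derivation steps.

    Each step is an (equation, annotation) pair. The premise equation and the
    operation semantics below are placeholders -- the paper's framework applies
    operations via a symbolic engine rather than string manipulation.
    """
    rng = random.Random(seed)
    # Step 1 of Algorithm 1: initialize D with a premise step s1.
    premise = f"{rng.choice(vocab)} = {rng.choice(vocab)}"
    derivation = [(premise, "premise")]
    for _ in range(max_steps - 1):
        op = rng.choice(operations)
        prev_eq, _ = derivation[-1]
        lhs, rhs = prev_eq.split(" = ")
        # Apply the sampled operation to both sides (textual stand-in).
        derivation.append((f"{op}({lhs}) = {op}({rhs})", f"apply {op} to both sides"))
    return derivation

steps = generate_derivation(vocab=["x", "y", "z"],
                            operations=["diff", "exp"],
                            max_steps=3, seed=1)
for eq, note in steps:
    print(f"{note}: {eq}")
```

Each call yields an ordered list of steps starting from a premise, mirroring the Input/Output contract (V, R → D) stated in the pseudocode cell.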
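The hyperparameter names in the Experiment Setup row read like relative sampling weights for the symbolic generator. A hedged sketch of how they might be collected and normalised follows; the dictionary name, the normalisation step, and the weight interpretation are assumptions for illustration only, while the keys and values are taken verbatim from the paper.

```python
# Values quoted from the Experiment Setup row (Meadows et al. 2024 settings).
# Treating them as relative sampling weights is an assumption.
GENERATION_WEIGHTS = {
    "p_history": 10, "p_arity_0": 5, "p_renaming": 1, "p_arity_1": 50,
    "p_evaluate": 50, "p_arity_2": 100, "p_int_or_diff": 1, "p_subs": 5,
}

def normalise(weights):
    """Convert raw weights to probabilities summing to 1."""
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}

probs = normalise(GENERATION_WEIGHTS)
```

Under this reading, arity-2 operations would be sampled most often and renaming or integration/differentiation choices least often.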