Steering Large Language Models between Code Execution and Textual Reasoning

Authors: Yongchao Chen, Harsh Jhamtani, Srinagesh Sharma, Chuchu Fan, Chi Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental
"However, based on our experiments on 7 existing popular methods for steering code/text generation in both single- and multi-turn settings with 14 tasks and 6 types of LLMs (including the new O1-preview), currently there is no optimal method to correctly steer LLMs to write code when needed. ... In this paper, we perform an in-depth investigation into the effectiveness of LLMs in steering between use of textual reasoning and code generation/execution across 14 diverse tasks requiring mathematical, verbal, and planning capabilities, using 6 types of LLMs (O1-preview, GPT-4o (Achiam et al., 2023), GPT-4o-mini, GPT-3.5 (Brown, 2020), Claude-sonnet (Anthropic, 2024), Mixtral-8x7B (Jiang et al., 2024))." Section 3 is titled "EXPERIMENTS".
Researcher Affiliation | Collaboration
Yongchao Chen (MIT / Harvard, EMAIL); Harsh Jhamtani (Microsoft, EMAIL); Srinagesh Sharma (Microsoft, EMAIL); Chuchu Fan (MIT, EMAIL); Chi Wang (Google DeepMind, EMAIL)
Pseudocode | No
The paper describes methods and experiments but does not include any structured pseudocode or algorithm blocks for its own methodology.
Open Source Code | Yes
"Project Page, Datasets, and Codes are available at https://yongchao98.github.io/CodeSteer/."
Open Datasets | Yes
"Project Page, Datasets, and Codes are available at https://yongchao98.github.io/CodeSteer/. ... We carry out the experiments on 14 tasks across domains of math (Number Multiplying, Game 24, GSM-Hard, MATH-Geometry, MATH-Count&Probability (Hendrycks et al., 2021; Gao et al., 2023; Yao et al., 2024; Zhou et al., 2023a)), logical reasoning (Date Understanding, Web of Lies, Logical Deduction, Navigate (Suzgun et al., 2022; Gao et al., 2023)), robot planning (Box Net (Chen et al., 2024c), Path Plan (Li et al., 2023a; Chen et al., 2024a)), and symbolic calculation (Letters, Box Lift (Chen et al., 2024c), Blocksworld (Valmeekam et al., 2024))."
Dataset Splits | No
The paper refers to using "original dataset" prompts for testing and states "All the testing tasks comprise over 300 trials," but it does not provide explicit train/validation/test splits for reproducibility, nor does it state that the standard splits of the cited datasets were used in its experimental setup.
Hardware Specification | Yes
"Score vs. Runtime (including both LLM inference and code execution time on one Intel 16-core CPU)."
Software Dependencies | No
"The code solutions of all tasks use Python as the default language and avoid special packages to ensure consistency across different execution environments." This names Python as the language but specifies neither a version number nor any other software dependencies with versions.
Experiment Setup | Yes
"Prompt for the summarizer of method 9 Code + Text + Sum. ... Prompt for method 10 Self-estimate Score ... To prevent infinite loops, we set a 30-second time limit for code execution. ... The system prompts for all methods are set to empty unless specified otherwise."
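The quoted 30-second execution limit can be reproduced with a subprocess timeout. This is a minimal sketch, not the authors' code: the helper name `run_generated_code` and its return convention are assumptions; only the 30-second limit and the use of plain Python come from the paper.

```python
import subprocess
import sys

def run_generated_code(code: str, timeout_s: int = 30) -> str:
    """Execute a string of (LLM-generated) Python code with a hard time limit.

    Mirrors the paper's stated setup: a 30-second cap prevents
    infinite loops in generated code. Hypothetical helper, not from the paper.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],  # run in a fresh interpreter
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        # Return stdout on success, stderr (e.g. the traceback) on failure.
        return result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        return "Execution timed out (possible infinite loop)."
```

Running the code in a separate interpreter process (rather than `exec` in-process) is what makes a hard timeout possible: the child can simply be killed when the limit expires.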