Steering Large Language Models between Code Execution and Textual Reasoning

Authors: Yongchao Chen, Harsh Jhamtani, Srinagesh Sharma, Chuchu Fan, Chi Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental
"However, based on our experiments on 7 existing popular methods for steering code/text generation in both single- and multi-turn settings with 14 tasks and 6 types of LLMs (including the new O1-preview), currently there is no optimal method to correctly steer LLMs to write code when needed. ... In this paper, we perform an in-depth investigation into the effectiveness of LLMs in steering between use of textual reasoning and code generation/execution across 14 diverse tasks requiring mathematical, verbal, and planning capabilities, using 6 types of LLMs (O1-preview, GPT-4o (Achiam et al., 2023), GPT-4o-mini, GPT-3.5 (Brown, 2020), Claude-sonnet (Anthropic, 2024), Mixtral-8x7B (Jiang et al., 2024))." Section 3 is titled "EXPERIMENTS".
Researcher Affiliation | Collaboration
Yongchao Chen (MIT / Harvard, EMAIL); Harsh Jhamtani (Microsoft, EMAIL); Srinagesh Sharma (Microsoft, EMAIL); Chuchu Fan (MIT, EMAIL); Chi Wang (Google DeepMind, EMAIL)
Pseudocode | No
The paper describes methods and experiments but does not include any structured pseudocode or algorithm blocks for its own methodology.
Open Source Code | Yes
"Project Page, Datasets, and Codes are available at https://yongchao98.github.io/CodeSteer/."
Open Datasets | Yes
"Project Page, Datasets, and Codes are available at https://yongchao98.github.io/CodeSteer/. ... We carry out the experiments on 14 tasks across domains of math (Number Multiplying, Game 24, GSM-Hard, MATH-Geometry, MATH-Count&Probability (Hendrycks et al., 2021; Gao et al., 2023; Yao et al., 2024; Zhou et al., 2023a)), logical reasoning (Date Understanding, Web of Lies, Logical Deduction, Navigate (Suzgun et al., 2022; Gao et al., 2023)), robot planning (Box Net (Chen et al., 2024c), Path Plan (Li et al., 2023a; Chen et al., 2024a)), and symbolic calculation (Letters, Box Lift (Chen et al., 2024c), Blocksworld (Valmeekam et al., 2024))."
Dataset Splits | No
The paper refers to using "original dataset" prompts for testing and states "All the testing tasks comprise over 300 trials," but it does not provide explicit train/validation/test splits for reproducibility, nor does it state that the standard splits of the cited datasets were used in its experimental setup.
Hardware Specification | Yes
"Score vs. Runtime (including both LLM inference and code execution time on one Intel 16-core CPU)."
Software Dependencies | No
"The code solutions of all tasks use Python as the default language and avoid special packages to ensure consistency across different execution environments." This names Python as the language but specifies neither a version number nor any other software dependencies with versions.
Experiment Setup | Yes
"Prompt for the summarizer of method 9 Code + Text + Sum. ... Prompt for method 10 Self-estimate Score ... To prevent infinite loops, we set a 30-second time limit for code execution. ... The system prompts for all methods are set to empty unless specified otherwise."
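The quoted 30-second execution limit can be reproduced with a subprocess timeout. This is a minimal sketch, not the authors' code: the helper name `run_generated_code` and its return convention are assumptions; only the 30-second limit and the use of plain Python come from the paper.

```python
import subprocess
import sys

def run_generated_code(code: str, timeout_s: int = 30) -> str:
    """Execute a string of (LLM-generated) Python code with a hard time limit.

    Mirrors the paper's stated setup: a 30-second cap prevents
    infinite loops in generated code. Hypothetical helper, not from the paper.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],  # run in a fresh interpreter
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        # Return stdout on success, stderr (e.g. the traceback) on failure.
        return result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        return "Execution timed out (possible infinite loop)."
```

Running the code in a separate interpreter process (rather than `exec` in-process) is what makes a hard timeout possible: the child can simply be killed when the limit expires.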