Steering Large Language Models between Code Execution and Textual Reasoning
Authors: Yongchao Chen, Harsh Jhamtani, Srinagesh Sharma, Chuchu Fan, Chi Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | However, based on our experiments on 7 existing popular methods for steering code/text generation in both single- and multi-turn settings with 14 tasks and 6 types of LLMs (including the new O1-preview), currently there is no optimal method to correctly steer LLMs to write code when needed. ... In this paper, we perform an in-depth investigation into the effectiveness of LLMs in steering between use of textual reasoning and code generation/execution across 14 diverse tasks requiring mathematical, verbal, and planning capabilities, using 6 types of LLMs (O1-preview, GPT-4o (Achiam et al., 2023), GPT-4o-mini, GPT-3.5 (Brown, 2020), Claude-sonnet (Anthropic, 2024), Mixtral-8x7B (Jiang et al., 2024)). Section 3 is titled "EXPERIMENTS". |
| Researcher Affiliation | Collaboration | Yongchao Chen (MIT / Harvard), Harsh Jhamtani (Microsoft), Srinagesh Sharma (Microsoft), Chuchu Fan (MIT), Chi Wang (Google DeepMind) |
| Pseudocode | No | The paper describes methods and experiments but does not include any structured pseudocode or algorithm blocks for its own methodology. |
| Open Source Code | Yes | Project Page, Datasets, and Codes are available at https://yongchao98.github.io/CodeSteer/. |
| Open Datasets | Yes | Project Page, Datasets, and Codes are available at https://yongchao98.github.io/CodeSteer/. ... We carry out the experiments on 14 tasks across domains of math (Number Multiplying, Game 24, GSM-Hard, MATH-Geometry, MATH-Count&Probability (Hendrycks et al., 2021; Gao et al., 2023; Yao et al., 2024; Zhou et al., 2023a)), logical reasoning (Date Understanding, Web of Lies, Logical Deduction, Navigate (Suzgun et al., 2022; Gao et al., 2023)), robot planning (Box Net (Chen et al., 2024c), Path Plan (Li et al., 2023a; Chen et al., 2024a)), and symbolic calculation (Letters, Box Lift (Chen et al., 2024c), Blocksworld (Valmeekam et al., 2024)). |
| Dataset Splits | No | The paper refers to using "original dataset" prompts for testing and states "All the testing tasks comprise over 300 trials," but it does not provide explicit training/test/validation splits for reproducibility, nor does it state that standard splits from the cited datasets were used in its experimental setup. |
| Hardware Specification | Yes | Score vs. Runtime (including both LLM inference and code execution time on one Intel 16-core CPU). |
| Software Dependencies | No | The code solutions of all tasks use Python as the default language and avoid special packages to ensure consistency across different execution environments. This statement mentions Python as the language but does not specify a version number or any other software dependencies with their versions. |
| Experiment Setup | Yes | Prompt for the summarizer of method 9 Code + Text + Sum. ... Prompt for method 10 Self-estimate Score ... To prevent infinite loops, we set a 30-second time limit for code execution. ... The system prompts for all methods are set to empty unless specified otherwise. |
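The 30-second execution limit noted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' released code: it assumes generated solutions are plain Python scripts run via the standard library's `subprocess` module, consistent with the paper's statement that solutions use Python and avoid special packages.

```python
import subprocess
import sys

def run_generated_code(code: str, timeout_s: float = 30.0) -> tuple[bool, str]:
    """Execute an LLM-generated Python snippet in a subprocess.

    Returns (success, output). The process is killed after `timeout_s`
    seconds to prevent infinite loops, mirroring the paper's 30-second
    code-execution limit.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return result.returncode == 0, result.stdout
    except subprocess.TimeoutExpired:
        return False, "timed out"

# A snippet that finishes quickly succeeds; an infinite loop is cut off.
ok, out = run_generated_code("print(sum(range(10)))")
```

A subprocess (rather than `exec` in-process) is the natural choice here because a timed-out snippet can be terminated cleanly without affecting the evaluation harness.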