Encoder of Thoughts: Enhancing Planning Ability in Language Agents Through Structural Embedding

Authors: Yuxiang Zhang, Jitao Sang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on multi-step reasoning and plan generation demonstrate that EoT significantly improves the performance of language agents. Evaluation results in the domains of mathematical reasoning and plan generation demonstrate three key contributions of EoT: a) Effectiveness: EoT demonstrated significant improvements in both mathematical reasoning and plan generation tasks. Specifically, it achieved a 5-10% performance increase on the GSM8K dataset and a 5-20% boost on the Blockworlds dataset.
Researcher Affiliation | Academia | 1 Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University; 2 Peng Cheng Lab; EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: EoT-BFS(s0, g, π, ρ, b, T) ... Algorithm 2: EoT-MCTS(s0, g, π, ρ, N, T)
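Algorithm 1 appears in the paper only as pseudocode with the signature EoT-BFS(s0, g, π, ρ, b, T). As an illustration only, a minimal Python sketch of a breadth- and depth-limited search in this style follows. The callables `policy` (standing in for π) and `score` (standing in for ρ), and the toy usage, are assumptions, not the paper's implementation:

```python
def eot_bfs(s0, goal, policy, score, b, T):
    """Breadth-limited BFS over reasoning states (illustrative sketch).

    s0:     initial state
    goal:   predicate, goal(state) -> bool
    policy: expands a state into candidate successor states (stand-in for pi)
    score:  ranks candidate states, higher is better (stand-in for rho)
    b:      breadth limit (states kept per layer)
    T:      depth limit (maximum number of expansion layers)
    """
    frontier = [s0]
    for _ in range(T):
        candidates = []
        for state in frontier:
            for child in policy(state):
                if goal(child):
                    return child  # goal reached within depth limit
                candidates.append(child)
        # Prune: keep only the b highest-scoring states for the next layer.
        frontier = sorted(candidates, key=score, reverse=True)[:b]
        if not frontier:
            break
    return None  # no goal state found within b/T limits


# Toy usage: states are integers, the policy proposes +1/+2, goal is 6.
found = eot_bfs(0, lambda s: s == 6, lambda s: [s + 1, s + 2], lambda s: s, b=2, T=10)
```

In the paper, π would be the language-model policy proposing next reasoning steps and ρ the EoT encoder's structural scoring; here both are trivial stand-ins to keep the sketch self-contained.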
Open Source Code | No | No explicit statement or link to open-source code was found in the paper.
Open Datasets | Yes | We evaluated the performance of the EoT framework using three datasets: GSM8K, MATH, and Blockworlds. The GSM8K dataset (Cobbe et al. 2021) consists of mathematical problems from elementary school exams... The MATH dataset (Hendrycks et al. 2021) includes more challenging competition-style problems... The Blockworlds dataset (Valmeekam et al. 2022) was used to evaluate model performance in plan generation tasks.
Dataset Splits | No | The paper describes how the EoT training data was generated and how test sets were constructed for some datasets, but it does not provide explicit train/validation/test splits (percentages or counts) or refer to standard predefined splits for the experimental evaluation. Quoted excerpts: "For the GSM8K dataset, we set the BFS depth to 5, sampled up to 3 instances per child node, and limited each layer to 4 nodes." "We constructed the MATH test set by selecting 30 questions from each of the seven subjects." "Except for the 100 samples used during the construction of the training data, the remaining data was split into six different difficulty levels according to the minimum steps required to complete the tasks."
Hardware Specification | Yes | The encoder training was conducted with a batch size of 32 on NVIDIA L20 GPUs.
Software Dependencies | No | The paper mentions using 'Llama3-3b-Instruct' and 'Qwen1.5-7B' as language-model backbones, but it does not provide specific version numbers for any programming languages, libraries, or frameworks used in the implementation.
Experiment Setup | Yes | The encoder training was conducted with a batch size of 32 on NVIDIA L20 GPUs. We used the Adam optimizer with a cosine-decaying learning rate, starting at 1 × 10^-4. The training process lasted for 2000 steps. Each encoder was configured with a Graph Transformer featuring 2 layers, a hidden size of 512 dimensions, and 8 attention heads. During the evaluation of the BFS algorithm on the GSM8K and MATH datasets, we set the breadth limits b to 3 and 5, and the depth limits T to 5 and 7, respectively. Each action was sampled four times. For the MCTS configuration, we adopted the hyperparameters and reward settings proposed in RAP (Hao et al. 2023).
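The reported schedule (Adam, cosine-decaying learning rate starting at 1 × 10^-4, 2000 steps) can be sketched as follows. The paper does not state whether warmup or a minimum-rate floor is used, so this assumes the plain cosine-decay form that reaches (approximately) zero at the final step:

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
BASE_LR = 1e-4      # initial learning rate
TOTAL_STEPS = 2000  # training steps
BATCH_SIZE = 32     # batch size (unused by the schedule itself)

def cosine_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS):
    """Cosine-decaying learning rate: base_lr at step 0, ~0 at total_steps."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

At the midpoint (step 1000) this gives half the base rate, i.e. about 5 × 10^-5.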