Encoder of Thoughts: Enhancing Planning Ability in Language Agents Through Structural Embedding
Authors: Yuxiang Zhang, Jitao Sang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on multi-step reasoning and plan generation demonstrate that EoT significantly improves the performance of language agents. Evaluation results in the domains of mathematical reasoning and plan generation demonstrate three key contributions of EoT: a) Effectiveness: EoT demonstrated significant improvements in both mathematical reasoning and plan generation tasks. Specifically, it achieved a 5-10% performance increase on the GSM8K dataset and a 5-20% boost on the Blockworlds dataset. |
| Researcher Affiliation | Academia | ¹Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University; ²Peng Cheng Lab. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: EoT-BFS(s0, g, π, ρ, b, T) ... Algorithm 2: EoT-MCTS(s0, g, π, ρ, N, T) |
| Open Source Code | No | No explicit statement or link for open-source code was found in the paper. |
| Open Datasets | Yes | We evaluated the performance of the EoT framework using three datasets: GSM8K, MATH, and Blockworlds. The GSM8K dataset (Cobbe et al. 2021) consists of mathematical problems from elementary school exams... The MATH dataset (Hendrycks et al. 2021) includes more challenging competition-style problems... The Blockworlds dataset (Valmeekam et al. 2022) was used to evaluate model performance in plan generation tasks. |
| Dataset Splits | No | The paper describes how training data for EoT was generated and how test sets were constructed for some datasets, but it does not provide explicit train/validation/test dataset splits (percentages or counts) or refer to standard predefined splits for the experimental evaluation of the models. For the GSM8K dataset, we set the BFS depth to 5, sampled up to 3 instances per child node, and limited each layer to 4 nodes. We constructed the MATH test set by selecting 30 questions from each of the seven subjects. Except for the 100 samples used during the construction of the training data, the remaining data was split into six different difficulty levels according to the minimum steps required to complete the tasks. |
| Hardware Specification | Yes | The encoder training was conducted with a batch size of 32 on NVIDIA L20 GPUs. |
| Software Dependencies | No | The paper mentions using 'Llama3-3b-Instruct' and 'Qwen1.5-7B' as language model backbones, but does not provide specific version numbers for any programming languages, libraries, or frameworks used in the implementation. |
| Experiment Setup | Yes | The encoder training was conducted with a batch size of 32 on NVIDIA L20 GPUs. We used the Adam optimizer with a cosine-decaying learning rate, starting at 1 × 10−4. The training process lasted for 2000 steps. Each encoder was configured with a Graph Transformer featuring 2 layers, a hidden size of 512 dimensions, and 8 attention heads. During the evaluation of the BFS algorithm on the GSM8K and MATH datasets, we set the breadth limits b to 3 and 5, and the depth limits T to 5 and 7, respectively. Each action was sampled four times. For the MCTS configuration, we adopted the hyperparameters and reward settings proposed in RAP (Hao et al. 2023). |
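The Pseudocode and Experiment Setup rows describe a breadth-limited, depth-limited BFS over reasoning states (Algorithm 1, EoT-BFS(s0, g, π, ρ, b, T)). The paper does not release code, so the following is only a minimal sketch of that control flow: `propose` stands in for the LLM policy π that samples candidate next steps, and `score` stands in for the structural-embedding scorer ρ that ranks states; both names, and the exact pruning rule (keep the top-`b` states per layer), are assumptions for illustration.

```python
from typing import Callable, List, Optional, TypeVar

S = TypeVar("S")  # a reasoning state (e.g. a partial chain of thought)

def eot_bfs(
    s0: S,
    is_goal: Callable[[S], bool],       # checks whether a state satisfies the goal g
    propose: Callable[[S], List[S]],    # stand-in for the LLM policy pi: sample child states
    score: Callable[[S], float],        # stand-in for the embedding scorer rho: rank states
    b: int = 3,                         # breadth limit: states kept per layer (paper: 3 or 5)
    T: int = 5,                         # depth limit (paper: 5 or 7)
) -> Optional[S]:
    """Breadth-limited BFS: expand each layer, keep only the b best-scoring states."""
    frontier = [s0]
    for _ in range(T):
        children: List[S] = []
        for state in frontier:
            for child in propose(state):
                if is_goal(child):
                    return child
                children.append(child)
        if not children:
            return None  # search exhausted before reaching the goal
        # prune the layer to the b highest-scoring states before descending
        frontier = sorted(children, key=score, reverse=True)[:b]
    return None
```

With b and T set as in the GSM8K configuration (b=3, T=5), at most b states survive each layer, so the number of LLM calls per problem stays bounded regardless of branching factor.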