Encoder of Thoughts: Enhancing Planning Ability in Language Agents Through Structural Embedding
Authors: Yuxiang Zhang, Jitao Sang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on multi-step reasoning and plan generation demonstrate that EoT significantly improves the performance of language agents. Evaluation results in the domains of mathematical reasoning and plan generation demonstrate three key contributions of EoT: a) Effectiveness: EoT demonstrated significant improvements in both mathematical reasoning and plan generation tasks. Specifically, it achieved a 5-10% performance increase on the GSM8K dataset and a 5-20% boost on the Blockworlds dataset. |
| Researcher Affiliation | Academia | ¹Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University; ²Peng Cheng Lab. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: EoT-BFS(s0, g, π, ρ, b, T) ... Algorithm 2: EoT-MCTS(s0, g, π, ρ, N, T) |
| Open Source Code | No | No explicit statement or link for open-source code was found in the paper. |
| Open Datasets | Yes | We evaluated the performance of the EoT framework using three datasets: GSM8K, MATH, and Blockworlds. The GSM8K dataset (Cobbe et al. 2021) consists of mathematical problems from elementary school exams... The MATH dataset (Hendrycks et al. 2021) includes more challenging competition-style problems... The Blockworlds dataset (Valmeekam et al. 2022) was used to evaluate model performance in plan generation tasks. |
| Dataset Splits | No | The paper describes how training data for EoT was generated and how test sets were constructed for some datasets, but it does not provide explicit train/validation/test dataset splits (percentages or counts) or refer to standard predefined splits for the experimental evaluation of the models. For the GSM8K dataset, we set the BFS depth to 5, sampled up to 3 instances per child node, and limited each layer to 4 nodes. We constructed the MATH test set by selecting 30 questions from each of the seven subjects. Except for the 100 samples used during the construction of the training data, the remaining data was split into six different difficulty levels according to the minimum steps required to complete the tasks. |
| Hardware Specification | Yes | The encoder training was conducted with a batch size of 32 on NVIDIA L20 GPUs. |
| Software Dependencies | No | The paper mentions using 'Llama3-3b-Instruct' and 'Qwen1.5-7B' as language model backbones, but does not provide specific version numbers for any programming languages, libraries, or frameworks used in the implementation. |
| Experiment Setup | Yes | The encoder training was conducted with a batch size of 32 on NVIDIA L20 GPUs. We used the Adam optimizer with a cosine-decaying learning rate, starting at 1 × 10−4. The training process lasted for 2000 steps. Each encoder was configured with a Graph Transformer featuring 2 layers, a hidden size of 512 dimensions, and 8 attention heads. During the evaluation of the BFS algorithm on the GSM8K and MATH datasets, we set the breadth limits b to 3 and 5, and the depth limits T to 5 and 7, respectively. Each action was sampled four times. For the MCTS configuration, we adopted the hyperparameters and reward settings proposed in RAP (Hao et al. 2023). |
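The Pseudocode and Experiment Setup rows describe a breadth-limited, depth-limited BFS over reasoning states (Algorithm 1, EoT-BFS(s0, g, π, ρ, b, T)). The paper does not release code, so the following is only a minimal sketch of that control flow: `propose` stands in for the LLM policy π that samples candidate next steps, and `score` stands in for the structural-embedding scorer ρ that ranks states; both names, and the exact pruning rule (keep the top-`b` states per layer), are assumptions for illustration.

```python
from typing import Callable, List, Optional, TypeVar

S = TypeVar("S")  # a reasoning state (e.g. a partial chain of thought)

def eot_bfs(
    s0: S,
    is_goal: Callable[[S], bool],       # checks whether a state satisfies the goal g
    propose: Callable[[S], List[S]],    # stand-in for the LLM policy pi: sample child states
    score: Callable[[S], float],        # stand-in for the embedding scorer rho: rank states
    b: int = 3,                         # breadth limit: states kept per layer (paper: 3 or 5)
    T: int = 5,                         # depth limit (paper: 5 or 7)
) -> Optional[S]:
    """Breadth-limited BFS: expand each layer, keep only the b best-scoring states."""
    frontier = [s0]
    for _ in range(T):
        children: List[S] = []
        for state in frontier:
            for child in propose(state):
                if is_goal(child):
                    return child
                children.append(child)
        if not children:
            return None  # search exhausted before reaching the goal
        # prune the layer to the b highest-scoring states before descending
        frontier = sorted(children, key=score, reverse=True)[:b]
    return None
```

With b and T set as in the GSM8K configuration (b=3, T=5), at most b states survive each layer, so the number of LLM calls per problem stays bounded regardless of branching factor.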