Enhancing Language Model Agents using Diversity of Thoughts

Authors: Vijay Chandra Lingam, Behrooz Tehrani, Sujay Sanghavi, Gaurav Gupta, Sayan Ghosh, Linbo Liu, Jun Huan, Anoop Deoras

ICLR 2025

Reproducibility Variable | Result | LLM Response (evidence)
Research Type | Experimental | "Through extensive experiments on a suite of programming benchmarks (HumanEval, MBPP, and LeetCode Hard Gym) using a variety of LMs, DoT demonstrates up to a 10% improvement in Pass@1 while maintaining cost-effectiveness. Furthermore, DoT is modular by design. For instance, when the diverse reflection module of DoT is integrated with existing methods like Tree of Thoughts (ToT), we observe a significant 13% improvement on Game of 24 (one of the main benchmarks of ToT), highlighting the broad applicability and impact of our contributions across various reasoning tasks."
Researcher Affiliation | Industry | AWS AI Labs, Amazon
Pseudocode | Yes | Algorithm 1: DoT and DoT-bank Framework
Open Source Code | Yes | https://github.com/amazon-science/DiversityOfThoughts
Open Datasets | Yes | "To evaluate the programming capabilities of DoT and DoT-bank, we conduct experiments on HumanEval (Chen et al., 2021a), MBPP (Austin et al., 2021), and LeetCode Hard Gym (Shinn et al., 2023) benchmarks, using Pass@1 as our primary metric."
Dataset Splits | Yes | "During solution generation, only visible or synthetic test cases are used to ensure the validity of Pass@1. The final solution is then assessed on hidden test cases, assigning a Pass@1 score of 1 if all hidden tests are passed, and 0 otherwise." HumanEval: 164 problems, 3 visible test cases per problem.
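The scoring rule quoted above is simple to state precisely: a problem contributes 1 to Pass@1 only if its final solution passes every hidden test, and the benchmark score is the average over problems. A minimal sketch (function names are illustrative, not from the DoT codebase):

```python
# Illustrative Pass@1 scoring, per the rule quoted above:
# a problem scores 1 only if ALL hidden tests pass, else 0.

def pass_at_1(candidate, hidden_tests):
    """Score a single problem: 1 iff the candidate passes every hidden test."""
    return int(all(test(candidate) for test in hidden_tests))

def benchmark_pass_at_1(per_problem_scores):
    """Average the per-problem 0/1 scores over the whole benchmark."""
    return sum(per_problem_scores) / len(per_problem_scores)

# Toy example: the "solution" doubles its input; hidden tests probe it.
double = lambda x: 2 * x
hidden = [lambda f: f(0) == 0, lambda f: f(3) == 6, lambda f: f(-1) == -2]
print(pass_at_1(double, hidden))          # 1 (all hidden tests pass)
print(benchmark_pass_at_1([1, 0, 1, 1]))  # 0.75
```

Note that only visible or synthetic tests are available during generation; the hidden tests above are consulted once, at final evaluation.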
Hardware Specification | No | The paper evaluates various LLMs (e.g., Llama-3.1, Claude Sonnet 3.5, GPT-4o) via LLM APIs, but it does not specify the underlying hardware (e.g., GPU models, CPU types) used to run the experiments or interact with these APIs. The information is insufficient.
Software Dependencies | No | The paper names the "all-MiniLM-v6" model from Sentence Transformers and Cohere Embed-V3-English for embedding generation, but it provides no version numbers for these components, nor general software dependencies such as Python or PyTorch versions. The information is insufficient.
Experiment Setup | Yes | "Hyperparameters. We use the hyperparameters recommended by the authors of the respective baselines. For Reflexion, we set max-iterations to k = 3 for HumanEval and MBPP, and k = 5 for LeetCode Hard Gym. For LATS, k = 8 is used across all datasets. For DoT and DoT-bank, we set k = 3 for HumanEval and MBPP, and k = 5 for LeetCode Hard Gym. In Appendix A.2.4, we show that naively increasing k (the total number of generated reflections) in Reflexion has minimal impact on performance, highlighting that DoT variants are both more effective and cost-efficient." Note: unless specified, the number of retrieved trajectories is 1 for all DoT-bank experiments.
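The reported max-iteration settings can be summarized compactly. The dictionary layout below is a hypothetical convenience for reproduction, not taken from the released code; only the k values themselves come from the paper:

```python
# Max-iterations (k) per method and benchmark, as reported in the paper.
# The data-structure layout is illustrative, not from the DoT repository.
MAX_ITERATIONS = {
    "Reflexion": {"HumanEval": 3, "MBPP": 3, "LeetCode Hard Gym": 5},
    "LATS":      {"HumanEval": 8, "MBPP": 8, "LeetCode Hard Gym": 8},
    "DoT":       {"HumanEval": 3, "MBPP": 3, "LeetCode Hard Gym": 5},
    "DoT-bank":  {"HumanEval": 3, "MBPP": 3, "LeetCode Hard Gym": 5},
}

# Default number of retrieved trajectories for DoT-bank experiments.
NUM_RETRIEVED_TRAJECTORIES = 1

print(MAX_ITERATIONS["Reflexion"]["LeetCode Hard Gym"])  # 5
```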