Enhancing Language Model Agents using Diversity of Thoughts

Authors: Vijay Chandra Lingam, Behrooz Tehrani, Sujay Sanghavi, Gaurav Gupta, Sayan Ghosh, Linbo Liu, Jun Huan, Anoop Deoras

ICLR 2025

Reproducibility Variable | Result | LLM Response (evidence)
Research Type | Experimental | "Through extensive experiments on a suite of programming benchmarks (HumanEval, MBPP, and LeetCode Hard Gym) using a variety of LMs, DoT demonstrates up to a 10% improvement in Pass@1 while maintaining cost-effectiveness. Furthermore, DoT is modular by design. For instance, when the diverse reflection module of DoT is integrated with existing methods like Tree of Thoughts (ToT), we observe a significant 13% improvement on Game of 24 (one of the main benchmarks of ToT), highlighting the broad applicability and impact of our contributions across various reasoning tasks."
Researcher Affiliation | Industry | AWS AI Labs, Amazon
Pseudocode | Yes | Algorithm 1: DoT and DoT-bank Framework
Open Source Code | Yes | https://github.com/amazon-science/DiversityOfThoughts
Open Datasets | Yes | "To evaluate the programming capabilities of DoT and DoT-bank, we conduct experiments on HumanEval (Chen et al., 2021a), MBPP (Austin et al., 2021), and LeetCode Hard Gym (Shinn et al., 2023) benchmarks, using Pass@1 as our primary metric."
Dataset Splits | Yes | "During solution generation, only visible or synthetic test cases are used to ensure the validity of Pass@1. The final solution is then assessed on hidden test cases, assigning a Pass@1 score of 1 if all hidden tests are passed, and 0 otherwise." HumanEval: 164 problems, 3 visible test cases per problem.
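The scoring rule quoted above is simple to state precisely: a problem contributes 1 to Pass@1 only if its final solution passes every hidden test, and the benchmark score is the average over problems. A minimal sketch (function names are illustrative, not from the DoT codebase):

```python
# Illustrative Pass@1 scoring, per the rule quoted above:
# a problem scores 1 only if ALL hidden tests pass, else 0.

def pass_at_1(candidate, hidden_tests):
    """Score a single problem: 1 iff the candidate passes every hidden test."""
    return int(all(test(candidate) for test in hidden_tests))

def benchmark_pass_at_1(per_problem_scores):
    """Average the per-problem 0/1 scores over the whole benchmark."""
    return sum(per_problem_scores) / len(per_problem_scores)

# Toy example: the "solution" doubles its input; hidden tests probe it.
double = lambda x: 2 * x
hidden = [lambda f: f(0) == 0, lambda f: f(3) == 6, lambda f: f(-1) == -2]
print(pass_at_1(double, hidden))          # 1 (all hidden tests pass)
print(benchmark_pass_at_1([1, 0, 1, 1]))  # 0.75
```

Note that only visible or synthetic tests are available during generation; the hidden tests above are consulted once, at final evaluation.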
Hardware Specification | No | The paper evaluates various LLMs (e.g., Llama-3.1, Claude Sonnet 3.5, GPT-4o) via LLM APIs, but it does not specify the underlying hardware (e.g., GPU models, CPU types) used to run the experiments or interact with these APIs. The information is insufficient.
Software Dependencies | No | The paper names the "all-MiniLM-v6" model from Sentence Transformers and Cohere Embed-V3-English for embedding generation, but it provides no version numbers for these components, nor general software dependencies such as Python or PyTorch versions. The information is insufficient.
Experiment Setup | Yes | "Hyperparameters. We use the hyperparameters recommended by the authors of the respective baselines. For Reflexion, we set max-iterations to k = 3 for HumanEval and MBPP, and k = 5 for LeetCode Hard Gym. For LATS, k = 8 is used across all datasets. For DoT and DoT-bank, we set k = 3 for HumanEval and MBPP, and k = 5 for LeetCode Hard Gym. In Appendix A.2.4, we show that naively increasing k (the total number of generated reflections) in Reflexion has minimal impact on performance, highlighting that DoT variants are both more effective and cost-efficient." Note: unless specified, the number of retrieved trajectories is 1 for all DoT-bank experiments.
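The reported max-iteration settings can be summarized compactly. The dictionary layout below is a hypothetical convenience for reproduction, not taken from the released code; only the k values themselves come from the paper:

```python
# Max-iterations (k) per method and benchmark, as reported in the paper.
# The data-structure layout is illustrative, not from the DoT repository.
MAX_ITERATIONS = {
    "Reflexion": {"HumanEval": 3, "MBPP": 3, "LeetCode Hard Gym": 5},
    "LATS":      {"HumanEval": 8, "MBPP": 8, "LeetCode Hard Gym": 8},
    "DoT":       {"HumanEval": 3, "MBPP": 3, "LeetCode Hard Gym": 5},
    "DoT-bank":  {"HumanEval": 3, "MBPP": 3, "LeetCode Hard Gym": 5},
}

# Default number of retrieved trajectories for DoT-bank experiments.
NUM_RETRIEVED_TRAJECTORIES = 1

print(MAX_ITERATIONS["Reflexion"]["LeetCode Hard Gym"])  # 5
```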