Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models

Authors: Cong Lu, Shengran Hu, Jeff Clune

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our algorithm on a diverse range of language and vision-based tasks that require search and exploration. Across these tasks, IGE strongly exceeds classic reinforcement learning and graph search baselines, and also succeeds where prior state-of-the-art FM agents like Reflexion completely fail. Overall, Intelligent Go-Explore combines the tremendous strengths of FMs and the powerful Go-Explore algorithm, opening up a new frontier of research into creating more generally capable agents with impressive exploration capabilities. All our code is open-sourced at: https://github.com/conglu1997/intelligent-go-explore.
Researcher Affiliation | Academia | Cong Lu (1,2) EMAIL, Shengran Hu (1,2) EMAIL, Jeff Clune (1,2,3) EMAIL; 1 University of British Columbia, 2 Vector Institute, 3 Canada CIFAR AI Chair
Pseudocode | Yes | We illustrate our resultant algorithm at the top of Figure 1 and provide full pseudocode in Algorithm 1.
Open Source Code | Yes | All our code is open-sourced at: https://github.com/conglu1997/intelligent-go-explore.
Open Datasets | Yes | We first demonstrate the effectiveness of IGE in a mathematical reasoning task, Game of 24 (Yao et al., 2023a). The goal is to perform basic arithmetic operations (+, −, ×, /) starting from 4 numbers to obtain 24. ... Next, we show that IGE readily operates across multiple modalities in the BabyAI domains from Carta et al. (2023). ... Finally, we show IGE's ability to tackle tasks requiring long-horizon memory and planning, exploration, and commonsense in TextWorld (Côté et al., 2018), a classic text-based agent benchmark.
Dataset Splits | Yes | We evaluate IGE across 100 hard test problems in Figure 2.
Hardware Specification | No | We used GPT-4-Turbo for Game of 24 and GPT-4o for BabyAI and TextWorld. This was done purely to select the version of GPT-4 that was available and cheapest at the time of running the experiments. The version of GPT-4 is consistent per environment.
Software Dependencies | No | The paper mentions using specific versions of large language models (GPT-4-Turbo and GPT-4o), but does not specify other software dependencies such as the programming language (e.g., Python) or libraries (e.g., PyTorch, TensorFlow) and their version numbers.
Experiment Setup | Yes | Full hyperparameters are detailed in Appendix E. We list the hyperparameters for IGE in Table 6. We list the sampling parameters for GPT-4 (OpenAI, 2024) passed via the OpenAI API in Table 7.
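For readers unfamiliar with the Game of 24 benchmark quoted in the Open Datasets row (combine four numbers into 24 using +, −, ×, /), the task can be illustrated with a small brute-force solver. This sketch is not taken from the paper's open-sourced codebase; the function names and structure are our own illustrative assumptions.

```python
from operator import add, sub, mul, truediv

# Candidate binary operations for the Game of 24 (illustrative only).
OPS = {'+': add, '-': sub, '*': mul, '/': truediv}


def solve24(nums, target=24, eps=1e-6):
    """Return an expression string reaching `target` from the four
    numbers, or None if no combination of +, -, *, / works."""
    items = [(float(n), str(n)) for n in nums]
    return _search(items, target, eps)


def _search(items, target, eps):
    # Base case: one value left; check whether it hits the target.
    if len(items) == 1:
        val, expr = items[0]
        return expr if abs(val - target) < eps else None
    # Pick an ordered pair of remaining values, combine them with each
    # operation, and recurse on the shrunken list. Trying ordered pairs
    # covers non-commutative subtraction and division.
    for i in range(len(items)):
        for j in range(len(items)):
            if i == j:
                continue
            (a, ea), (b, eb) = items[i], items[j]
            rest = [items[k] for k in range(len(items)) if k not in (i, j)]
            for sym, op in OPS.items():
                if sym == '/' and abs(b) < eps:
                    continue  # avoid division by (near-)zero
                combined = (op(a, b), f'({ea}{sym}{eb})')
                found = _search(rest + [combined], target, eps)
                if found:
                    return found
    return None
```

A solvable instance such as `[6, 6, 6, 6]` yields an expression like `((6+6)+(6+6))`, while an unsolvable one such as `[1, 1, 1, 1]` returns `None`. Exhaustive search like this is cheap for four numbers; the point of the benchmark is that an agent must find such combinations through its own exploration rather than enumeration.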