Discriminator-Guided Embodied Planning for LLM Agent

Authors: Haofu Qian, Chenjia Bai, Jiatao Zhang, Fei Wu, Wei Song, Xuelong Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In experiments across different LLMs (GPT-4, Llama3-70B) in Science World and Virtual Home, our method obtains superior performance and better efficiency than previous methods. Our findings are outlined in Tab. 5.1, which elucidates the performance across thirty distinct task types."
Researcher Affiliation | Collaboration | 1 Zhejiang University; 2 Institute of Artificial Intelligence (TeleAI), China Telecom; 3 Yuhang Humanoid Robot Industry Innovation Center, Hangzhou, China; 4 Shenzhen Research Institute of Northwestern Polytechnical University
Pseudocode | No | The paper describes the framework and methods in detail, including an illustration of the DGAP framework in Figure 2, but does not present a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | "To ensure reproducibility, the code for our experiments is available at https://github.com/HaofuQian/DGAP."
Open Datasets | Yes | "To evaluate the effectiveness of DGAP and other baseline methods in complex embodied reasoning tasks, we employ the Science World (Wang et al., 2022) and Virtual Home (Puig et al., 2018) benchmarks."
Dataset Splits | No | The paper describes how specific data subsets (expert, random, augmented) were collected for training the discriminator and how tasks were selected for evaluation, but it does not give explicit train/validation/test splits as percentages or sample counts. For instance, it mentions "selecting only the first 10 variations for tasks... resulting in a total of 270 task variations" for evaluation, but no overall splits for model training and testing.
Hardware Specification | Yes | Two quotes, apparently for different training stages: "We employ four A100 GPUs for conducting this task, consuming eight hours." and "We employ four A100 GPUs for conducting this task, consuming around forty hours."
Software Dependencies | No | The paper names several models (FLAN-T5-large, RoBERTa, Llama3-70B, GPT-4), frameworks (Vanna), and optimizers (Adam, AdamW), but provides no version numbers for these components or for the underlying languages and libraries (e.g., Python, PyTorch).
Experiment Setup | Yes | "For the training, we employed the Adam optimizer with an epsilon value of 1e-06, a learning rate of 1e-4, and a batch size of 32. We conducted 3 training epochs comprising 25000 steps in total." "The model was initialized with RoBERTa parameters and optimized using the AdamW optimizer with a learning rate of 1e-5, a warmup rate of 0.1, and a batch size of 32." "Specifically, we adopt a threshold of 5 for Science World and 6 for Virtual Home, based on their respective training data distributions."
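The reported hyperparameters can be collected into a minimal configuration sketch. This is illustrative only: the variable names, the constant-after-warmup schedule, and the dict layout are assumptions, since the paper reports numbers but not code.

```python
# Hedged sketch of the training setup reported in the paper.
# All names below are illustrative, not taken from the DGAP codebase.

# First training run, per the paper: Adam, eps 1e-6, lr 1e-4, batch 32,
# 3 epochs / 25,000 total steps.
STAGE1_CFG = {
    "optimizer": "Adam",
    "eps": 1e-6,
    "lr": 1e-4,
    "batch_size": 32,
    "epochs": 3,
    "total_steps": 25_000,
}

# RoBERTa-initialized model, per the paper: AdamW, lr 1e-5, warmup rate 0.1,
# batch 32.
STAGE2_CFG = {
    "optimizer": "AdamW",
    "lr": 1e-5,
    "warmup_rate": 0.1,
    "batch_size": 32,
}

def linear_warmup_lr(step: int, total_steps: int, base_lr: float,
                     warmup_rate: float) -> float:
    """Linearly ramp the learning rate up to base_lr over the first
    warmup_rate fraction of training, then hold it constant.
    (One common reading of 'warmup rate 0.1'; the paper does not
    specify the post-warmup schedule.)"""
    warmup_steps = int(total_steps * warmup_rate)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Score thresholds for accepting an action, per benchmark (from the paper):
SCORE_THRESHOLD = {"ScienceWorld": 5, "VirtualHome": 6}
```

With `total_steps=25_000` and `warmup_rate=0.1`, the ramp covers the first 2,500 steps and the learning rate stays at `base_lr` afterward.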