Discriminator-Guided Embodied Planning for LLM Agent

Authors: Haofu Qian, Chenjia Bai, Jiatao Zhang, Fei Wu, Wei Song, Xuelong Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In experiments across different LLMs (GPT-4, Llama3-70B) in Science World and Virtual Home, our method obtains superior performance and better efficiency than previous methods. Our findings are outlined in Tab. 5.1, which elucidates the performance across thirty distinct task types."
Researcher Affiliation | Collaboration | 1 Zhejiang University; 2 Institute of Artificial Intelligence (TeleAI), China Telecom; 3 Yuhang Humanoid Robot Industry Innovation Center, Hangzhou, China; 4 Shenzhen Research Institute of Northwestern Polytechnical University
Pseudocode | No | The paper describes the framework and methods in detail, including an illustration of the DGAP framework in Figure 2, but does not present a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | "To ensure reproducibility, the code for our experiments is available at https://github.com/HaofuQian/DGAP."
Open Datasets | Yes | "To evaluate the effectiveness of DGAP and other baseline methods in complex embodied reasoning tasks, we employ the Science World (Wang et al., 2022) and Virtual Home (Puig et al., 2018) benchmarks."
Dataset Splits | No | The paper describes how specific data subsets (expert, random, augmented) were collected for training the discriminator and how tasks were selected for evaluation, but it does not give explicit train/validation/test splits as percentages or sample counts. For instance, it mentions "selecting only the first 10 variations for tasks... resulting in a total of 270 task variations" for evaluation, but no overall splits for model training and testing.
Hardware Specification | Yes | Two quotes, apparently for different training stages: "We employ four A100 GPUs for conducting this task, consuming eight hours." and "We employ four A100 GPUs for conducting this task, consuming around forty hours."
Software Dependencies | No | The paper names several models (FLAN-T5-large, RoBERTa, Llama3-70B, GPT-4), frameworks (Vanna), and optimizers (Adam, AdamW), but provides no version numbers for these components or for the underlying languages and libraries (e.g., Python, PyTorch).
Experiment Setup | Yes | "For the training, we employed the Adam optimizer with an epsilon value of 1e-06, a learning rate of 1e-4, and a batch size of 32. We conducted 3 training epochs comprising 25000 steps in total." "The model was initialized with RoBERTa parameters and optimized using the AdamW optimizer with a learning rate of 1e-5, a warmup rate of 0.1, and a batch size of 32." "Specifically, we adopt a threshold of 5 for Science World and 6 for Virtual Home, based on their respective training data distributions."
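The reported hyperparameters can be collected into a minimal configuration sketch. This is illustrative only: the variable names, the constant-after-warmup schedule, and the dict layout are assumptions, since the paper reports numbers but not code.

```python
# Hedged sketch of the training setup reported in the paper.
# All names below are illustrative, not taken from the DGAP codebase.

# First training run, per the paper: Adam, eps 1e-6, lr 1e-4, batch 32,
# 3 epochs / 25,000 total steps.
STAGE1_CFG = {
    "optimizer": "Adam",
    "eps": 1e-6,
    "lr": 1e-4,
    "batch_size": 32,
    "epochs": 3,
    "total_steps": 25_000,
}

# RoBERTa-initialized model, per the paper: AdamW, lr 1e-5, warmup rate 0.1,
# batch 32.
STAGE2_CFG = {
    "optimizer": "AdamW",
    "lr": 1e-5,
    "warmup_rate": 0.1,
    "batch_size": 32,
}

def linear_warmup_lr(step: int, total_steps: int, base_lr: float,
                     warmup_rate: float) -> float:
    """Linearly ramp the learning rate up to base_lr over the first
    warmup_rate fraction of training, then hold it constant.
    (One common reading of 'warmup rate 0.1'; the paper does not
    specify the post-warmup schedule.)"""
    warmup_steps = int(total_steps * warmup_rate)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Score thresholds for accepting an action, per benchmark (from the paper):
SCORE_THRESHOLD = {"ScienceWorld": 5, "VirtualHome": 6}
```

With `total_steps=25_000` and `warmup_rate=0.1`, the ramp covers the first 2,500 steps and the learning rate stays at `base_lr` afterward.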