Discriminator-Guided Embodied Planning for LLM Agent
Authors: Haofu Qian, Chenjia Bai, Jiatao Zhang, Fei Wu, Wei Song, Xuelong Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments across different LLMs (GPT-4, Llama3-70B) in Science World and Virtual Home, our method obtains superior performance and better efficiency than previous methods. Our findings are outlined in Tab. 5.1, which elucidates the performance across thirty distinct task types. |
| Researcher Affiliation | Collaboration | 1) Zhejiang University; 2) Institute of Artificial Intelligence (Tele AI), China Telecom; 3) Yuhang Humanoid Robot Industry Innovation Center, Hangzhou, China; 4) Shenzhen Research Institute of Northwestern Polytechnical University |
| Pseudocode | No | The paper describes the framework and methods in detail, including an illustration of the DGAP framework in Figure 2, but does not present a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | To ensure reproducibility, the code for our experiments is available at https://github.com/HaofuQian/DGAP. |
| Open Datasets | Yes | To evaluate the effectiveness of DGAP and other baseline methods in complex embodied reasoning tasks, we employ the Science World (Wang et al., 2022) and Virtual Home (Puig et al., 2018) benchmark. |
| Dataset Splits | No | The paper describes how specific data subsets (expert, random, augmented) were collected and used for training the discriminator and how tasks were selected for evaluation, but it does not provide explicit train/test/validation dataset splits in terms of percentages or sample counts for the main experimental setup. For instance, it mentions 'selecting only the first 10 variations for tasks... resulting in a total of 270 task variations' for evaluation, but not overall dataset splits for training and testing models. |
| Hardware Specification | Yes | We employ four A100 GPUs for conducting this task, consuming eight hours. We employ four A100 GPUs for conducting this task, consuming around forty hours. |
| Software Dependencies | No | The paper mentions several models (FLAN-T5-large, RoBERTa, Llama3-70B, GPT-4), frameworks (Vanna), and optimizers (Adam, AdamW), but does not provide specific version numbers for these software components or programming languages/libraries (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | For the training, we employed the Adam optimizer with an epsilon value of 1e-06, a learning rate of 1e-4, and a batch size of 32. We conducted 3 training epochs comprising 25000 steps in total. The model was initialized with RoBERTa parameters and optimized using the AdamW optimizer with a learning rate of 1e-5, a warmup rate of 0.1, and a batch size of 32. Specifically, we adopt a threshold of 5 for Science World and 6 for Virtual Home, based on their respective training data distributions. |
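The hyperparameters quoted in the Experiment Setup row can be collected into a configuration sketch. This is a minimal illustration, not the paper's released code: the class and function names below are hypothetical, and only the numeric values (optimizer choices, learning rates, batch sizes, warmup ratio, step count, thresholds) come from the excerpt above. The warmup schedule assumes the common linear-warmup-then-linear-decay shape, which the paper does not spell out.

```python
from dataclasses import dataclass

@dataclass
class ScoreModelConfig:
    """Training settings reported for the score model (Adam)."""
    optimizer: str = "Adam"
    eps: float = 1e-6
    lr: float = 1e-4
    batch_size: int = 32
    epochs: int = 3
    total_steps: int = 25000

@dataclass
class DiscriminatorConfig:
    """Settings reported for the RoBERTa-initialized discriminator (AdamW)."""
    optimizer: str = "AdamW"
    lr: float = 1e-5
    warmup_ratio: float = 0.1
    batch_size: int = 32
    # Score thresholds reported per benchmark.
    threshold_scienceworld: int = 5
    threshold_virtualhome: int = 6

def lr_lambda(step: int, total_steps: int, warmup_ratio: float) -> float:
    """Learning-rate multiplier: linear warmup over warmup_ratio * total_steps,
    then linear decay to zero (an assumed schedule shape)."""
    warmup = int(warmup_ratio * total_steps)
    if step < warmup:
        return step / warmup
    return max(0.0, (total_steps - step) / (total_steps - warmup))
```

With `total_steps=25000` and `warmup_ratio=0.1`, the multiplier rises from 0 to 1 over the first 2500 steps and decays linearly afterwards; it would typically be passed to a scheduler such as PyTorch's `LambdaLR`.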