Distilling Reinforcement Learning Algorithms for In-Context Model-Based Planning
Authors: Jaehyeon Son, Soochan Lee, Gunhee Kim
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DICP across a range of discrete and continuous environments, including Darkroom variants and Meta-World. Our results show that DICP achieves state-of-the-art performance while requiring significantly fewer environment interactions than baselines, which include both model-free counterparts and existing meta-RL methods. |
| Researcher Affiliation | Collaboration | Jaehyeon Son (Seoul National University), Soochan Lee (LG AI Research), Gunhee Kim (Seoul National University). EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Meta-Training Phase; Algorithm 2: Meta-Test Phase; Algorithm 3: Distillation for In-Context Planning (DICP) |
| Open Source Code | Yes | The code is available at https://github.com/jaehyeon-son/dicp. |
| Open Datasets | Yes | We evaluate our DICP framework across a diverse set of environments. For discrete environments, we use Darkroom, Dark Key-to-Door, and Darkroom-Permuted, which are well-established benchmarks for in-context RL studies (Laskin et al., 2023; Lee et al., 2023a; Huang et al., 2024). For continuous ones, we test on the Meta-World benchmark suite (Yu et al., 2019). REPRODUCIBILITY STATEMENT: We are committed to ensuring the full reproducibility of our research. Our open-sourced code enables easy replication of all experiments, and all datasets used in our experiments can be generated using our code. |
| Dataset Splits | Yes | Darkroom: The tasks are divided into disjoint training and test sets, with a 90:10 split. Dark Key-to-Door: The train-test split ratio is 95:5. Darkroom-Permuted: The train-test split ratio is the same as in Darkroom. Meta-World: We focus on ML1, which provides 50 pre-defined seeds for both training and test for each task. |
| Hardware Specification | No | The paper mentions "modern GPUs can process hundreds of TFLOPs per second" but does not specify the exact GPU models, CPUs, or other hardware used for their experiments. It lacks specific details such as model numbers or memory. |
| Software Dependencies | No | For the source algorithms (Schulman et al., 2017; Haarnoja et al., 2018), we use the implementation of Stable Baselines 3 (Raffin et al., 2021). ... We implement Transformers using the open-source Tiny Llama (Zhang et al., 2024). The paper names Stable Baselines 3 and Tiny Llama but does not pin specific version numbers for either dependency. |
| Experiment Setup | Yes | Appendix A.1 SOURCE ALGORITHMS (Table 3: Hyperparameters for the source PPO algorithm; Table 4: Hyperparameters for the source SAC algorithm). Appendix A.2 TRANSFORMERS (Table 5: Hyperparameters for Transformers). Appendix A.4 PLANNING (Table 6: Hyperparameters for planning). |
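The Dataset Splits row describes disjoint train/test task splits (e.g. Darkroom's 90:10). A minimal sketch of such a split, assuming a hypothetical task set of grid goal locations and a hypothetical `split_tasks` helper (this is not the authors' code, which generates the actual datasets in the linked repository):

```python
import random

def split_tasks(tasks, train_frac=0.9, seed=0):
    """Disjointly split a list of tasks into train/test sets (e.g. a 90:10 ratio)."""
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical example: 100 goal locations on a 10x10 grid
tasks = [(x, y) for x in range(10) for y in range(10)]
train, test = split_tasks(tasks, train_frac=0.9)
```

Fixing the seed keeps the split reproducible across runs; disjointness follows directly from slicing a single shuffled list.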