Distilling Reinforcement Learning Algorithms for In-Context Model-Based Planning

Authors: Jaehyeon Son, Soochan Lee, Gunhee Kim

ICLR 2025

Reproducibility assessment — each entry lists the variable, the result, and the supporting LLM response:
Research Type — Experimental
LLM Response: "We evaluate DICP across a range of discrete and continuous environments, including Darkroom variants and Meta-World. Our results show that DICP achieves state-of-the-art performance while requiring significantly fewer environment interactions than baselines, which include both model-free counterparts and existing meta-RL methods."
Researcher Affiliation — Collaboration
LLM Response: Jaehyeon Son (Seoul National University), Soochan Lee (LG AI Research), Gunhee Kim (Seoul National University).
Pseudocode — Yes
LLM Response: Algorithm 1: Meta-Training Phase; Algorithm 2: Meta-Test Phase; Algorithm 3: Distillation for In-Context Planning (DICP).
Open Source Code — Yes
LLM Response: "The code is available at https://github.com/jaehyeon-son/dicp."
Open Datasets — Yes
LLM Response: "We evaluate our DICP framework across a diverse set of environments. For discrete environments, we use Darkroom, Dark Key-to-Door, and Darkroom-Permuted, which are well-established benchmarks for in-context RL studies (Laskin et al., 2023; Lee et al., 2023a; Huang et al., 2024). For continuous ones, we test on the Meta-World benchmark suite (Yu et al., 2019)." The paper's reproducibility statement adds: "We are committed to ensuring the full reproducibility of our research. Our open-sourced code enables easy replication of all experiments, and all datasets used in our experiments can be generated using our code."
Dataset Splits — Yes
LLM Response: Darkroom: tasks are divided into disjoint training and test sets with a 90:10 split. Dark Key-to-Door: 95:5 train-test split. Darkroom-Permuted: same split ratio as Darkroom. Meta-World: the evaluation focuses on ML1, which provides 50 pre-defined seeds for both training and testing of each task.
Hardware Specification — No
LLM Response: The paper remarks that "modern GPUs can process hundreds of TFLOPs per second" but never specifies the hardware used for its experiments: no GPU or CPU model numbers and no memory details.
Software Dependencies — No
LLM Response: "For the source algorithms (Schulman et al., 2017; Haarnoja et al., 2018), we use the implementation of Stable Baselines 3 (Raffin et al., 2021). ... We implement Transformers using the open-source Tiny Llama (Zhang et al., 2024)." The paper names Stable Baselines 3 and Tiny Llama but provides no version numbers for either.
Experiment Setup — Yes
LLM Response: Appendix A.1, Source Algorithms (Table 3: hyperparameters for the source PPO algorithm; Table 4: hyperparameters for the source SAC algorithm); Appendix A.2, Transformers (Table 5: hyperparameters for Transformers); Appendix A.4, Planning (Table 6: hyperparameters for planning).
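The disjoint 90:10 Darkroom task split reported above can be reproduced in spirit with a few lines. This is a hypothetical sketch, not the paper's code: the 9x9 grid size and the seed are assumptions, and Darkroom tasks are taken to be goal locations on the grid.

```python
import random

# Hypothetical sketch of a disjoint 90:10 task split for Darkroom.
# Assumption: a 9x9 grid whose cells are the candidate goal locations.
GRID = 9
tasks = [(x, y) for x in range(GRID) for y in range(GRID)]  # 81 tasks

rng = random.Random(0)            # fixed seed for a reproducible split
rng.shuffle(tasks)
n_train = int(0.9 * len(tasks))   # 90:10 train/test ratio -> 72 / 9
train_tasks, test_tasks = tasks[:n_train], tasks[n_train:]

assert not set(train_tasks) & set(test_tasks)  # splits are disjoint
```

Shuffling with a fixed seed before slicing keeps the split both random and reproducible, which is what a disjoint train/test task partition requires.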
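Algorithm 3 (Distillation for In-Context Planning) itself is not reproduced in this report. As a rough, generic illustration of the model-based planning idea it builds on — scoring candidate actions with a learned dynamics/reward model and acting greedily — here is a toy sketch. Every name (`predict_next`, `plan`, `N_ACTIONS`) is hypothetical, and the toy deterministic model stands in for the paper's Transformer world model conditioned on in-context history.

```python
# Toy sketch of model-based action selection, not the paper's algorithm.
N_ACTIONS = 4  # up, down, right, left on a grid

def predict_next(state, action):
    # Stand-in for a learned world model: predicts next state and reward.
    # Here reward is negative Manhattan distance to a fixed goal.
    goal = (2, 2)
    dx, dy = [(0, 1), (0, -1), (1, 0), (-1, 0)][action]
    nxt = (state[0] + dx, state[1] + dy)
    reward = -abs(nxt[0] - goal[0]) - abs(nxt[1] - goal[1])
    return nxt, reward

def plan(state):
    # One-step lookahead: choose the action with the best predicted reward.
    return max(range(N_ACTIONS), key=lambda a: predict_next(state, a)[1])
```

In the in-context setting, the world model would be a Transformer whose predictions improve as more environment interactions accumulate in its context; the planner's structure stays the same.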