Distilling Reinforcement Learning Algorithms for In-Context Model-Based Planning
Authors: Jaehyeon Son, Soochan Lee, Gunhee Kim
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DICP across a range of discrete and continuous environments, including Darkroom variants and Meta-World. Our results show that DICP achieves state-of-the-art performance while requiring significantly fewer environment interactions than baselines, which include both model-free counterparts and existing meta-RL methods. |
| Researcher Affiliation | Collaboration | Jaehyeon Son (Seoul National University), Soochan Lee (LG AI Research), Gunhee Kim (Seoul National University). EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Meta-Training Phase; Algorithm 2: Meta-Test Phase; Algorithm 3: Distillation for In-Context Planning (DICP) |
| Open Source Code | Yes | The code is available at https://github.com/jaehyeon-son/dicp. |
| Open Datasets | Yes | We evaluate our DICP framework across a diverse set of environments. For discrete environments, we use Darkroom, Dark Key-to-Door, and Darkroom-Permuted, which are well-established benchmarks for in-context RL studies (Laskin et al., 2023; Lee et al., 2023a; Huang et al., 2024). For continuous ones, we test on the Meta-World benchmark suite (Yu et al., 2019). REPRODUCIBILITY STATEMENT: We are committed to ensuring the full reproducibility of our research. Our open-sourced code enables easy replication of all experiments, and all datasets used in our experiments can be generated using our code. |
| Dataset Splits | Yes | Darkroom: The tasks are divided into disjoint training and test sets, with a 90:10 split. Dark Key-to-Door: The train-test split ratio is 95:5. Darkroom-Permuted: The train-test split ratio is the same as in Darkroom. Meta-World: We focus on ML1, which provides 50 pre-defined seeds for both training and test for each task. |
| Hardware Specification | No | The paper mentions "modern GPUs can process hundreds of TFLOPs per second" but does not specify the exact GPU models, CPUs, or other hardware used for their experiments. It lacks specific details such as model numbers or memory. |
| Software Dependencies | No | For the source algorithms (Schulman et al., 2017; Haarnoja et al., 2018), we use the implementation of Stable Baselines 3 (Raffin et al., 2021). ... We implement Transformers using the open-source Tiny Llama (Zhang et al., 2024). The paper names Stable Baselines 3 and Tiny Llama but does not pin specific version numbers for either dependency. |
| Experiment Setup | Yes | Appendix A.1 SOURCE ALGORITHMS (Table 3: Hyperparameters for the source PPO algorithm; Table 4: Hyperparameters for the source SAC algorithm). Appendix A.2 TRANSFORMERS (Table 5: Hyperparameters for Transformers). Appendix A.4 PLANNING (Table 6: Hyperparameters for planning). |
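The Dataset Splits row describes disjoint train/test task splits (e.g. Darkroom's 90:10). A minimal sketch of such a split, assuming a hypothetical task set of grid goal locations and a hypothetical `split_tasks` helper (this is not the authors' code, which generates the actual datasets in the linked repository):

```python
import random

def split_tasks(tasks, train_frac=0.9, seed=0):
    """Disjointly split a list of tasks into train/test sets (e.g. a 90:10 ratio)."""
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical example: 100 goal locations on a 10x10 grid
tasks = [(x, y) for x in range(10) for y in range(10)]
train, test = split_tasks(tasks, train_frac=0.9)
```

Fixing the seed keeps the split reproducible across runs; disjointness follows directly from slicing a single shuffled list.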