In-Context Reinforcement Learning From Suboptimal Historical Data
Authors: Juncheng Dong, Moyang Guo, Ethan X Fang, Zhuoran Yang, Vahid Tarokh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data. |
| Researcher Affiliation | Academia | 1Department of Electrical and Computer Engineering, Duke University, Durham, US 2Department of Biostatistics and Bioinformatics, Duke University, Durham, US 3Department of Statistics and Data Science, Yale University, New Haven, US. Correspondence to: Juncheng Dong <EMAIL>. |
| Pseudocode | Yes | Appendix I (Pseudocodes) provides Algorithm 1, "Pretraining of Decision Importance Transformer", and Algorithm 2, "Deployment of In-Context RL Models". |
| Open Source Code | No | The paper does not explicitly state that source code is released, nor does it provide a link to a code repository. It mentions using third-party tools like GPT-2 and Stable Baselines3 but not their own implementation code. |
| Open Datasets | No | The paper describes generating or constructing datasets for its experiments, such as sampling bandit features from Gaussian distributions or constructing datasets using historical trajectories generated by agents trained with Soft Actor Critic. It does not provide access information for any pre-existing public datasets. For instance, in Appendix E, it states: "Pretraining Dataset. For LB problems, we generate the feature function ϕ : A → R^d by sampling bandit features from independent Gaussian distributions...". And "Pretraining Datasets for Meta-World and Half-Cheetah. We construct the pretraining datasets using historical trajectories generated by agents trained with Soft Actor Critic (SAC)." |
| Dataset Splits | Yes | For Dark Room and Miniworld: "We follow Lee et al. (2024) to use the tasks on 80 out of the 100 goals for pretraining, and reserve the rest 20 for testing.", "For Miniworld, we collect 40k context datasets (32k for training and 8k for validation), 10k datasets for each of the four tasks corresponding to four possible box colors." For Meta-World and Half-Cheetah: "In Meta-World, we used 15 tasks to train and 5 to test. Similarly, for Half-Cheetah, we used 35 tasks to train and 5 to test." |
| Hardware Specification | Yes | Our experiments can be conducted on a single A6000 GPU. |
| Software Dependencies | No | The paper mentions software like GPT-2 and Stable Baselines3 but does not provide specific version numbers for these, or other key software components. For example: "We follow Lee et al. (2024) to choose GPT-2 (Radford et al., 2019) as the backbone" and "For model and training settings, we use the default implementation from Stable Baselines3 (Raffin et al., 2021)." |
| Experiment Setup | Yes | We set γ = 0.8 for all tasks. We choose η = 1 for all tasks. For all methods, we use the AdamW optimizer with a weight decay of 1e-4, a learning rate of 1e-3, and a batch size of 128. Our model is based on a causal GPT-2 architecture (Radford et al., 2019). It consists of 6 attention layers, each with a single attention head, and an embedding size of 256. |
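The reported experiment setup can be collected into a small configuration sketch for anyone attempting a reimplementation. The key names below are our own illustrative choices (the authors' code is not released); only the values come from the paper.

```python
# Hypothetical configuration reconstructing the reported DIT setup.
# Values are taken from the paper's experiment-setup description;
# the dictionary keys are illustrative naming, not the authors' code.

MODEL_CONFIG = {
    "backbone": "gpt2",   # causal GPT-2 architecture (Radford et al., 2019)
    "n_layer": 6,         # 6 attention layers
    "n_head": 1,          # single attention head per layer
    "n_embd": 256,        # embedding size
}

TRAIN_CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 1e-3,
    "weight_decay": 1e-4,
    "batch_size": 128,
    "gamma": 0.8,         # γ, same for all tasks
    "eta": 1.0,           # η, same for all tasks
}
```

Note that software versions (e.g. of Stable Baselines3 or the GPT-2 implementation) are not reported, so exact reproduction would still require pinning dependencies by trial.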