In-Context Reinforcement Learning From Suboptimal Historical Data
Authors: Juncheng Dong, Moyang Guo, Ethan X Fang, Zhuoran Yang, Vahid Tarokh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data. |
| Researcher Affiliation | Academia | 1Department of Electrical and Computer Engineering, Duke University, Durham, US 2Department of Biostatistics and Bioinformatics, Duke University, Durham, US 3Department of Statistics and Data Science, Yale University, New Haven, US. Correspondence to: Juncheng Dong <EMAIL>. |
| Pseudocode | Yes | Appendix I (Pseudocodes) provides Algorithm 1, "Pretraining of Decision Importance Transformer", and Algorithm 2, "Deployment of In-Context RL Models". |
| Open Source Code | No | The paper does not explicitly state that source code is released, nor does it provide a link to a code repository. It mentions using third-party tools like GPT-2 and Stable Baselines3 but not their own implementation code. |
| Open Datasets | No | The paper describes generating or constructing datasets for its experiments, such as sampling bandit features from Gaussian distributions or constructing datasets using historical trajectories generated by agents trained with Soft Actor Critic. It does not provide access information for any pre-existing public datasets. For instance, in Appendix E, it states: "Pretraining Dataset. For LB problems, we generate the feature function ϕ : A → R^d by sampling bandit features from independent Gaussian distributions...". And "Pretraining Datasets for Meta-World and Half-Cheetah. We construct the pretraining datasets using historical trajectories generated by agents trained with Soft Actor Critic (SAC)." |
| Dataset Splits | Yes | For Dark Room and Miniworld: "We follow Lee et al. (2024) to use the tasks on 80 out of the 100 goals for pretraining, and reserve the rest 20 for testing.", "For Miniworld, we collect 40k context datasets (32k for training and 8k for validation), 10k datasets for each of the four tasks corresponding to four possible box colors." For Meta-World and Half-Cheetah: "In Meta-World, we used 15 tasks to train and 5 to test. Similarly, for Half-Cheetah, we used 35 tasks to train and 5 to test." |
| Hardware Specification | Yes | Our experiments can be conducted on a single A6000 GPU. |
| Software Dependencies | No | The paper mentions software like GPT-2 and Stable Baselines3 but does not provide specific version numbers for these, or other key software components. For example: "We follow Lee et al. (2024) to choose GPT-2 (Radford et al., 2019) as the backbone" and "For model and training settings, we use the default implementation from Stable Baselines3 (Raffin et al., 2021)." |
| Experiment Setup | Yes | We set γ = 0.8 for all tasks. We choose η = 1 for all tasks. For all methods, we use the AdamW optimizer with a weight decay of 1e-4, a learning rate of 1e-3, and a batch size of 128. Our model is based on a causal GPT-2 architecture (Radford et al., 2019). It consists of 6 attention layers, each with a single attention head, and an embedding size of 256. |
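The reported experiment setup can be collected into a small configuration sketch for anyone attempting a reimplementation. The key names below are our own illustrative choices (the authors' code is not released); only the values come from the paper.

```python
# Hypothetical configuration reconstructing the reported DIT setup.
# Values are taken from the paper's experiment-setup description;
# the dictionary keys are illustrative naming, not the authors' code.

MODEL_CONFIG = {
    "backbone": "gpt2",   # causal GPT-2 architecture (Radford et al., 2019)
    "n_layer": 6,         # 6 attention layers
    "n_head": 1,          # single attention head per layer
    "n_embd": 256,        # embedding size
}

TRAIN_CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 1e-3,
    "weight_decay": 1e-4,
    "batch_size": 128,
    "gamma": 0.8,         # γ, same for all tasks
    "eta": 1.0,           # η, same for all tasks
}
```

Note that software versions (e.g. of Stable Baselines3 or the GPT-2 implementation) are not reported, so exact reproduction would still require pinning dependencies by trial.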