Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Representation Matters: Offline Pretraining for Sequential Decision Making
Authors: Mengjiao Yang, Ofir Nachum
ICML 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a variety of experiments utilizing standard offline RL datasets, we find that the use of pretraining with unsupervised learning objectives can dramatically improve the performance of policy learning algorithms that otherwise yield mediocre performance on their own. Extensive ablations further provide insights into what components of these unsupervised objectives (e.g., reward prediction, continuous or discrete representations, pretraining or finetuning) are most important and in which settings. |
| Researcher Affiliation | Industry | Google Research, Google Brain. Correspondence to: Mengjiao Yang <EMAIL>. |
| Pseudocode | No | The paper describes various representation learning objectives but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/google-research/google-research/tree/master/rl_repr. |
| Open Datasets | Yes | We leverage the Gym-MuJoCo datasets from D4RL (Fu et al., 2020) |
| Dataset Splits | No | The paper describes using different datasets for pretraining and downstream tasks (e.g., D4RL medium/medium-replay for pretraining, D4RL expert for imitation learning) and evaluation frequency ('every 10k steps, we evaluate the learned policy'), but does not specify explicit train/validation/test dataset splits from a single dataset. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | Unless otherwise noted, a single seed corresponds to an initial pretraining phase of 200k steps, in which a representation learning objective is optimized using batches of 256 sub-trajectories randomly sampled from the offline dataset. After pretraining, the learned representation is fixed and applied to the downstream task, which performs the appropriate training (BC, BRAC, or SAC) for 1M steps. ...we fix these to values which we found to generally perform best (regularization strength of 1.0 and policy learning rate of 0.00003). |
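The experiment setup quoted above describes a two-phase schedule: 200k representation-pretraining steps on batches of 256 sub-trajectories, then 1M downstream policy-training steps (BC, BRAC, or SAC) with the representation frozen, evaluating every 10k steps. A minimal sketch of that control flow is below; the function and callback names (`run_single_seed`, `pretrain_step`, `policy_step`, `eval_policy`) are illustrative assumptions, not identifiers from the released rl_repr code.

```python
# Hyperparameters reported in the paper's experiment setup.
PRETRAIN_STEPS = 200_000      # representation learning phase
DOWNSTREAM_STEPS = 1_000_000  # BC / BRAC / SAC phase
BATCH_SIZE = 256              # sub-trajectories sampled per batch
POLICY_LR = 3e-5              # policy learning rate found to perform best
REG_STRENGTH = 1.0            # regularization strength found to perform best


def run_single_seed(pretrain_step, policy_step, eval_policy,
                    pretrain_steps=PRETRAIN_STEPS,
                    downstream_steps=DOWNSTREAM_STEPS,
                    eval_every=10_000):
    """Run one seed: pretrain the representation, freeze it, then train
    the downstream policy, evaluating every `eval_every` steps.

    The three callbacks are hypothetical hooks standing in for the
    representation-learning update, the downstream RL/BC update, and
    policy evaluation, respectively.
    """
    # Phase 1: optimize the representation learning objective.
    for _ in range(pretrain_steps):
        pretrain_step(batch_size=BATCH_SIZE)

    # Phase 2: representation is fixed; train the downstream policy.
    eval_returns = []
    for step in range(1, downstream_steps + 1):
        policy_step(lr=POLICY_LR, reg=REG_STRENGTH)
        if step % eval_every == 0:
            eval_returns.append(eval_policy())
    return eval_returns
```

With the defaults this yields 100 evaluation points per seed; the callbacks would wrap whatever optimizer and environment the downstream algorithm uses.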