Bridging the Gap Between Offline and Online Reinforcement Learning Evaluation Methodologies

Authors: Shivakanth Sujit, Pedro Braga, Jorg Bornschein, Samira Ebrahimi Kahou

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental We compare several existing offline RL algorithms using this approach and present insights from a variety of tasks and offline datasets.
Researcher Affiliation Collaboration Shivakanth Sujit (EMAIL), ÉTS Montréal, Mila Québec; Pedro H. M. Braga (EMAIL), Universidade Federal de Pernambuco, Mila Québec; Jorg Bornschein (EMAIL), DeepMind; Samira Ebrahimi Kahou (EMAIL), ÉTS Montréal, Mila Québec, CIFAR AI Chair
Pseudocode Yes

Algorithm 1: Sequential Evaluation in the offline setting.
1: Input: Algorithm A, offline data D = {s_t, a_t, r_t, s_{t+1}}_{t=1}^{T}, increment size γ, gradient steps per increment K, evaluation frequency f_e
2: Replay buffer B ← {s_t, a_t, r_t, s_{t+1}}_{t=1}^{T_0}
3: t ← T_0
4: while t < T do
5:   Update replay buffer: B ← B ∪ {s_t, a_t, r_t, s_{t+1}}_{t}^{t+γ}
6:   Sample a training batch, ensuring the new data is included: batch ∼ B
7:   Perform a training step with A on the batch
8:   t ← t + γ
9:   for j = 1, …, K do
10:    Sample a training batch ∼ B
11:    Perform a training step with A on the batch
12:  end for
13:  if t mod f_e = 0 then
14:    Evaluate A in the environment and log performance
15:  end if
16: end while
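The loop in Algorithm 1 can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the `train_step` and `evaluate` hooks, the `batch_size` parameter, and the use of a plain list as the replay buffer are all assumptions made here for readability.

```python
import random

def sequential_evaluation(algorithm, data, t0, gamma, k, eval_freq,
                          batch_size=4, evaluate=None):
    """Sketch of Algorithm 1: reveal an offline dataset incrementally,
    train after each increment, and evaluate at a fixed frequency.
    `algorithm` is assumed to expose a train_step(batch) method and
    `evaluate` to return a scalar score (both hypothetical hooks)."""
    buffer = list(data[:t0])           # line 2: seed the replay buffer
    t = t0                             # line 3
    logs = []
    while t < len(data):               # line 4
        new = list(data[t:t + gamma])  # line 5: add the next increment
        buffer.extend(new)
        # line 6: sample a batch that is guaranteed to contain the new data
        batch = new + random.sample(buffer, min(batch_size, len(buffer)))
        algorithm.train_step(batch)    # line 7
        t += gamma                     # line 8
        for _ in range(k):             # lines 9-12: K extra gradient steps
            batch = random.sample(buffer, min(batch_size, len(buffer)))
            algorithm.train_step(batch)
        if t % eval_freq == 0 and evaluate is not None:  # lines 13-15
            logs.append((t, evaluate(algorithm)))
    return logs
```

With γ = K = 1, each newly revealed data point triggers exactly two gradient updates: one on a batch containing the new point and one on a freshly sampled batch.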
Open Source Code Yes The code to reproduce our experiments is available at https://shivakanthsujit.github.io/seq_eval.
Open Datasets Yes These algorithms were evaluated on the D4RL benchmark (Fu et al., 2020), which consists of three environments: Halfcheetah-v2, Walker2d-v2, and Hopper-v2. We also created a dataset from the DeepMind Control Suite (DMC) (Tassa et al., 2018) environments following the same procedure as outlined by the authors of D4RL. Finally, to study algorithms in visual offline RL domains, we used the v-d4rl benchmark (Lu et al., 2023), which follows the philosophy of D4RL and creates datasets of images from the DMC Suite with varying difficulties.
Dataset Splits Yes For each environment, we evaluate four versions of the offline dataset: random, medium, medium-expert, and medium-replay. Random consists of 1M data points collected using a random policy. Medium contains 1M data points from a policy that was trained for one-third of the time needed for an expert policy, while medium-replay is the replay buffer that was used to train the policy. Medium-expert consists of a mix of 1M samples from the medium policy and 1M samples from the expert policy. In this dataset, the first 33% of data comes from the random dataset, the next 33% from the medium dataset and the final 33% from the expert dataset.
Hardware Specification No No specific hardware details (like GPU/CPU models or specific machine configurations) are mentioned in the paper for running the experiments.
Software Dependencies No The paper mentions the 'rliable (Agarwal et al., 2021) library' but does not provide a specific version number. No other software dependencies with version numbers are explicitly stated.
Experiment Setup Yes We set γ and K each to 1; that is, one gradient update is performed on a batch of data sampled from the buffer every time a data point is added to it. The x-axis in the performance curves represents the amount of data in the replay buffer. In each plot, we also include the performance of the policy that generated the dataset as a baseline, which provides context for how much information each algorithm was able to extract from the dataset. For each dataset, we train algorithms following Alg. 1, initializing the replay buffer with 5000 data points at the start of training.
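A back-of-the-envelope sketch of the training budget implied by this setup (the helper name and the 1M dataset size used in the example are assumptions, not values computed in the paper):

```python
def gradient_updates(dataset_size, t0=5000, gamma=1, k=1):
    """Total gradient steps under Algorithm 1 with the reported settings:
    the buffer starts with t0 points, grows by gamma per outer iteration,
    and each iteration performs 1 + k updates."""
    increments = (dataset_size - t0) // gamma
    return increments * (1 + k)

total = gradient_updates(1_000_000)  # 1,990,000 updates for a 1M-point dataset
```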