Contextualize Me – The Case for Context in Reinforcement Learning
Authors: Carolin Benjamins, Theresa Eimer, Frederik Schubert, Aditya Mohan, Sebastian Döhler, André Biedenkapp, Bodo Rosenhahn, Frank Hutter, Marius Lindauer
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our goal is to show how the framework of cRL contributes to improving zero-shot generalization in RL through meaningful benchmarks and structured reasoning about generalization tasks. We confirm the insight that optimal behavior in cRL requires context information, as in other related areas of partial observability, and empirically validate this in the cRL framework via context-extended versions of common RL environments. These form the first benchmark library, CARL, designed for generalization based on cRL extensions of popular benchmarks, which we propose as a testbed for further study of general agents. Using CARL, we show empirically that even simple RL environments become challenging in the contextual setting, that naive solutions do not generalize across complex context spaces, and that allowing RL agents access to context information is beneficial for generalization tasks in theory and practice. Details about the hyperparameter settings and hardware used for all experiments are listed in Appendix C. |
| Researcher Affiliation | Academia | Carolin Benjamins, Theresa Eimer, Frederik Schubert, Aditya Mohan, Sebastian Döhler, Bodo Rosenhahn, Marius Lindauer (Leibniz University Hannover); André Biedenkapp, Frank Hutter (University of Freiburg) |
| Pseudocode | No | The paper describes methods and concepts through text and mathematical formulations. It includes figures and tables but no explicit 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | All experiments can be reproduced using the scripts we provide with the benchmark library at https://github.com/automl/CARL. |
| Open Datasets | Yes | In our release of CARL benchmarks, we include and contextually extend classic control and Box2D environments from OpenAI Gym (Brockman et al., 2016), Google Brax walkers (Freeman et al., 2021), a selection from the DeepMind Control Suite (Tassa et al., 2018), an RNA folding environment (Runge et al., 2019), as well as Super Mario levels (Awiszus et al., 2020; Schubert et al., 2021), see Figure 4. |
| Dataset Splits | Yes | To generate instances of both environments, we vary the pole length across a uniform distribution p_C = U(0.25, 0.75) around the standard pole length for CartPole and across p_C = U(1, 2.2) for Pendulum. For training, we sample 64 contexts from this distribution and train a general agent that experiences all contexts during training in a round-robin fashion. Afterwards, each agent is evaluated on each context it was trained on for 10 episodes. For the train and test context sets, we sample 1000 contexts each from the train and test distributions defined in the evaluation protocol, see Figure 9. The test performances are discretized and aggregated across seeds by the bootstrapped mean using rliable (Agarwal et al., 2021). In Figure 9, we show that both hidden (context-oblivious) and visible (concatenate) agents perform fairly well within their training distribution for evaluation mode A and even generalize to fairly large areas of the test distribution, more so for concat. Large update intervals combined with extreme pole lengths prove to be the most challenging area. We repeat this with 10 random seeds and 5 test episodes per context. |
| Hardware Specification | Yes | Hardware: All experiments on all benchmarks were conducted on a Slurm CPU and GPU cluster (see Table 2). The CPU partition has 1592 CPUs available across nodes. Table 2 GPUs: 1× NVIDIA Quadro M5000, 56× NVIDIA RTX 2080 Ti, 12× NVIDIA RTX 2080 Ti, 6× NVIDIA GTX 1080 Ti, 4× NVIDIA GTX Titan X, 1× NVIDIA GT 640 |
| Software Dependencies | No | We implemented our own agents using coax (Holsheimer et al., 2023) with hyperparameters specified in Table 1. All experiments can be reproduced using the scripts we provide with the benchmark library at https://anonymous.4open.science/r/CARL-54F4/. The paper cites the `coax` library and its publication year, but does not specify a version number for `coax` itself, nor versions for other core software components such as Python or the deep-learning framework underlying the algorithms used (C51, SAC, PPO). |
| Experiment Setup | Yes | Details about the hyperparameter settings and hardware used for all experiments are listed in Appendix C. Table 1 lists hyperparameters per algorithm/environment combination, with columns: algorithm, env, n_step, gamma, alpha, batch_size, learning_rate, q_targ_tau, warmup_num_frames, pi_warmup_num_frames, pi_update_freq, replay_capacity, network {width: 256, num_atoms: 51}, pi_temperature, q_min_value, q_max_value. |
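The context-sampling and round-robin training protocol quoted in the Dataset Splits row can be sketched in a few lines. This is a minimal illustration, not CARL's actual API: the function names `sample_contexts` and `round_robin_schedule` are our own, and only the CartPole pole-length range U(0.25, 0.75) and the count of 64 training contexts come from the paper.

```python
import numpy as np

def sample_contexts(rng, low, high, n):
    """Sample n context values (e.g. pole lengths) from U(low, high)."""
    return rng.uniform(low, high, size=n)

def round_robin_schedule(contexts, n_episodes):
    """Yield one context per episode, cycling through the training set."""
    for episode in range(n_episodes):
        yield contexts[episode % len(contexts)]

rng = np.random.default_rng(0)

# CartPole: 64 training contexts with pole length drawn from U(0.25, 0.75),
# as described in the evaluation protocol above.
train_contexts = sample_contexts(rng, 0.25, 0.75, 64)

# Each training episode sees the next context in turn (round robin).
schedule = list(round_robin_schedule(train_contexts, 128))
```

After 64 episodes the schedule wraps around, so every context is experienced equally often during training, matching the "round robin fashion" described in the paper.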