No More Pesky Hyperparameters: Offline Hyperparameter Tuning for RL

Authors: Han Wang, Archit Sakhadeo, Adam M White, James M Bell, Vincent Liu, Xutong Zhao, Puer Liu, Tadashi Kozuno, Alona Fyshe, Martha White

TMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We empirically investigate the method in a variety of settings to identify when it is effective and when it fails. We conducted a battery of experiments to provide a rounded assessment of when an approach can or cannot be expected to reliably select good hyperparameters for online learning.
Researcher Affiliation Academia Han Wang (EMAIL), Archit Sakhadeo (EMAIL), Adam White (EMAIL), James Bell (EMAIL), Vincent Liu (EMAIL), Xutong Zhao (EMAIL), Puer Liu (EMAIL), Tadashi Kozuno (EMAIL), Alona Fyshe (EMAIL), Martha White (EMAIL). These authors contributed equally to this work. Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta, Edmonton, Alberta, Canada. Published in Transactions on Machine Learning Research (07/2022).
Pseudocode Yes Algorithm 1: Hyperparameter Selection with Calibration Models using Grid Search; Algorithm 2: Agent Perf In Env; Algorithm 3: Learn KNN Calibration Model; Algorithm 4: Sample KNN Calibration Model
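The paper's Algorithms 3 and 4 learn and then sample a k-nearest-neighbor calibration model from a log of transitions. A minimal sketch of that idea in NumPy follows; the class and method names are our own, and the distance metric and tie-breaking are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

class KNNCalibrationModel:
    """Non-parametric simulator built from a log of transitions.

    Sketch of the KNN calibration-model idea: to step from (s, a),
    find the k logged transitions that used action a whose states
    are nearest to s, then replay one of their observed outcomes.
    (Illustrative only; not the paper's exact algorithm.)
    """

    def __init__(self, states, actions, rewards, next_states, k=3, rng=None):
        self.states = np.asarray(states, dtype=float)
        self.actions = np.asarray(actions)
        self.rewards = np.asarray(rewards, dtype=float)
        self.next_states = np.asarray(next_states, dtype=float)
        self.k = k
        self.rng = rng or np.random.default_rng(0)

    def sample(self, state, action):
        """Sample a (reward, next_state) pair from the k logged
        transitions with matching action nearest to `state`."""
        idx = np.flatnonzero(self.actions == action)
        if idx.size == 0:
            raise ValueError(f"no logged transitions for action {action!r}")
        dists = np.linalg.norm(
            self.states[idx] - np.asarray(state, dtype=float), axis=1)
        nearest = idx[np.argsort(dists)[: self.k]]
        choice = self.rng.choice(nearest)
        return self.rewards[choice], self.next_states[choice]
```

An agent can then be run entirely inside this model (the paper's Algorithm 2) to score a candidate hyperparameter setting without touching the real environment.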
Open Source Code No The paper does not provide an explicit statement about releasing its own code or a link to a repository for the methodology described. It references an open-source package for Bayesian Optimization, stating: "We use an open-source package (Nogueira, 2014 ), which uses gaussian processes for optimizing the hyperparameter setting. We chose to use upper confidence bounds, with a confidence level of 2.576 the default in the package as the acquisition method. The queue is initialized with 5 random samples and the algorithm is run for 200 iterations."
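The quoted setup (a Gaussian-process surrogate, a UCB acquisition with confidence level 2.576, 5 random initial samples) is the standard GP-UCB loop. The sketch below is our own pure-NumPy reimplementation of that loop for a 1-D search space, not the Nogueira (2014) package's API; the grid resolution, kernel length-scale, and iteration count are illustrative assumptions:

```python
import numpy as np

def gp_ucb_optimize(f, low, high, n_init=5, n_iter=30, kappa=2.576,
                    length_scale=0.2, noise=1e-6, seed=0):
    """Minimal 1-D Bayesian optimization with a GP surrogate and an
    upper-confidence-bound acquisition (kappa = 2.576, matching the
    confidence level quoted from the paper). Sketch only."""
    rng = np.random.default_rng(seed)

    def kern(a, b):
        # Squared-exponential (RBF) kernel between 1-D point sets.
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * (d / length_scale) ** 2)

    # Initialize the queue with n_init random samples.
    xs = rng.uniform(low, high, size=n_init)
    ys = np.array([f(x) for x in xs])
    grid = np.linspace(low, high, 256)

    for _ in range(n_iter):
        K = kern(xs, xs) + noise * np.eye(len(xs))
        Kinv = np.linalg.inv(K)
        ks = kern(grid, xs)
        mu = ks @ Kinv @ ys                                   # posterior mean
        var = np.clip(1.0 - np.sum(ks @ Kinv * ks, axis=1), 0.0, None)
        ucb = mu + kappa * np.sqrt(var)                       # acquisition
        x_next = grid[np.argmax(ucb)]                         # next query
        xs = np.append(xs, x_next)
        ys = np.append(ys, f(x_next))

    best = np.argmax(ys)
    return xs[best], ys[best]
```

In the paper's use, `f` would be the agent's performance inside the calibration model for a given hyperparameter value, and the loop would run for the stated 200 iterations rather than the small default used here.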
Open Datasets No We conducted a battery of experiments to provide a rounded assessment of when an approach can or cannot be expected to reliably select good hyperparameters for online learning. We investigate varying the data collection policy and size of the data logs to mimic a variety of deployment scenarios ranging from a near-optimal operator to random data. In this first experiment we select the hyperparameters for a linear softmax-policy Expected Sarsa agent (from here on, linear Sarsa) from data generated by a simple policy with good coverage.
Dataset Splits No The paper describes generating "data logs" of various sizes (e.g., "5000 transitions data log", "500, 1000, and 5000 samples") which are then used to train the calibration model. However, it does not specify explicit training, validation, or test splits for these data logs in the traditional supervised learning sense. Instead, the data logs are used to train the calibration model, and then the agent learns within the simulated environment provided by the calibration model.
Hardware Specification No The paper reports only aggregate compute: experiments were conducted on a cluster and a powerful workstation using 8327 CPU hours and no GPUs, without specifying CPU models, memory, or other hardware details.
Software Dependencies No The paper mentions using "an open-source package (Nogueira, 2014 )" for Bayesian optimization, but it does not specify any other software dependencies with version numbers for their own implementation (e.g., programming language version, libraries, or frameworks).
Experiment Setup Yes We investigate several dimensions of hyperparameters including the step-size and momentum parameters of the Adam optimizer, the temperature parameter of the policy, and the value function weight initialization. We optimize the temperature τ and stepsize α as continuous values in the ranges [0.0001, 5.0] and (0.0, 0.1] respectively for Acrobot, and [0.0001, 10.0] and [0.0, 1.0] respectively for Puddle World. The queue is initialized with 5 random samples and the algorithm is run for 200 iterations. Both random search and CEM use 100 iterations.
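One of the baselines quoted here, the cross-entropy method (CEM), searches a continuous hyperparameter box by repeatedly sampling a population, keeping the top-scoring "elite" settings, and refitting the sampling distribution to them. The following is a generic sketch under our own assumptions (Gaussian sampling distribution, clipping to the box, illustrative population and elite sizes), not the paper's implementation:

```python
import numpy as np

def cem_search(score, low, high, n_iter=100, pop=20, elite_frac=0.2, seed=0):
    """Cross-entropy method over a box of continuous hyperparameters.
    `score` maps a parameter vector to a higher-is-better number;
    `low`/`high` give the ranges, e.g. the paper's temperature in
    [0.0001, 5.0] and step-size in (0.0, 0.1] for Acrobot."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(low, float), np.asarray(high, float)
    mean = (low + high) / 2.0
    std = (high - low) / 2.0
    n_elite = max(1, int(pop * elite_frac))

    for _ in range(n_iter):
        # Sample a population, clipped into the allowed ranges.
        samples = np.clip(rng.normal(mean, std, size=(pop, len(low))),
                          low, high)
        scores = np.array([score(x) for x in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]
        # Refit the sampling distribution to the elite set.
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean
```

As with Bayesian optimization above, `score` would be agent performance inside the calibration model; the 100 iterations match the budget the paper quotes for random search and CEM.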