Dynamic Subgoal-based Exploration via Bayesian Optimization

Authors: Yijia Wang, Matthias Poloczek, Daniel R. Jiang

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental An experimental evaluation demonstrates that the new approach outperforms existing baselines across a number of problem domains. We now show numerical experiments to demonstrate the cost-effectiveness of the BESD framework.
Researcher Affiliation Collaboration Yijia Wang (University of Pittsburgh); Matthias Poloczek (Amazon); Daniel R. Jiang (Meta AI, University of Pittsburgh)
Pseudocode Yes Algorithm 1 Bayesian Exploratory Subgoal Design
1. Set n = 0. Estimate hyperparameters of the GP prior f using initial samples.
2. Compute the next decision (θn, τn, qn) according to the acquisition function (7).
3. Train in environment ξn+1 augmented with θn (Mξn+1,θn) using levers (τn, qn).
4. Observe yn+1(θn, τn) and update the posterior on f.
5. If n < N, increment n and return to Step 2.
6. Return a subgoal recommendation θNrec that maximizes µN(θ, τmax).
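The loop in Algorithm 1 can be sketched in plain Python. This is a hypothetical stand-in, not the authors' implementation: the GP posterior is replaced by running means, the acquisition function (7) by a greedy rule after an initial design, and `train_and_observe` by a toy objective (the candidate subgoals, and the GW10 sets T and Q, are taken from the experiment setup below; the subgoal locations themselves are made up).

```python
import random
from collections import defaultdict

# Hypothetical candidate subgoals; T_SET and Q_SET follow the GW10 setup.
THETA = [(2, 3), (5, 5), (7, 1)]
T_SET = [200, 600, 1000]
Q_SET = [5, 20]

def train_and_observe(theta, tau, q, rng):
    """Steps 3-4 stand-in: train in a sampled environment augmented with
    subgoal theta using levers (tau, q), then observe a noisy signal y.
    Toy objective: subgoal (5, 5) is best; in the paper, performance also
    depends on the chosen (tau, q)."""
    base = -abs(theta[0] - 5) - abs(theta[1] - 5)
    return base + rng.gauss(0.0, 0.1)

def besd_loop(n_iters=30, seed=0):
    rng = random.Random(seed)
    combos = [(th, t, q) for th in THETA for t in T_SET for q in Q_SET]
    history = defaultdict(list)  # running means stand in for the GP posterior
    for n in range(n_iters):
        if n < len(combos):
            # Step 1 stand-in: an initial design covering every decision once.
            theta, tau, q = combos[n]
        else:
            # Step 2 stand-in: greedy over running means; the paper instead
            # maximizes the acquisition function (7) under the GP posterior.
            theta, tau, q = max(
                history, key=lambda k: sum(history[k]) / len(history[k]))
        y = train_and_observe(theta, tau, q, rng)
        history[(theta, tau, q)].append(y)  # Step 4: posterior-update stand-in
    # Step 6: recommend the subgoal with the best estimated value at tau_max.
    tau_max = max(T_SET)
    def mean_at_tau_max(th):
        ys = [y for (t, tau, _), v in history.items() for y in v
              if t == th and tau == tau_max]
        return sum(ys) / len(ys)
    return max(THETA, key=mean_at_tau_max)
```

On the toy objective, the loop concentrates samples on the best subgoal and recommends it; the real method replaces every stand-in above with the GP machinery from the paper.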
Open Source Code Yes BESD is implemented using the MOE package (Clark et al., 2014), and the full source code can be found at the following URL: https://github.com/yjwang0618/subgoal-based-exploration.
Open Datasets No The first set of environments (GW10) is a distribution over 10×10 gridworlds... The second domain (GW20) is a distribution of larger 20×20 gridworlds... The third domain (TR) is a distribution of 10×10 gridworlds... The mountain car (MC) domain, as we introduced in Example 2, is a commonly used RL benchmark environment... In domains KEY2 (with two subgoals) and KEY3 (with three subgoals), we consider a 10×10 gridworld...
Dataset Splits Yes In our setup, an agent is given a fixed (and small) number of opportunities to train in environments randomly drawn from a distribution Ξ (henceforth, we refer to these as training environments)... After these opportunities are exhausted, the agent enters a random test environment ξ ∼ Ξ... For each replication, to assess the performance at a particular point in the process, we take its latest recommendation and test it by averaging its performance on a random sample of 200 test MDPs (i.e., ξN).
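The test protocol above (average the latest recommendation's performance over 200 test MDPs drawn from Ξ) can be sketched as follows. Everything here is a hypothetical illustration: `test_return` and the uniform environment parameters are stand-ins for the actual MDP distribution and RL rollouts.

```python
import random

def evaluate_recommendation(test_return, n_test=200, seed=1):
    """Hypothetical sketch of the evaluation protocol: draw n_test
    environments from the test distribution Xi and average the
    recommendation's return. `test_return` maps a sampled environment
    parameter to an episode return."""
    rng = random.Random(seed)
    envs = [rng.random() for _ in range(n_test)]  # stand-in env parameters
    return sum(test_return(e) for e in envs) / n_test
```

Averaging over a fresh sample of test environments, rather than reusing the training environments, is what separates the "dataset split" here from a conventional supervised train/test split.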
Hardware Specification No The paper does not provide specific hardware details such as GPU/CPU models, memory, or other computer specifications used for running the experiments. It only mentions general computational context like 'RL training itself did not use prohibitive amounts of computation'.
Software Dependencies No BESD is implemented using the MOE package (Clark et al., 2014)... The underlying RL algorithm for all environments is Q-learning (Watkins & Dayan, 1992)... Both EI and LCB are implemented using the GPyOpt package (González, 2016).
Experiment Setup Yes The potential function at state s with the jth subgoal activated is Φj(s) = w1 exp[−0.5 (s − sj)² / w2], where sj is the jth subgoal location, the height is w1 = 0.2, and the width is w2 = 10. The underlying RL algorithm for all environments is Q-learning with an ϵ-greedy behavioral policy (with ϵ = 0.2). We use T = {200, 600, 1000} for the possible values of τ and Q = {5, 20} for the possible values of q [for GW10]. In this experiment, we consider the case of only allowing BESD to select the maximum episode length from T = {4000, 7000, 10000}, while keeping q = 20 fixed [for GW20]. The discount factor is set to γ = 0.98 [for TR]. Setting η = 3 (the default value) and R = 81, HB consists of log_η(R) rounds.
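Two of the numeric details above are easy to check concretely: the Gaussian shaping potential with w1 = 0.2 and w2 = 10, and the Hyperband round count log_η(R) = log_3(81) = 4. The snippet below is a minimal sketch; the squared-distance term (s − sj)² is reconstructed here as squared Euclidean distance between 2-D grid coordinates, which is an assumption about the paper's exact form.

```python
import math

W1, W2 = 0.2, 10.0  # shaping height and width from the experiment setup

def potential(s, subgoal):
    """Gaussian shaping potential Phi_j(s) around the jth subgoal.
    Assumes s and subgoal are 2-D grid coordinates and that the distance
    term is squared Euclidean distance (a reconstruction, not confirmed
    by the excerpt)."""
    d2 = (s[0] - subgoal[0]) ** 2 + (s[1] - subgoal[1]) ** 2
    return W1 * math.exp(-0.5 * d2 / W2)

# The bonus peaks at w1 = 0.2 on the subgoal itself and decays smoothly
# with distance, so the reward shaping is strongest near the subgoal.
hb_rounds = round(math.log(81, 3))  # Hyperband with eta = 3, R = 81
```

With η = 3 and R = 81 the round count works out to exactly 4, matching the HB configuration described above.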