Dynamic Subgoal-based Exploration via Bayesian Optimization
Authors: Yijia Wang, Matthias Poloczek, Daniel R. Jiang
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | An experimental evaluation demonstrates that the new approach outperforms existing baselines across a number of problem domains. We now show numerical experiments to demonstrate the cost-effectiveness of the BESD framework. |
| Researcher Affiliation | Collaboration | Yijia Wang EMAIL University of Pittsburgh Matthias Poloczek EMAIL Amazon Daniel R. Jiang EMAIL Meta AI, University of Pittsburgh |
| Pseudocode | Yes | Algorithm 1: Bayesian Exploratory Subgoal Design. 1. Set n = 0; estimate hyperparameters of the GP prior f using initial samples. 2. Compute the next decision (θ^n, τ^n, q^n) according to the acquisition function (7). 3. Train in environment ξ^{n+1} augmented with θ^n (i.e., M_{ξ^{n+1}, θ^n}) using levers (τ^n, q^n). 4. Observe y^{n+1}(θ^n, τ^n) and update the posterior on f. 5. If n < N, increment n and return to Step 2. 6. Return a subgoal recommendation θ^N_rec that maximizes µ^N(θ, τ_max). |
| Open Source Code | Yes | BESD is implemented using the MOE package (Clark et al., 2014) and the full source code can be found at the following URL: https://github.com/yjwang0618/subgoal-based-exploration. |
| Open Datasets | No | The first set of environments (GW10) is a distribution over 10×10 gridworlds... The second domain (GW20) is a distribution of larger 20×20 gridworlds... The third domain (TR) is a distribution of 10×10 gridworlds... The mountain car (MC) domain, as we introduced in Example 2, is a commonly used RL benchmark environment... In domains KEY2 (with two subgoals) and KEY3 (with three subgoals), we consider a 10×10 gridworld... |
| Dataset Splits | Yes | In our setup, an agent is given a fixed (and small) number of opportunities to train in environments randomly drawn from a distribution Ξ (henceforth, we refer to these as training environments)... After these opportunities are exhausted, the agent enters a random test environment ξ ∼ Ξ... For each replication, to assess the performance at a particular point in the process, we take its latest recommendation and test it by averaging its performance on a random sample of 200 test MDPs (i.e., ξ ∼ Ξ). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or other computer specifications used for running the experiments. It only mentions general computational context like 'RL training itself did not use prohibitive amounts of computation'. |
| Software Dependencies | No | BESD is implemented using the MOE package (Clark et al., 2014)... The underlying RL algorithm for all environments is Q-learning (Watkins & Dayan, 1992)... Both EI and LCB are implemented using the GPyOpt package (González, 2016). |
| Experiment Setup | Yes | The potential function at state s with the jth subgoal activated is Φ_j(s) = w1 exp[−0.5 (s − θ_j)² / w2], where the height is w1 = 0.2 and the width is w2 = 10. The underlying RL algorithm for all environments is Q-learning with an ϵ-greedy behavioral policy (ϵ = 0.2). We use T = {200, 600, 1000} for the possible values of τ and Q = {5, 20} for the possible values of q [for GW10]. In this experiment, we consider the case of only allowing BESD to select the maximum episode length from T = {4000, 7000, 10000}, while keeping q = 20 fixed [for GW20]. The discount factor is set to γ = 0.98 [for TR]. Setting η = 3 (the default value) and R = 81, HB consists of log_η R rounds. |
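The pseudocode quoted in the Pseudocode row can be sketched structurally in Python. This is a minimal sketch, not the authors' implementation: the GP posterior, the acquisition function (7), and the RL training call are replaced with hypothetical placeholders (`acquisition_argmax`, `train_and_observe`), and the final recommendation step is simplified to "best observed subgoal" rather than maximizing µ^N(θ, τ_max). The real BESD uses the MOE package for the GP/acquisition machinery and Q-learning for training.

```python
import random

# Candidate lever values quoted for the GW10 setting.
TAUS = [200, 600, 1000]   # possible maximum episode lengths tau
QS = [5, 20]              # possible values of q

def acquisition_argmax(posterior):
    """Placeholder for Step 2: the paper maximizes a cost-aware
    acquisition function over (theta, tau, q); here we sample a
    hypothetical 2-D subgoal location uniformly instead."""
    theta = (random.uniform(0, 10), random.uniform(0, 10))
    return theta, random.choice(TAUS), random.choice(QS)

def besd(train_and_observe, n_iters=5):
    """Structural sketch of Algorithm 1 (BESD)."""
    posterior = None          # GP posterior on f (stubbed out)
    history = []
    for n in range(n_iters):
        theta, tau, q = acquisition_argmax(posterior)   # Step 2
        y = train_and_observe(theta, tau, q)            # Steps 3-4
        history.append((theta, y))
        # Step 4's posterior update is omitted in this sketch.
    # Step 6 (simplified): recommend the best observed subgoal.
    return max(history, key=lambda h: h[1])[0]
```

A caller supplies `train_and_observe`, which stands in for augmenting the sampled environment with the subgoal and running the RL agent for `tau`-length episodes.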
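The potential function in the Experiment Setup row, together with standard potential-based shaping, can be written out as a short sketch. The squared-distance form `(s − θ_j)²` and the 2-D state representation are assumptions made for illustration; the quoted text does not spell out the distance expression, and the discount γ = 0.98 is the value quoted for the TR domain.

```python
import math

# Quoted constants: height w1 = 0.2, width w2 = 10.
W1, W2 = 0.2, 10.0

def potential(s, subgoal):
    """Gaussian-bump potential Phi_j(s) = w1 * exp(-0.5 * d^2 / w2),
    centered at the active subgoal; d^2 is an assumed squared
    Euclidean distance between state s and the subgoal."""
    d2 = sum((a - b) ** 2 for a, b in zip(s, subgoal))
    return W1 * math.exp(-0.5 * d2 / W2)

def shaping_bonus(s, s_next, subgoal, gamma=0.98):
    """Standard potential-based shaping term F = gamma*Phi(s') - Phi(s),
    which preserves the optimal policy (Ng et al., 1999)."""
    return gamma * potential(s_next, subgoal) - potential(s, subgoal)
```

Moving toward the subgoal yields a positive bonus, which is how the shaped reward steers exploration without changing the task's optimal policy.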