Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning
Authors: Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, Sheila A. McIlraith
JAIR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we provide an empirical evaluation of our methods in domains with a variety of characteristics: discrete states, continuous states, and continuous action spaces. Some domains include multitask learning and single task learning. Most of the tasks considered have been expressed using simple reward machines, though those used in Section 5.3 require the full formulation as we describe below. As a brief summary, our results show the following: 1. CRM and HRM outperform the cross-product baselines in all our experiments. 2. CRM converges to the best policies in all but one experiment. 3. HRM tends to initially learn faster than CRM but converges to suboptimal policies. 4. The gap between CRM/HRM and the cross-product baseline increases when learning in a multitask setting. 5. Reward shaping helps in discrete domains but it does not in continuous domains. |
| Researcher Affiliation | Academia | Rodrigo Toro Icarte EMAIL Pontificia Universidad Católica de Chile, Santiago, Chile Vector Institute, Toronto, ON, Canada Toryn Q. Klassen EMAIL University of Toronto, Toronto, ON, Canada Vector Institute, Toronto, ON, Canada Richard Valenzano EMAIL Ryerson University, Toronto, ON, Canada Sheila A. McIlraith EMAIL University of Toronto, Toronto, ON, Canada Vector Institute, Toronto, ON, Canada |
| Pseudocode | Yes | Algorithm 1 The cross-product baseline using tabular Q-learning. Algorithm 2 Tabular Q-learning with counterfactual experiences for RMs (CRM). Algorithm 3 Tabular hierarchical RL for reward machines (HRM). Algorithm 4 Value iteration for automated reward shaping |
| Open Source Code | Yes | Our code is available at github.com/RodrigoToroIcarte/reward_machines, including our environments, raw results, and implementations of the cross-product baseline, automated reward shaping, CRM, and HRM using tabular Q-learning, DDQN, and DDPG. For the experiments with QRM, we use the following implementation: bitbucket.org/RToroIcarte/qrm. |
| Open Datasets | Yes | Our final set of experiments considers the case where the action space is continuous. We ran experiments on the HalfCheetah-v3 environment (Brockman et al., 2016). ... We tested our approaches in a continuous state space problem called the water world (Sidor, 2016; Karpathy, 2015). |
| Dataset Splits | No | The paper describes experiments in different environments (gridworlds, water world, Half Cheetah-v3) and settings (multitask, single task) using a number of independent trials (e.g., 60 independent trials, 10 randomly generated maps with 2 or 6 trials per map), but does not specify fixed training/test/validation dataset splits in the traditional sense of partitioning pre-collected static data. The data is generated through interaction with the environments. |
| Hardware Specification | Yes | The results on the Office and Craft domains were computed using one core on an Intel(R) Xeon(R) Gold 6148 processor. The results on the Water and Half-Cheetah domains were computed using one Tesla P100 GPU. |
| Software Dependencies | No | Our DDQN implementation was based on the code from OpenAI Baselines (Hesse et al., 2017). Finally, we have released a new implementation of our code that is fully compatible with the OpenAI Gym API (Brockman et al., 2016). ... the computation of gradients using larger mini-batches is parallelized by TensorFlow. |
| Experiment Setup | Yes | We use ϵ = 0.1 for exploration, γ = 0.9, and α = 0.5. We also used optimistic initialization of the Q-values by setting the initial Q-value of any state-action pair to be 2. For the hierarchical RL methods, we use r⁺ = 1 and r⁻ = 0. We used a feed-forward network with 3 hidden layers and 1024 relu units per layer. We trained the networks using a learning rate of 10⁻⁵. On every step, we updated the Q-functions using 32n sampled experiences from a replay buffer of size 50000n, where n = 1 for DDQN and n = |U| for CRM. The target networks were updated every 100 training steps and the discount factor γ was 0.9. For HRM, we use the same feed-forward network and hyperparameters to train the options' policies (although n = |A| in this case). The high-level policy was learned using DDQN but, since the high-level decision problem is simpler, we used a smaller network (2 layers with 256 relu units) and a larger learning rate (10⁻³). All the approaches use a feed-forward network with 2 layers and 256 relu units per layer. The batch size was 100n (where n = 1 in DDPG, n = |U| in CRM, and n = |A| in HRM) and the rest of the hyperparameters were set to their default values (Hesse et al., 2017). |
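The counterfactual-experience idea behind Algorithm 2 (CRM, named in the Pseudocode row) can be summarized in a few lines: every observed environment transition is replayed under *all* reward machine states, not just the one the agent was actually in. The sketch below is a minimal, hypothetical tabular version; the helper names `delta_u` (RM state-transition function), `delta_r` (RM reward function), and `props` (the propositions labelling a step) are illustrative, and the authors' released repository is the authoritative implementation.

```python
# Hypothetical sketch of CRM (Algorithm 2): tabular Q-learning where one
# environment transition yields a Q-update for every reward machine state.
from collections import defaultdict


def make_q_table(initial_value=2.0):
    # The paper initializes Q-values optimistically to 2.
    return defaultdict(lambda: initial_value)


def crm_update(Q, actions, transition, rm_states, delta_u, delta_r,
               alpha=0.5, gamma=0.9):
    """Apply one CRM update from a single observed transition.

    transition = (s, a, s_next, props, done), where `props` labels the
    step; `delta_u`/`delta_r` are the RM's transition and reward
    functions (names are assumptions, not the paper's API).
    """
    s, a, s_next, props, done = transition
    for u in rm_states:  # counterfactual: pretend the RM was in state u
        u_next = delta_u(u, props)
        r = delta_r(u, props)
        target = r if done else r + gamma * max(
            Q[(s_next, u_next, b)] for b in actions)
        Q[(s, u, a)] += alpha * (target - Q[(s, u, a)])
```

Because the loop runs over all |U| reward machine states, each environment step produces |U| experiences, which matches the n = |U| replay-buffer and batch-size scaling quoted in the Experiment Setup row.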
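Algorithm 4 (value iteration for automated reward shaping) admits a similarly small sketch: run value iteration over the reward machine's own small, discrete state graph, then use the resulting values as a potential function in the standard potential-based shaping formula of Ng et al. (1999). The choice Φ(u) = v(u) and the helper names below are assumptions made for illustration; consult the released code for the paper's exact construction.

```python
# Hypothetical sketch of Algorithm 4: value iteration over the reward
# machine's states, used to build a shaping potential.
def rm_value_iteration(rm_states, props_set, delta_u, delta_r,
                       gamma=0.9, tol=1e-8):
    """Compute v(u) for each RM state by sweeping until convergence.

    `props_set` enumerates the possible step labels; `delta_u`/`delta_r`
    are the RM's transition and reward functions (illustrative names).
    """
    v = {u: 0.0 for u in rm_states}
    while True:
        error = 0.0
        for u in rm_states:
            new_v = max(delta_r(u, p) + gamma * v[delta_u(u, p)]
                        for p in props_set)
            error = max(error, abs(new_v - v[u]))
            v[u] = new_v
        if error < tol:  # converged (guaranteed for gamma < 1)
            return v


def shaped_reward(r, u, u_next, phi, gamma=0.9):
    # Standard potential-based shaping: preserves optimal policies
    # for any potential function phi over RM states.
    return r + gamma * phi[u_next] - phi[u]
```

This is consistent with the reported finding in the Research Type row that shaping helped in the discrete domains: the potential is cheap to compute because it depends only on the RM states, not the environment states.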