Event Tables for Efficient Experience Replay
Authors: Varun Raj Kompella, Thomas Walsh, Samuel Barrett, Peter R. Wurman, Peter Stone
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop a theoretical underpinning for the fast-lane intuition and show that, if the events are correlated with optimal behavior and histories are sufficiently long, SSET can dramatically speed up the convergence of off-policy learning compared to using uniform sampling or even PER. Even if those conditions fail, a bias correction term preserves the Bellman target, although convergence may be slowed. From our empirical results, these properties translate to different off-policy RL base learners including DDQN (Van Hasselt et al., 2016), SAC (Haarnoja et al., 2018), and the distributional QR-SAC (Wurman et al., 2022) algorithm. ... (3) We empirically demonstrate the advantages of SSET over uniform sampling or PER in challenging MiniGrid environments and continuous RL benchmarks (MuJoCo and Lunar Lander), and find that combining SSET with TD-error PER or potential-based reward shaping can further improve learning speed. |
| Researcher Affiliation | Collaboration | Varun Kompella (Sony AI); Thomas J. Walsh (Sony AI); Samuel Barrett (Sony AI); Peter R. Wurman (Sony AI); Peter Stone (Sony AI; The University of Texas at Austin, Austin, TX 78712, USA) |
| Pseudocode | Yes | Algorithm 1: SSET: Stratified Sampling from Event Tables. // input parameters: (1) νᵢ = (ωᵢ, τᵢ) ∀i ∈ [1, n]: event conditions and their corresponding history lengths; (2) (ηᵢ, κᵢ, dᵢ) ∀i ∈ [0, n]: sampling probabilities, capacity sizes, and minimum data requirements; (3) A, πb, T: off-policy RL algorithm, behavior policy, and episode length. (4) B₀ ← [](κ₀), Bνᵢ ← [](κᵢ) ∀i ∈ [1, n]; (5) for episode k = 1 to ∞ do; (6) E ← [] // init episode buffer; (7) for t = 0 to T−1 (or episode termination) do; (8) s ← current state observation; (9) execute action a sampled from πb(s); (10) observe r and next state s′; (11) store transition (s, a, r, s′) in B₀ and E; (12) for i = 1 to n do // update event tables; (13) if ωᵢ(s′) then; (14) store last τᵢ transitions [E(t−τᵢ+1), ..., E(t)] in Bνᵢ; (18) D ← B₀ ∪ᵢ Bνᵢ, ∀i s.t. \|Bνᵢ\| ≥ dᵢ // sample & concat i.i.d. minibatches; (19) update weights for bias correction using E // (Lemma 2 in Appendix); (20) update critic and policy networks using A and D |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code for the methodology described. It mentions 'Reverb (Cassirer et al., 2021)' as a publicly available package that implements a data structure efficiently, but this is a third-party tool used by the authors, not their own code. |
| Open Datasets | Yes | Empirical results in challenging MiniGrid domains, benchmark RL environments, and a high-fidelity car racing simulator demonstrate the advantages and versatility of SSET over existing ER buffer sampling approaches. ... MiniGrid (Chevalier-Boisvert et al., 2018) domain ... continuous RL benchmarks (MuJoCo (Todorov et al., 2012) suite defined in OpenAI Gym (Brockman et al., 2016)) |
| Dataset Splits | No | The paper describes various experimental setups and evaluations within simulated environments like MiniGrid, MuJoCo, Lunar Lander, and Gran Turismo Sport. It details how policies are evaluated (e.g., 'evaluated after every 5 epochs', 'starting from a slightly shifted initial grid position'), but it does not specify explicit training/test/validation dataset splits (e.g., percentages or counts of data for each split), as the data is generated dynamically within these environments. |
| Hardware Specification | Yes | For all the MiniGrid experiments, we used two asynchronous rollout workers each with 1.5 CPUs and 2GB memory... The training was conducted using two virtual CPUs and 3 GB of memory at a rate of 40 training steps per sec. For the continuous control tasks (Lunar Lander Continuous and MuJoCo domains), we used a single rollout worker with 1 CPU and 2GB memory, and the training was conducted asynchronously using 7.7 CPUs with 8GB of memory. ... The Gran Turismo (GT) Sport ... Our experiments used 21 PlayStations in parallel during training with 1 of those PlayStations typically devoted to evaluations. ... Training was conducted using one NVIDIA V100 or half of an NVIDIA A100, 8 virtual CPUs and 55 GB of RAM. |
| Software Dependencies | No | The paper mentions several algorithms and frameworks like DDQN (Van Hasselt et al., 2016), SAC (Haarnoja et al., 2018), QR-SAC (Wurman et al., 2022), OpenAI Gym (Brockman et al., 2016), and the Reverb library (Cassirer et al., 2021). However, it does not provide specific version numbers for any of these software dependencies or programming languages used for implementation. |
| Experiment Setup | Yes | Table 1: Parameters in the MiniGrid domain lists 'Learning rate 1e-3', 'batch-size 32', 'epsilon-greedy (ϵ) 0.3', 'TD-Priority exponent 0.65', 'Stale network refresh rate 0.01', 'Discount factor 0.99'. Table 2: Parameters in the Continuous Control Benchmark Tasks lists 'Learning rates Actor: 0.0003, Critic: 0.0003', 'batch-size 256', 'Target entropy (optimized entropy)', 'Stale network refresh rate 0.005', 'Discount factor 0.99'. For Gran Turismo: 'Mini-batches of size 1024 were used with 6000 mini-batches per epoch. ... The Adam optimizer was used to optimize the weights with learning rates of 5×10⁻⁵ for the Q-networks and 2.5×10⁻⁵ for the policy network. A discount factor of 0.9896 was used and the SAC entropy temperature controlling exploration was 0.01. Dropout (0.1) was used when learning the policy network.' |
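The data flow in Algorithm 1 (store every transition in the default buffer, copy the last τᵢ transitions into event table Bνᵢ when condition ωᵢ fires, then draw stratified minibatches) can be sketched in Python. This is a hypothetical illustration, not the authors' implementation: the names `EventTable` and `sset_minibatch`, the toy 1-D chain task, and all constants are assumptions, and the bias-correction step (Lemma 2) is omitted.

```python
# Minimal sketch of SSET's stratified sampling from event tables.
import random
from collections import deque

class EventTable:
    """FIFO buffer B_nu_i holding the last tau_i transitions before an event."""
    def __init__(self, condition, history_len, capacity, min_size, sample_prob):
        self.condition = condition            # omega_i: predicate on next state
        self.history_len = history_len        # tau_i
        self.buffer = deque(maxlen=capacity)  # capacity kappa_i
        self.min_size = min_size              # d_i: minimum data before sampling
        self.sample_prob = sample_prob        # eta_i: fraction of each minibatch

def sset_minibatch(default_buffer, tables, batch_size, rng):
    """Concatenate i.i.d. strata: ~eta_i * batch_size from each eligible event
    table, the remainder from the default buffer B_0 (uniform within strata)."""
    batch = []
    for table in tables:
        if len(table.buffer) >= table.min_size:
            k = min(int(round(table.sample_prob * batch_size)),
                    batch_size - len(batch), len(table.buffer))
            batch += rng.sample(list(table.buffer), k)
    batch += rng.sample(list(default_buffer), batch_size - len(batch))
    return batch

# Demo on a 1-D chain: scripted action +1; the event is reaching s >= 10.
rng = random.Random(0)
goal_table = EventTable(condition=lambda s: s >= 10, history_len=3,
                        capacity=100, min_size=4, sample_prob=0.25)
default_buffer = deque(maxlen=1000)
episode, s = [], 0
for _ in range(50):
    a = 1                                  # behavior policy pi_b (scripted here)
    s_next = s + a
    transition = (s, a, float(s_next >= 10), s_next)
    default_buffer.append(transition)      # every transition goes to B_0
    episode.append(transition)
    if goal_table.condition(s_next):       # event fired: copy recent history
        goal_table.buffer.extend(episode[-goal_table.history_len:])
    episode, s = ([], 0) if s_next >= 10 else (episode, s_next)

minibatch = sset_minibatch(default_buffer, [goal_table], batch_size=8, rng=rng)
print(len(minibatch))  # 8: 2 goal-table transitions + 6 from the default buffer
```

Note the design choice the paper's analysis relies on: each event table receives only the short history leading to its event, so the "fast lane" transitions are over-represented in every minibatch relative to their frequency in the default buffer.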