Deep Exploration via Randomized Value Functions

Authors: Ian Osband, Benjamin Van Roy, Daniel J. Russo, Zheng Wen

JMLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present several reinforcement learning algorithms that leverage randomized value functions and demonstrate their efficacy through computational studies. We also prove a regret bound that establishes statistical efficiency with a tabular representation. Keywords: reinforcement learning, exploration, value function, neural network"
Researcher Affiliation | Collaboration | Ian Osband (EMAIL, DeepMind); Benjamin Van Roy (EMAIL, Stanford University); Daniel J. Russo (EMAIL, Columbia University); Zheng Wen (EMAIL, Adobe Research)
Pseudocode | Yes | Algorithm 1 (live). Input: agent methods act, update_buffer, learn_from_buffer; environment methods reset, step
1: for ℓ in (1, 2, ...) do
2:   agent.learn_from_buffer()
3:   transition ← environment.reset()
4:   while transition.new_state is not null do
5:     action ← agent.act(transition.new_state)
6:     transition ← environment.step(action)
7:     agent.update_buffer(transition)
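The interaction loop above can be sketched as runnable Python. This is a minimal illustration, not the paper's implementation: the Transition fields, ChainEnv, and RandomAgent are hypothetical stand-ins, chosen only so the three agent methods and two environment methods named in Algorithm 1 have something concrete to call.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Transition:
    old_state: Optional[int]
    action: Optional[int]
    reward: float
    new_state: Optional[int]  # None marks the end of an episode

class ChainEnv:
    """Hypothetical toy environment: a chain that ends after `horizon` steps."""
    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> Transition:
        self.t = 0
        return Transition(None, None, 0.0, 0)

    def step(self, action: int) -> Transition:
        self.t += 1
        new_state = self.t if self.t < self.horizon else None
        return Transition(self.t - 1, action, float(action), new_state)

class RandomAgent:
    """Placeholder agent exposing the three methods named in Algorithm 1."""
    def __init__(self):
        self.buffer = []

    def learn_from_buffer(self) -> None:
        pass  # a real agent would refit its (randomized) value function here

    def act(self, state) -> int:
        return random.choice([0, 1])

    def update_buffer(self, transition: Transition) -> None:
        self.buffer.append(transition)

def live(agent, environment, num_episodes: int) -> None:
    """Algorithm 1: episodic agent/environment interaction loop."""
    for _ in range(num_episodes):
        agent.learn_from_buffer()
        transition = environment.reset()
        while transition.new_state is not None:
            action = agent.act(transition.new_state)
            transition = environment.step(action)
            agent.update_buffer(transition)
```

Note that learning happens once per episode (line 2 of the pseudocode), while the buffer is updated after every environment step.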
Open Source Code | No | The paper makes no explicit statement about the release of source code for the methodology described, nor does it provide any direct links to a code repository.
Open Datasets | No | The paper describes generating environments for the 'deep-sea exploration problem' and using a modified 'cartpole problem' with specific initial conditions and dynamics, but does not provide access information or references to publicly available datasets for its experiments. For example: "We generate random deep-sea environments according to Example 1 and empirically evaluate performance over many simulations."
Dataset Splits | No | The paper describes experimental setups where agents learn over episodes in simulated environments but does not specify traditional training/validation/test dataset splits. Data is generated during the learning process, for example: "Each episode begins with s0 = (π, 0, 0, 0) + w for w_i ~ Unif([-0.05, 0.05]) i.i.d. in each component."
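The quoted initial-condition recipe for the swing-up cartpole variant can be written as a short sketch; the function name is hypothetical, and the four components follow the quoted state s0 = (π, 0, 0, 0) plus i.i.d. Uniform(-0.05, 0.05) noise.

```python
import math
import random

def cartpole_initial_state() -> tuple:
    """Sample the quoted swing-up initial condition:
    s0 = (pi, 0, 0, 0) + w, with each w_i ~ Unif(-0.05, 0.05) i.i.d."""
    base = (math.pi, 0.0, 0.0, 0.0)
    return tuple(b + random.uniform(-0.05, 0.05) for b in base)
```

Because data is generated on the fly like this, there is no fixed dataset to split.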
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as CPU or GPU models.
Software Dependencies | No | The paper mentions 'Pythonic pseudocode' and standard machine learning techniques such as a 'two-layer MLP with 50 rectified linear units' and 'Glorot initialization', but does not specify any software names with version numbers (e.g., a Python version, or a deep learning framework such as TensorFlow or PyTorch with its version).
Experiment Setup | Yes | We apply learn_ensemble_rlsvi (Algorithm 8) for K = 1, 5, 10, 20, 40 and with an ensemble buffer that stores the most recent 10^5 transitions. For update we use update_bootstrap (Algorithm 7) to approximate a 'double or nothing' online bootstrap (Owen and Eckles, 2012). We use a discounted TD loss with γ = 0.99, learning rate α = 10^-3, and minibatch size of 128. For our value function family Q we consider a two-layer MLP with 50 rectified linear units in each layer.
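The ensemble buffer with 'double or nothing' bootstrap weights described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's Algorithm 7/8: the class and method names are hypothetical, and the only idea taken from the setup is that each stored transition carries one weight per ensemble member, drawn as 0 or 2 with equal probability (Owen and Eckles, 2012), with the buffer capped at the most recent 10^5 transitions.

```python
import random

def double_or_nothing_masks(num_models: int) -> list:
    """Per-model 'double or nothing' bootstrap weights for one transition:
    each weight is 0 or 2, independently, with probability 1/2 each."""
    return [2 * random.randint(0, 1) for _ in range(num_models)]

class EnsembleBuffer:
    """Hypothetical ensemble replay buffer: member k trains only on its
    own reweighted view of the shared transition stream."""
    def __init__(self, num_models: int, capacity: int = 100_000):
        self.num_models = num_models
        self.capacity = capacity  # the setup stores the most recent 10^5 transitions
        self.data = []  # list of (transition, per-model weights)

    def add(self, transition) -> None:
        self.data.append((transition, double_or_nothing_masks(self.num_models)))
        if len(self.data) > self.capacity:
            self.data.pop(0)  # drop the oldest transition

    def sample_for_model(self, k: int, batch_size: int) -> list:
        """Minibatch for ensemble member k: weight-0 rows contribute nothing
        to that member's TD loss, weight-2 rows count double."""
        batch = random.sample(self.data, min(batch_size, len(self.data)))
        return [(transition, masks[k]) for transition, masks in batch]
```

In a full agent, each of the K ensemble members would run a gradient step on the discounted TD loss over its own weighted minibatch, which is what makes the members' value estimates diverge enough to drive deep exploration.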