Deep Exploration via Randomized Value Functions

Authors: Ian Osband, Benjamin Van Roy, Daniel J. Russo, Zheng Wen

JMLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present several reinforcement learning algorithms that leverage randomized value functions and demonstrate their efficacy through computational studies. We also prove a regret bound that establishes statistical efficiency with a tabular representation. Keywords: reinforcement learning, exploration, value function, neural network"
Researcher Affiliation | Collaboration | Ian Osband (EMAIL, DeepMind); Benjamin Van Roy (EMAIL, Stanford University); Daniel J. Russo (EMAIL, Columbia University); Zheng Wen (EMAIL, Adobe Research)
Pseudocode | Yes | Algorithm 1 (live). Input: agent methods act, update_buffer, learn_from_buffer; environment methods reset, step
1: for ℓ in (1, 2, ...) do
2:   agent.learn_from_buffer()
3:   transition ← environment.reset()
4:   while transition.new_state is not null do
5:     action ← agent.act(transition.new_state)
6:     transition ← environment.step(action)
7:     agent.update_buffer(transition)
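The interaction loop above can be sketched as runnable Python. This is a minimal illustration, not the paper's implementation: the Transition fields, ChainEnv, and RandomAgent are hypothetical stand-ins, chosen only so the three agent methods and two environment methods named in Algorithm 1 have something concrete to call.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Transition:
    old_state: Optional[int]
    action: Optional[int]
    reward: float
    new_state: Optional[int]  # None marks the end of an episode

class ChainEnv:
    """Hypothetical toy environment: a chain that ends after `horizon` steps."""
    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> Transition:
        self.t = 0
        return Transition(None, None, 0.0, 0)

    def step(self, action: int) -> Transition:
        self.t += 1
        new_state = self.t if self.t < self.horizon else None
        return Transition(self.t - 1, action, float(action), new_state)

class RandomAgent:
    """Placeholder agent exposing the three methods named in Algorithm 1."""
    def __init__(self):
        self.buffer = []

    def learn_from_buffer(self) -> None:
        pass  # a real agent would refit its (randomized) value function here

    def act(self, state) -> int:
        return random.choice([0, 1])

    def update_buffer(self, transition: Transition) -> None:
        self.buffer.append(transition)

def live(agent, environment, num_episodes: int) -> None:
    """Algorithm 1: episodic agent/environment interaction loop."""
    for _ in range(num_episodes):
        agent.learn_from_buffer()
        transition = environment.reset()
        while transition.new_state is not None:
            action = agent.act(transition.new_state)
            transition = environment.step(action)
            agent.update_buffer(transition)
```

Note that learning happens once per episode (line 2 of the pseudocode), while the buffer is updated after every environment step.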
Open Source Code | No | The paper makes no explicit statement about the release of source code for the methodology described, nor does it provide any direct links to a code repository.
Open Datasets | No | The paper describes generating environments for the 'deep-sea exploration problem' and using a modified 'cartpole problem' with specific initial conditions and dynamics, but does not provide access information or references to publicly available datasets for its experiments. For example: "We generate random deep-sea environments according to Example 1 and empirically evaluate performance over many simulations."
Dataset Splits | No | The paper describes experimental setups where agents learn over episodes in simulated environments but does not specify traditional training/validation/test dataset splits. Data is generated during the learning process, for example: "Each episode begins with s0 = (π, 0, 0, 0) + w for w_i ~ Unif([-0.05, 0.05]) i.i.d. in each component."
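The quoted initial-condition recipe for the swing-up cartpole variant can be written as a short sketch; the function name is hypothetical, and the four components follow the quoted state s0 = (π, 0, 0, 0) plus i.i.d. Uniform(-0.05, 0.05) noise.

```python
import math
import random

def cartpole_initial_state() -> tuple:
    """Sample the quoted swing-up initial condition:
    s0 = (pi, 0, 0, 0) + w, with each w_i ~ Unif(-0.05, 0.05) i.i.d."""
    base = (math.pi, 0.0, 0.0, 0.0)
    return tuple(b + random.uniform(-0.05, 0.05) for b in base)
```

Because data is generated on the fly like this, there is no fixed dataset to split.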
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as CPU or GPU models.
Software Dependencies | No | The paper mentions 'Pythonic pseudocode' and standard machine learning techniques such as a 'two-layer MLP with 50 rectified linear units' and 'Glorot initialization', but does not specify any software names with version numbers (e.g., a Python version, or a deep learning framework such as TensorFlow or PyTorch with its version).
Experiment Setup | Yes | We apply learn_ensemble_rlsvi (Algorithm 8) for K = 1, 5, 10, 20, 40 and with an ensemble buffer that stores the most recent 10^5 transitions. For update we use update_bootstrap (Algorithm 7) to approximate a 'double or nothing' online bootstrap (Owen and Eckles, 2012). We use a discounted TD loss with γ = 0.99, learning rate α = 10^-3, and minibatch size of 128. For our value function family Q we consider a two-layer MLP with 50 rectified linear units in each layer.
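The ensemble buffer with 'double or nothing' bootstrap weights described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's Algorithm 7/8: the class and method names are hypothetical, and the only idea taken from the setup is that each stored transition carries one weight per ensemble member, drawn as 0 or 2 with equal probability (Owen and Eckles, 2012), with the buffer capped at the most recent 10^5 transitions.

```python
import random

def double_or_nothing_masks(num_models: int) -> list:
    """Per-model 'double or nothing' bootstrap weights for one transition:
    each weight is 0 or 2, independently, with probability 1/2 each."""
    return [2 * random.randint(0, 1) for _ in range(num_models)]

class EnsembleBuffer:
    """Hypothetical ensemble replay buffer: member k trains only on its
    own reweighted view of the shared transition stream."""
    def __init__(self, num_models: int, capacity: int = 100_000):
        self.num_models = num_models
        self.capacity = capacity  # the setup stores the most recent 10^5 transitions
        self.data = []  # list of (transition, per-model weights)

    def add(self, transition) -> None:
        self.data.append((transition, double_or_nothing_masks(self.num_models)))
        if len(self.data) > self.capacity:
            self.data.pop(0)  # drop the oldest transition

    def sample_for_model(self, k: int, batch_size: int) -> list:
        """Minibatch for ensemble member k: weight-0 rows contribute nothing
        to that member's TD loss, weight-2 rows count double."""
        batch = random.sample(self.data, min(batch_size, len(self.data)))
        return [(transition, masks[k]) for transition, masks in batch]
```

In a full agent, each of the K ensemble members would run a gradient step on the discounted TD loss over its own weighted minibatch, which is what makes the members' value estimates diverge enough to drive deep exploration.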