Efficient Learning and Planning with Compressed Predictive States

Authors: William Hamilton, Mahdi Milani Fard, Joelle Pineau

JMLR 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The main empirical contribution of this work is the application of this approach to domains and sample-sizes of complexity not previously feasible for PSRs. Section 6 will highlight empirical results demonstrating the performance of the algorithm on some synthetic robot navigation domains and a difficult real-world application task based upon the ecological management of migratory bird species.
Researcher Affiliation | Academia | William Hamilton EMAIL Mahdi Milani Fard EMAIL Joelle Pineau EMAIL School of Computer Science, McGill University, Montreal, QC, Canada
Pseudocode | Yes |
Algorithm 1: Fitted-Q with CPSR
Inputs: a set D of tuples of the form (c_t, a_t, r_t, c_{t+1}), constructed using a CPSR model, where r_t is a numerical reward; R, a regression algorithm; γ, a discount factor; and T, a stopping condition
Outputs: a policy π
  Set Q̂_k(c_t, a) = 0 for all a ∈ A and all possible c_t
  repeat
    Build training set T = {(y^l, i^l), l = 1, ..., |D|}, where i^l = (c_t^l, a_t^l) and y^l = r_t^l + γ max_a Q̂_{k−1}(c_{t+1}^l, a)
    Apply R to approximate Q̂_k from T
  until T is met
  Output π, where π(c_t) = argmax_a Q̂_k(c_t, a)

Algorithm 2: Combined learning and planning
Inputs: π_s, a sampling policy; N, the number of sampling iterations; I_m, the number of trajectories to use in learning; and I_p, the number of trajectories to use in planning (I_m ≥ I_p)
Outputs: a CPSR model C and a policy π
  Initialize the CPSR model C
  for i = 1 to N do
    Sample I_m trajectories, Z_i, using π_s
    Update C using Z_i
    Sub-sample I_p trajectories from Z_i and use C to construct a tuple-set D_i
    D_i ← D_i ∪ D_{i−1}
    Apply Algorithm 1 with D_i to learn a policy π_i
    [Optional] Update π_s (e.g., using π_i)
  end for
  Output C and π_N
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described in this paper.
Open Datasets | Yes | The second domain used is based upon the partially observable PacMan domain, denoted PocMan, first proposed by Silver and Veness (2010). In Nicol et al. (2013) the AMM problem is formalized, and a simulator-based dataset is provided (for a number of species including the Lesser Sand Plover). ... See Nicol et al. (2013) for a complete description. ... This difficult real-world domain, which builds upon hand-crafted simulators and ecological data sets (Iwamura, 2011; Nicol et al., 2013)
Dataset Splits | Yes | All models used 10000 training trajectories (of max length 13) and were evaluated with 10000 trajectories.
Hardware Specification | Yes | Experiments were run on a machine with an 8-core 3.2 GHz Intel Xeon processor (x64 architecture) and 8GB of RAM.
Software Dependencies | No | The paper mentions software components like the "Extra-Trees algorithm" and "CPLEX", but does not provide specific version numbers for the software used in the experiments. For example, it only mentions the "Extra-Trees algorithm" without specifying its version or the library used to implement it.
Experiment Setup | Yes | For Colored Grid World, the models examined were identical to those described in the model quality experiments above. A discount factor of γ = 0.99 was used for this domain. ... For planning, we used Ip = 1000 with N = 1 and a random sampling strategy... And for the fitted-Q algorithm, we used 100 fitted-Q iterations, one Extra-Trees ensemble of 25 trees per action, and the default settings for the Extra-Trees (Geurts et al., 2006). ... For both PocMan and S-PocMan, we set d = 25 and examined compressed dimensions in the range [250, 500] (selecting only the top performer via a validation set); no TPSR models were used on these domains, as their construction exhausted the memory capacity of the machine used. Following Veness et al. (2011), for these domains we use γ = 0.99999 as a discount factor. ... We used a discount factor of γ = 0.99 for the AMM domain. For model-learning, we set d = 10 and examined compressed dimensions in the range [10, 100].
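The fitted-Q pseudocode quoted above (Algorithm 1) can be sketched in Python. This is a minimal, illustrative implementation, not the authors' code: it assumes compressed states are plain NumPy feature vectors, uses a fixed iteration count in place of the generic stopping condition T, and substitutes a single Extra-Trees ensemble over (state, action) features where the paper trains one ensemble per action.

```python
# Hedged sketch of fitted-Q iteration over CPSR-style tuples (c, a, r, c_next).
# Assumptions (not from the paper): c and c_next are 1-D NumPy arrays, actions
# are integer indices, and one regressor covers all actions.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q(D, actions, gamma=0.99, n_iters=10, make_regressor=None):
    """Return a greedy policy learned from tuples (c, a, r, c_next)."""
    if make_regressor is None:
        # The paper uses 25 trees per Extra-Trees ensemble.
        make_regressor = lambda: ExtraTreesRegressor(n_estimators=25,
                                                     random_state=0)
    Q = None  # Q-hat_0 is identically zero
    for _ in range(n_iters):
        X, y = [], []
        for c, a, r, c_next in D:
            if Q is None:
                target = r  # max_a Q-hat_0(c_next, a) = 0
            else:
                # y = r + gamma * max_a Q-hat_{k-1}(c_next, a)
                target = r + gamma * max(
                    Q.predict(np.append(c_next, a2).reshape(1, -1))[0]
                    for a2 in actions)
            X.append(np.append(c, a))
            y.append(target)
        Q = make_regressor().fit(np.asarray(X), np.asarray(y))

    def policy(c):
        # pi(c) = argmax_a Q-hat_k(c, a)
        return max(actions,
                   key=lambda a: Q.predict(np.append(c, a).reshape(1, -1))[0])
    return policy
```

Feeding in tuples produced by any state representation (not only a CPSR) works here, which reflects the paper's design: the planning step only sees feature vectors, so the model-learning and planning components are decoupled.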
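As a compact summary of the experiment-setup row, the reported hyperparameters can be collected into a single configuration table. The dictionary name and key names are invented for illustration; the values are taken directly from the quoted setup.

```python
# Illustrative configuration summary; values quoted from the paper's setup,
# structure and names are assumptions made here.
EXPERIMENT_CONFIG = {
    "ColoredGridWorld": {
        "gamma": 0.99, "I_p": 1000, "N": 1,
        "fitted_q_iterations": 100, "trees_per_action": 25,
    },
    "PocMan":   {"gamma": 0.99999, "d": 25, "compressed_dim_range": (250, 500)},
    "S-PocMan": {"gamma": 0.99999, "d": 25, "compressed_dim_range": (250, 500)},
    "AMM":      {"gamma": 0.99, "d": 10, "compressed_dim_range": (10, 100)},
}
```

Note the unusually high discount factor (γ = 0.99999) on the PocMan domains, adopted from Veness et al. (2011), versus the conventional γ = 0.99 elsewhere.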