Policy Evaluation with Temporal Differences: A Survey and Comparison

Authors: Christoph Dann, Gerhard Neumann, Jan Peters

JMLR 2014

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | "By presenting the first extensive, systematic comparative evaluations comparing TD, LSTD, LSPE, FPKF, the residual-gradient algorithm, Bellman residual minimization, GTD, GTD2 and TDC, we shed light on the strengths and weaknesses of the methods. Moreover, we present alternative versions of LSTD and LSPE with drastically improved off-policy performance."
Researcher Affiliation | Academia | Christoph Dann, Gerhard Neumann (Technische Universität Darmstadt, Karolinenplatz 5, 64289 Darmstadt, Germany); Jan Peters (Max Planck Institute for Intelligent Systems, Spemannstraße 38, 72076 Tübingen, Germany)
Pseudocode | Yes | Appendix C (Algorithms) provides pseudo-code listings of the update rules for all discussed temporal-difference algorithms. These updates are executed for each transition from s_t to s_{t+1}, performing action a_t and receiving reward r_t. The listings range from Algorithm 1 (TD(λ) Learning) to Algorithm 13 (residual-gradient algorithm without double samples).
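To make the per-transition update scheme concrete, here is a minimal TD(λ) sketch with linear function approximation. This is an illustrative reconstruction of the generic textbook update, not the paper's exact implementation; the feature tuples, step size `alpha`, trace decay `lam`, and discount `gamma` are assumed placeholders.

```python
import numpy as np

def td_lambda(transitions, n_features, alpha=0.1, lam=0.5, gamma=0.95):
    """Minimal TD(lambda) policy evaluation with linear features.

    transitions: iterable of (phi_t, r_t, phi_t1) tuples, where phi_t and
    phi_t1 are feature vectors of the states before and after a transition.
    """
    theta = np.zeros(n_features)   # value-function weights
    z = np.zeros(n_features)       # eligibility trace
    for phi_t, r_t, phi_t1 in transitions:
        # TD error for this transition
        delta = r_t + gamma * phi_t1 @ theta - phi_t @ theta
        # decay the trace and accumulate the current features
        z = gamma * lam * z + phi_t
        # TD(lambda) weight update
        theta += alpha * delta * z
    return theta
```

Each of the paper's thirteen listed algorithms replaces this inner update with its own rule (e.g., the least-squares matrix updates of LSTD, or the two-timescale corrections of GTD2/TDC).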
Open Source Code | Yes | All algorithms were implemented in Python. The source code for each method and experiment is available at http://github.com/chrodan/tdlearn.
Open Datasets | No | The paper describes several benchmark environments (e.g., the "14-State Boyan Chain" and "Baird's Star Example") and a "Randomly Sampled MDP", but provides no explicit links, DOIs, or citations to pre-collected datasets. The experimental setup involves simulating these environments or sampling the MDPs rather than loading static, publicly available data files.
Dataset Splits | No | The paper mentions generating data through "roll-outs" or "samples" from MDPs. It states, "We computed the algorithms' predictions with an increasing number of training data points". This indicates an online or simulated data-generation process rather than predefined train/validation/test splits of a static dataset.
Hardware Specification | Yes | The results are averages of 10 independent runs executed on a single core of an Intel i7 CPU.
Software Dependencies | No | All algorithms were implemented in Python; however, no specific Python version or versions of any other software libraries are mentioned.
Experiment Setup | Yes | The behavior of policy evaluation methods can be influenced by adjusting their hyperparameters. The authors set those parameters by performing an exhaustive grid search in the hyperparameter space, minimizing the MSBE (mean squared Bellman error, for the residual-gradient algorithm and BRM) or the MSPBE (mean squared projected Bellman error). Table 3 lists the values considered in the grid-search parameter optimization for the algorithms in Table 4.
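The exhaustive grid search described above can be sketched as follows. The grid values and the error callback are hypothetical stand-ins, not the actual entries of the paper's Table 3; in the paper the objective would be the MSBE or MSPBE of the learned value function.

```python
import itertools

import numpy as np

def grid_search(evaluate, grid):
    """Exhaustive search over a hyperparameter grid.

    grid: dict mapping parameter name -> list of candidate values.
    evaluate: callable taking a params dict and returning a scalar
    error criterion (e.g., MSBE or MSPBE) to be minimized.
    """
    names = list(grid)
    best_params, best_err = None, np.inf
    # try every combination of candidate values
    for combo in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        err = evaluate(params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err
```

For example, a grid over step size and trace decay such as `{"alpha": [0.01, 0.1, 0.5], "lam": [0.0, 0.5, 1.0]}` would evaluate all nine combinations and return the one with the lowest error.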