Deflated Dynamics Value Iteration

Authors: Jongmin Lee, Amin Rakhsha, Ernest K. Ryu, Amir-massoud Farahmand

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show the effectiveness of the proposed algorithms. Finally, in Section 6, we empirically evaluate the proposed methods and show their practical feasibility. For our experiments, we use the following environments: Maze with 5×5 states and 4 actions, Cliffwalk with 3×7 states and 4 actions, Chain Walk with 50 states and 2 actions, and random Garnet MDPs (Bhatnagar et al., 2009) with 200 states. The discount factor is set to γ = 0.99 for the comparison of DDVI with different ranks and for the DDTD experiments, and γ = 0.995 in other experiments. Appendix D provides full definitions of the environments and policies used for PE. All experiments were carried out on local CPUs. We report the normalized error of V_k, defined as ‖V_k − V^π‖₁ / ‖V^π‖₁.
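As a minimal sketch of the reported metric (the function and variable names are ours, not from the paper), the L1-normalized error between an iterate V_k and the true value V^π can be computed as:

```python
import numpy as np

def normalized_error(V_k, V_pi):
    """L1-normalized distance of the iterate V_k from the true value V^pi."""
    return np.linalg.norm(V_k - V_pi, 1) / np.linalg.norm(V_pi, 1)

V_pi = np.array([1.0, 2.0, 3.0])   # hypothetical true values
V_k = np.array([1.1, 1.9, 3.2])    # hypothetical iterate
err = normalized_error(V_k, V_pi)  # (0.1 + 0.1 + 0.2) / 6 ≈ 0.0667
```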
Researcher Affiliation | Academia | Jongmin Lee (Seoul National University); Amin Rakhsha (Department of Computer Science, University of Toronto; Vector Institute); Ernest K. Ryu (University of California, Los Angeles); Amir-massoud Farahmand (Polytechnique Montréal; Mila – Quebec AI Institute; University of Toronto)
Pseudocode | Yes | Algorithm 1: DDVI with Auto PI; Algorithm 2: Rank-s DDVI with the QR Iteration; Algorithm 3: Rank-s DDTD with the QR Iteration
Open Source Code | Yes | The source code for the experiments can be found at https://github.com/adaptive-agents-lab/ddvi.
Open Datasets | Yes | For our experiments, we use the following environments: Maze with 5×5 states and 4 actions, Cliffwalk with 3×7 states and 4 actions, Chain Walk with 50 states and 2 actions, and random Garnet MDPs (Bhatnagar et al., 2009) with 200 states. Garnet: We use the Garnet environment as described by Farahmand & Ghavamzadeh (2021); Rakhsha et al. (2022), which is based on Bhatnagar et al. (2009).
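The paper's Garnet construction follows Bhatnagar et al. (2009). As an illustrative sketch only (the function name, branching factor, and Dirichlet weighting here are assumptions, not the paper's exact recipe), a generic Garnet-style random transition model can be generated like this:

```python
import numpy as np

def garnet_transitions(n_states=200, n_actions=4, branching=3, rng=None):
    """Random Garnet-style transition tensor P[a, s, s'].

    For each state-action pair (s, a), probability mass is spread over
    `branching` uniformly chosen next states, with random weights.
    """
    rng = np.random.default_rng(rng)
    P = np.zeros((n_actions, n_states, n_states))
    for a in range(n_actions):
        for s in range(n_states):
            nxt = rng.choice(n_states, size=branching, replace=False)
            P[a, s, nxt] = rng.dirichlet(np.ones(branching))
    return P

P = garnet_transitions(n_states=10, n_actions=2, branching=3, rng=0)
# each row P[a, s, :] is a probability distribution over next states
```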
Dataset Splits | No | The paper uses Markov Decision Process (MDP) environments (Maze, Cliffwalk, Chain Walk, Garnet), which are simulated environments rather than static datasets with predefined splits. Data for these experiments is generated through agent interaction within the environment, and the paper does not specify traditional training/test/validation splits of a pre-collected dataset.
Hardware Specification | No | All experiments were carried out on local CPUs. (Section 6) This statement is too general, as it does not specify any particular CPU model, generation, or number of cores. It lacks the specific details required for reproducibility.
Software Dependencies | No | We use the Implicitly Restarted Arnoldi Method (Lehoucq et al., 1998) from the SciPy package to calculate the eigenvalues and eigenvectors for DDVI. This mentions the SciPy package but does not provide a specific version number, which is crucial for reproducibility.
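In SciPy, the Implicitly Restarted Arnoldi Method is exposed through the ARPACK wrapper `scipy.sparse.linalg.eigs`. As a minimal sketch (the matrix `P_pi` below is a randomly generated row-stochastic stand-in, not one of the paper's environments), the top-s eigenpairs can be computed like this:

```python
import numpy as np
from scipy.sparse.linalg import eigs

rng = np.random.default_rng(0)
n, s = 50, 3

# Hypothetical row-stochastic transition matrix for a fixed policy
M = rng.random((n, n))
P_pi = M / M.sum(axis=1, keepdims=True)

# Top-s eigenvalues/eigenvectors by magnitude, via ARPACK's
# implicitly restarted Arnoldi method
eigvals, eigvecs = eigs(P_pi, k=s, which="LM")
# A row-stochastic matrix has spectral radius 1, so max |lambda| ≈ 1
```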
Experiment Setup | Yes | For our experiments, we use the following environments: Maze with 5×5 states and 4 actions, Cliffwalk with 3×7 states and 4 actions, Chain Walk with 50 states and 2 actions, and random Garnet MDPs (Bhatnagar et al., 2009) with 200 states. The discount factor is set to γ = 0.99 for the comparison of DDVI with different ranks and for the DDTD experiments, and γ = 0.995 in other experiments. In all experiments, we set DDVI's α = 0.99. We perform an extensive comparison of DDVI against the prior accelerated VI methods... For PID VI, we set η = 0.05 and ϵ = 10⁻¹⁰. In Anderson VI, we have m = 5. The hyperparameters of TD Learning and DDTD are given in Tables 1, 2, and 3. (Appendix G)