Deflated Dynamics Value Iteration
Authors: Jongmin Lee, Amin Rakhsha, Ernest K. Ryu, Amir-massoud Farahmand
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show the effectiveness of the proposed algorithms. Finally, in Section 6, we empirically evaluate the proposed methods and show their practical feasibility. For our experiments, we use the following environments: Maze with 5×5 states and 4 actions, Cliffwalk with 3×7 states and 4 actions, Chain Walk with 50 states and 2 actions, and random Garnet MDPs (Bhatnagar et al., 2009) with 200 states. The discount factor is set to γ = 0.99 for the comparison of DDVI with different ranks and the DDTD experiments, and γ = 0.995 in other experiments. Appendix D provides full definitions of the environments and policies used for PE. All experiments were carried out on local CPUs. We report the normalized error of V_k defined as $\|V_k - V^\pi\|_1 / \|V^\pi\|_1$. |
| Researcher Affiliation | Academia | Jongmin Lee (Seoul National University); Amin Rakhsha (Department of Computer Science, University of Toronto; Vector Institute); Ernest K. Ryu (University of California, Los Angeles); Amir-massoud Farahmand (Polytechnique Montréal; Mila – Quebec AI Institute; University of Toronto) |
| Pseudocode | Yes | Algorithm 1 DDVI with Auto PI Algorithm 2 Rank-s DDVI with the QR Iteration. Algorithm 3 Rank-s DDTD with QR iteration |
| Open Source Code | Yes | The source code for the experiments can be found at https://github.com/adaptive-agents-lab/ddvi. |
| Open Datasets | Yes | For our experiments, we use the following environments: Maze with 5×5 states and 4 actions, Cliffwalk with 3×7 states and 4 actions, Chain Walk with 50 states and 2 actions, and random Garnet MDPs (Bhatnagar et al., 2009) with 200 states. Garnet: We use the Garnet environment as described by Farahmand & Ghavamzadeh (2021); Rakhsha et al. (2022), which is based on Bhatnagar et al. (2009). |
| Dataset Splits | No | The paper uses Markov Decision Process (MDP) environments (Maze, Cliffwalk, Chain Walk, Garnet) which are simulated environments rather than static datasets with predefined splits. Data for these experiments is generated through agent interaction within the environment, and the paper does not specify traditional training/test/validation splits of a pre-collected dataset. |
| Hardware Specification | No | All experiments were carried out on local CPUs. (Section 6) This statement is too general as it does not specify any particular CPU model, generation, or number of cores. It lacks the specific details required for reproducibility. |
| Software Dependencies | No | We use the Implicitly Restarted Arnoldi Method (Lehoucq et al., 1998) from the SciPy package to calculate the eigenvalues and eigenvectors for DDVI. This mentions the SciPy package but does not provide a specific version number, which is crucial for reproducibility. |
| Experiment Setup | Yes | For our experiments, we use the following environments: Maze with 5×5 states and 4 actions, Cliffwalk with 3×7 states and 4 actions, Chain Walk with 50 states and 2 actions, and random Garnet MDPs (Bhatnagar et al., 2009) with 200 states. The discount factor is set to γ = 0.99 for the comparison of DDVI with different ranks and the DDTD experiments, and γ = 0.995 in other experiments. In all experiments, we set DDVI's α = 0.99. We perform an extensive comparison of DDVI against the prior accelerated VI methods... For PID VI, we set η = 0.05 and ϵ = 10^{-10}. In Anderson VI, we have m = 5. The hyperparameters of TD Learning and DDTD are given in Tables 1, 2, and 3. (Appendix G) |
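The normalized-error metric quoted above can be checked end to end on a toy problem. The sketch below is an illustration only, using a small random MDP as a stand-in for the paper's environments (Maze, Cliffwalk, etc., which are not reproduced here): it runs plain value iteration for policy evaluation and tracks the paper's metric $\|V_k - V^\pi\|_1 / \|V^\pi\|_1$ against the exact solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 10, 0.99  # gamma = 0.99 as in the paper's DDVI-rank / DDTD experiments

# Random policy-induced transition matrix P (rows sum to 1) and reward vector r.
# These are hypothetical stand-ins, not the paper's environments.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_states)

# Exact policy value: V^pi = (I - gamma * P)^{-1} r
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P, r)

# Plain value iteration, recording the normalized error ||V_k - V^pi||_1 / ||V^pi||_1
v = np.zeros(n_states)
errors = []
for k in range(500):
    v = r + gamma * P @ v
    errors.append(np.abs(v - v_pi).sum() / np.abs(v_pi).sum())
```

The error contracts at roughly the rate gamma per iteration, which is exactly the slow convergence at gamma close to 1 that deflation-based acceleration targets.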