Rank-One Modified Value Iteration

Authors: Arman Sharifi Kolarijani, Tolga Ok, Peyman Mohajerin Esfahani, Mohamad Amin Sharifi Kolarijani

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through our extensive numerical simulations, however, we show that the proposed algorithm consistently outperforms first-order algorithms and their accelerated versions for both planning and learning problems. In Section 5, we provide the results of our extensive numerical simulations and compare the proposed algorithms with a range of existing algorithms for solving the optimal control problem of MDPs.
Researcher Affiliation | Academia | 1 Delft University of Technology, The Netherlands; 2 University of Toronto, Canada. Correspondence to: Arman S. Kolarijani <EMAIL>.
Pseudocode | Yes | Algorithm 1: Rank-One Value Iteration (R1-VI); Algorithm 2: Rank-One Q-Learning (R1-QL).
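The details of the rank-one modification are not included in this excerpt, but R1-VI builds on the standard value iteration baseline it is compared against. A minimal sketch of that baseline (function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Standard value iteration baseline; R1-VI modifies this update
    with a rank-one correction (details not shown in this excerpt).

    P: transition tensor of shape (A, S, S); r: rewards of shape (A, S).
    """
    A, S, _ = P.shape
    v = np.zeros(S)
    for _ in range(max_iter):
        # Bellman optimality operator T(v): max over actions of r + gamma * P v
        q = r + gamma * P @ v          # shape (A, S)
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```

On a fixed point, the returned v satisfies T(v) ≈ v up to the tolerance, which is exactly the Bellman error the experiments track.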
Open Source Code | No | The paper describes its proposed algorithms (R1-VI and R1-QL) and mentions adapting update rules for existing algorithms, but it does not provide any explicit statement about releasing its own implementation code, nor does it provide a link to a code repository.
Open Datasets | Yes | The experiments are conducted on Garnet (Archibald et al., 1995) and Graph MDPs (Devraj & Meyn, 2017), focusing on the Bellman errors ||T(v_k) − v_k|| and ||T(q_k) − q_k|| and the value errors ||v_k − v*|| and ||q_k − q*||.
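Garnet MDPs are randomly generated rather than downloaded, which is why the benchmark is reproducible from its parameters alone. A hedged sketch of one instance generator and the Bellman error tracked above (the parameterization follows the usual Garnet construction from Archibald et al., 1995; the function names are hypothetical):

```python
import numpy as np

def make_garnet(n_states, n_actions, branching, rng):
    """Generate one Garnet MDP instance: each (state, action) pair
    transitions to `branching` distinct uniformly chosen next states
    with Dirichlet-random probabilities; rewards are random."""
    P = np.zeros((n_actions, n_states, n_states))
    for a in range(n_actions):
        for s in range(n_states):
            nxt = rng.choice(n_states, size=branching, replace=False)
            P[a, s, nxt] = rng.dirichlet(np.ones(branching))
    r = rng.uniform(0.0, 1.0, size=(n_actions, n_states))
    return P, r

def bellman_error(P, r, gamma, v):
    """Sup-norm Bellman error ||T(v) - v|| reported in the experiments."""
    Tv = (r + gamma * P @ v).max(axis=0)
    return np.max(np.abs(Tv - v))
```

Averaging such errors over many seeded instances (e.g. the "25 randomly generated instances" mentioned below) gives the reported curves.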
Dataset Splits | No | The paper describes the generation of MDP instances for Garnet (e.g., "25 randomly generated instances of Garnet MDPs") and refers to configurations from cited works for Graph MDPs. However, it does not specify explicit training/testing/validation splits for a fixed dataset, such as percentages, sample counts, or references to standard benchmark splits for data partitioning.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the numerical simulations, such as GPU models, CPU types, or memory configurations. It only generally mentions "numerical simulations" and "experiments".
Software Dependencies | No | The paper mentions that update rules were adapted from a specific source for implementation ("We adapt the update rules provided by (Kolarijani & Mohajerin Esfahani, 2023) in our implementations for the planning algorithms."), but it does not specify any software libraries, frameworks, or tools with their version numbers (e.g., Python version, PyTorch/TensorFlow version, or specific solvers).
Experiment Setup | Yes | All the learning algorithms use the same samples generated through the training. Besides the proposed R1-QL Algorithm 2, we report the performance of Speedy QL (Ghavamzadeh et al., 2011), Zap QL (Devraj & Meyn, 2017), and the standard QL (5) in Garnet and Graph MDPs for several discount factors; see Appendix B.1 for the update rule of Speedy QL and Zap QL. We run each algorithm using the same step-size schedule (λ_k)_{k≥0}, namely the decaying step size λ_k = 1/(1 + k), to ensure a fair comparison.
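The fairness protocol above (shared samples, shared step-size schedule) can be sketched with the standard tabular QL baseline; R1-QL, Speedy QL, and Zap QL would consume the same transition stream with the same λ_k. A minimal sketch, assuming a uniform behavior policy (not specified in this excerpt):

```python
import numpy as np

def q_learning(P, r, gamma, n_steps, rng):
    """Standard tabular Q-learning with the shared step-size
    schedule lambda_k = 1/(1 + k) used for all compared algorithms.

    P: transition tensor of shape (A, S, S); r: rewards of shape (A, S).
    """
    A, S, _ = P.shape
    q = np.zeros((A, S))
    s = 0
    for k in range(n_steps):
        lam = 1.0 / (1.0 + k)               # decaying step size
        a = rng.integers(A)                 # uniform exploration (assumption)
        s_next = rng.choice(S, p=P[a, s])   # sample shared across algorithms
        target = r[a, s] + gamma * q[:, s_next].max()
        q[a, s] += lam * (target - q[a, s])
        s = s_next
    return q
```

Fixing the RNG seed before each algorithm's run is one simple way to realize "the same samples generated through the training".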