Efficient Reinforcement Learning in Probabilistic Reward Machines

Authors: Xiaofeng Lin, Xuezhou Zhang

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Complementing our theoretical findings, we show through extensive experimental evaluations that our algorithm indeed outperforms prior methods in various PRM environments. Finally, we conduct experiments to showcase the efficiency of UCBVI-PRM.
Researcher Affiliation Academia Boston University EMAIL
Pseudocode Yes Algorithm 1: UCBVI-PRM, Algorithm 2: bonus, Algorithm 3: Reward-free RL-Explore, Algorithm 4: Rewards-free-Plan
Open Source Code Yes Code https://github.com/XiaofengLin7/UCBVI-PRM
Open Datasets No The paper describes two environments, the 'River Swim environment' and the 'warehouse environment', which are custom implementations or standard RL environments configured as described in the text. It does not provide specific access information (link, DOI, formal citation) to a publicly available or open dataset used by the experiments in the traditional sense.
Dataset Splits No The paper describes experiments in reinforcement learning environments where data is generated through agent interaction, rather than using pre-defined datasets with explicit training, validation, or test splits. No such split information is provided.
Hardware Specification No The paper does not provide specific details regarding the hardware used to conduct the experiments, such as GPU or CPU models, memory, or cloud computing specifications.
Software Dependencies No The paper does not specify the version numbers for any software dependencies, libraries, or frameworks used in the implementation or experimentation.
Experiment Setup Yes In our experiment, we tune the exploration coefficient for all algorithms by selecting from an equally large set of options (see Appendix E.2). Specifically, Figures 3(a), 3(b), and 3(c) present the regrets of the agent running in a River Swim MDP with 5 observations and a horizon length of 10, a River Swim MDP with 10 observations and a horizon length of 20, and a River Swim MDP with 15 observations and a horizon length of 30, respectively. Figures 4(a), 4(b), and 4(c) present the results of the agent running in a 3x3 warehouse with a horizon length of 9, a 4x4 warehouse with a horizon length of 12, and a 5x5 warehouse with a horizon length of 15, respectively. Moving up, right, down, or left leads to moving in the intended direction with probability 0.7, in each perpendicular direction with probability 0.1, or staying in the same place with probability 0.1. The stay action results in the robot staying in the same place deterministically.
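The warehouse transition dynamics quoted above (intended move with probability 0.7, each perpendicular slip with probability 0.1, staying put with probability 0.1, and a deterministic stay action) can be sketched as follows. This is a minimal illustrative reimplementation, not the authors' released code; the function and variable names are our own, and we assume the grid clips moves at its walls.

```python
import random

# Sketch of the slippery warehouse-grid dynamics described in the
# experiment setup: a cardinal move succeeds with probability 0.7,
# slips to each perpendicular direction with probability 0.1, or
# leaves the robot in place with probability 0.1. "stay" is
# deterministic. Names and wall-clipping behavior are assumptions.

MOVES = {
    "up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1),
    "stay": (0, 0),
}
PERPENDICULAR = {
    "up": ("left", "right"), "down": ("left", "right"),
    "left": ("up", "down"), "right": ("up", "down"),
}

def transition_distribution(action):
    """Return {effective_move: probability} for the chosen action."""
    if action == "stay":
        return {"stay": 1.0}
    side_a, side_b = PERPENDICULAR[action]
    return {action: 0.7, side_a: 0.1, side_b: 0.1, "stay": 0.1}

def step(pos, action, size, rng=random):
    """Sample the next cell on a size x size grid, clipping at walls."""
    dist = transition_distribution(action)
    outcome = rng.choices(list(dist), weights=list(dist.values()))[0]
    dr, dc = MOVES[outcome]
    row = min(max(pos[0] + dr, 0), size - 1)
    col = min(max(pos[1] + dc, 0), size - 1)
    return (row, col)
```

For example, `step((0, 0), "up", 3)` can only land in `(0, 0)` (intended move or either slip clipped by the walls, or stay) or `(0, 1)` (a rightward slip), since the robot starts in the top-left corner.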