Efficient Reinforcement Learning in Probabilistic Reward Machines

Authors: Xiaofeng Lin, Xuezhou Zhang

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Complementing our theoretical findings, we show through extensive experimental evaluations that our algorithm indeed outperforms prior methods in various PRM environments. Finally, we conduct experiments to showcase the efficiency of UCBVI-PRM.
Researcher Affiliation Academia Boston University EMAIL
Pseudocode Yes Algorithm 1: UCBVI-PRM, Algorithm 2: bonus, Algorithm 3: Reward-free RL-Explore, Algorithm 4: Rewards-free-Plan
Open Source Code Yes Code https://github.com/XiaofengLin7/UCBVI-PRM
Open Datasets No The paper describes two environments, the 'River Swim environment' and the 'warehouse environment', which are custom implementations or standard RL environments configured as described in the text. It does not provide specific access information (link, DOI, formal citation) to a publicly available or open dataset used by the experiments in the traditional sense.
Dataset Splits No The paper describes experiments in reinforcement learning environments where data is generated through agent interaction, rather than using pre-defined datasets with explicit training, validation, or test splits. No such split information is provided.
Hardware Specification No The paper does not provide specific details regarding the hardware used to conduct the experiments, such as GPU or CPU models, memory, or cloud computing specifications.
Software Dependencies No The paper does not specify the version numbers for any software dependencies, libraries, or frameworks used in the implementation or experimentation.
Experiment Setup Yes In our experiment, we tune the exploration coefficient for all algorithms by selecting from an equally large set of options (see Appendix E.2). Specifically, Figures 3(a), 3(b), and 3(c) present the regrets of the agent running in a River Swim MDP with 5 observations and a horizon length of 10, a River Swim MDP with 10 observations and a horizon length of 20, and a River Swim MDP with 15 observations and a horizon length of 30, respectively. Figures 4(a), 4(b), and 4(c) present the results of the agent running in a 3x3 warehouse with a horizon length of 9, a 4x4 warehouse with a horizon length of 12, and a 5x5 warehouse with a horizon length of 15, respectively. Moving up, right, down, or left leads to moving in the intended direction with probability 0.7, in each perpendicular direction with probability 0.1, or staying in the same place with probability 0.1. The stay action results in the robot staying in the same place deterministically.
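The warehouse transition dynamics quoted above (intended move with probability 0.7, each perpendicular slip with probability 0.1, staying put with probability 0.1, and a deterministic stay action) can be sketched as follows. This is a minimal illustrative reimplementation, not the authors' released code; the function and variable names are our own, and we assume the grid clips moves at its walls.

```python
import random

# Sketch of the slippery warehouse-grid dynamics described in the
# experiment setup: a cardinal move succeeds with probability 0.7,
# slips to each perpendicular direction with probability 0.1, or
# leaves the robot in place with probability 0.1. "stay" is
# deterministic. Names and wall-clipping behavior are assumptions.

MOVES = {
    "up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1),
    "stay": (0, 0),
}
PERPENDICULAR = {
    "up": ("left", "right"), "down": ("left", "right"),
    "left": ("up", "down"), "right": ("up", "down"),
}

def transition_distribution(action):
    """Return {effective_move: probability} for the chosen action."""
    if action == "stay":
        return {"stay": 1.0}
    side_a, side_b = PERPENDICULAR[action]
    return {action: 0.7, side_a: 0.1, side_b: 0.1, "stay": 0.1}

def step(pos, action, size, rng=random):
    """Sample the next cell on a size x size grid, clipping at walls."""
    dist = transition_distribution(action)
    outcome = rng.choices(list(dist), weights=list(dist.values()))[0]
    dr, dc = MOVES[outcome]
    row = min(max(pos[0] + dr, 0), size - 1)
    col = min(max(pos[1] + dc, 0), size - 1)
    return (row, col)
```

For example, `step((0, 0), "up", 3)` can only land in `(0, 0)` (intended move or either slip clipped by the walls, or stay) or `(0, 1)` (a rightward slip), since the robot starts in the top-left corner.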