Robust Reward Design for Markov Decision Processes

Authors: Shuo Wu, Haoxiang Ma, Jie Fu, Shuo Han

JAIR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments on multiple test cases demonstrate that our solution improves robustness compared to the standard approach without incurring significant additional computing costs.
Researcher Affiliation | Academia | Shuo Wu, University of Illinois Chicago, USA; Haoxiang Ma, University of Florida, USA; Jie Fu, University of Florida, USA; Shuo Han, University of Illinois Chicago, USA
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It refers to MILP formulations and discusses algorithms conceptually.
Open Source Code | Yes | Code for the numerical experiments is provided at the URL in Footnote 1: https://github.com/fribuilder/robust-reward-design
Open Datasets | No | The paper uses custom-designed environments: a 6x6 grid world, a 10x10 grid world, and a probabilistic attack graph. It does not provide access information (links, DOIs, or citations to public repositories) for these or any other datasets.
Dataset Splits | No | The paper describes the simulation environments (grid worlds, attack graph) and their initial conditions, but it does not specify training, validation, or test splits, as the experiments do not involve traditional datasets.
Hardware Specification | Yes | All numerical experiments were performed on a MacBook Air laptop with an Apple M2 processor and 8 GB RAM running macOS Sonoma 14.3.1.
Software Dependencies | Yes | The interior-point solutions and MILP solutions in different environments are computed using the Python MIP package with Gurobi 11.0.0.
Experiment Setup | Yes | The allocated reward at each state must be nonnegative, and the total allocation budget is 4. ... The parameter τ in (12) reflects the level of rationality of the attacker. ... We then computed vε(xMILP) and vε(xIP) under different values of ε by solving problem (23). ... The bisection was initialized with lower and upper bounds of 0 and C, respectively, where C is the total reward budget.
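The bisection described in the experiment setup can be sketched generically. The feasibility predicate below is a hypothetical stand-in for solving the paper's subproblem at a candidate value; only the initialization with bounds 0 and C (the total reward budget) is taken from the source.

```python
def bisect_threshold(is_feasible, lo, hi, tol=1e-6):
    """Bisection on [lo, hi] for the largest value at which is_feasible
    holds, assuming is_feasible is monotone (True below some threshold)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_feasible(mid):
            lo = mid   # threshold lies at or above mid
        else:
            hi = mid   # threshold lies below mid
    return lo

# Hypothetical monotone predicate standing in for the paper's subproblem;
# bounds 0 and C = 4 mirror the total reward budget in the experiments.
C = 4
threshold = bisect_threshold(lambda x: x <= 2.5, 0.0, C)
```

In the paper's setting, the predicate would be replaced by solving the corresponding optimization subproblem at the candidate value; the interval halves each iteration, so convergence to tolerance `tol` takes O(log(C/tol)) subproblem solves.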