Robust Reward Design for Markov Decision Processes
Authors: Shuo Wu, Haoxiang Ma, Jie Fu, Shuo Han
JAIR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments on multiple test cases demonstrate that our solution improves robustness compared to the standard approach without incurring significant additional computing costs. |
| Researcher Affiliation | Academia | Shuo Wu (University of Illinois Chicago, USA); Haoxiang Ma (University of Florida, USA); Jie Fu (University of Florida, USA); Shuo Han (University of Illinois Chicago, USA) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It refers to MILP formulations and discusses algorithms conceptually. |
| Open Source Code | Yes | Code for the numerical experiments is provided at https://github.com/fribuilder/robust-reward-design (Footnote 1). |
| Open Datasets | No | The paper uses custom-designed environments: a 6x6 grid world, a 10x10 grid world, and a probabilistic attack graph. It does not provide access information (links, DOIs, or citations to public repositories) for these or any other datasets. |
| Dataset Splits | No | The paper describes simulation environments (grid worlds, attack graph) and their initial conditions, but it does not specify any training, testing, or validation dataset splits, as these are not traditional datasets in the context of the experiments conducted. |
| Hardware Specification | Yes | All numerical experiments were performed on a MacBook Air laptop computer with an Apple M2 processor and 8 GB RAM running macOS Sonoma 14.3.1. |
| Software Dependencies | Yes | The interior-point solutions and MILP solutions in different environments are computed using the Python MIP package with Gurobi 11.0.0. |
| Experiment Setup | Yes | The reward allocated at each state must be nonnegative, and the total allocation budget is 4. ... The parameter τ in (12) reflects the level of rationality of the attacker... We then computed vε(xMILP) and vε(xIP) under different values of ε by solving problem (23). ... The bisection was initialized with lower and upper bounds of 0 and C, respectively, where C is the total reward budget. |
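The bisection procedure quoted above is only described in words; a minimal sketch of what such a search might look like is given below. The function name `bisect_threshold`, the monotone predicate `feasible`, and the tolerance are illustrative assumptions, not taken from the paper; only the initialization with lower and upper bounds 0 and C (the total reward budget) follows the quoted setup.

```python
def bisect_threshold(feasible, C, tol=1e-6):
    """Hypothetical bisection sketch: find the largest value in [0, C] at
    which the monotone predicate `feasible` still holds (assumed True below
    some unknown threshold and False above it)."""
    lo, hi = 0.0, C  # initialized with the total reward budget C, as in the paper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            lo = mid  # threshold lies at or above mid
        else:
            hi = mid  # threshold lies below mid
    return lo
```

The monotonicity assumption is what makes bisection applicable: each query halves the interval, so the search reaches tolerance `tol` in O(log(C / tol)) evaluations of `feasible`.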