Adaptive Incentive Design for Markov Decision Processes with Unknown Rewards

Authors: Haoxiang Ma, Shuo Han, Ahmed Hemida, Charles A. Kamhoua, Jie Fu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate the proposed methods with two sets of examples: one is a probabilistic graph-based MDP and the other is a stochastic gridworld. For all case studies, the workstation used is powered by an Intel i7-11700K and 32 GB RAM. In the experiments, we constrain the leader to allocate side-payments to a subset of state-action pairs. This constraint reduces the computation for gradient calculation (recall the complexity analysis in Section 3.1) and can be removed if more powerful GPUs are used. Figure 3a shows the leader's total payoff J(ν, x, π), the leader's expected value V1(π), the follower's expected value V2(π) given the follower's best-response policy and the reward R2(·, x), and the value of ‖x‖₁. When the leader does not offer any side-payment, the leader's expected value is 0.152 and the follower's expected value is 0.531.
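The no-payment baseline quoted above (leader's value 0.152) is easy to sanity-check if one assumes, as is common in incentive-design formulations, that the leader's total payoff is its expected value minus the cost of side-payments; the paper's exact definition of J(ν, x, π) may differ, so this is only an illustrative sketch:

```python
def total_payoff(leader_value, payment_cost):
    """Assumed decomposition: total payoff = leader's value minus payment cost.

    This is a hypothetical helper for illustration, not the paper's code;
    the exact form of J(nu, x, pi) is defined in the paper itself.
    """
    return leader_value - payment_cost

# With no side-payment offered, h(x) = 0, so J reduces to V1 = 0.152.
baseline = total_payoff(0.152, 0.0)
```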
Researcher Affiliation | Collaboration | Haoxiang Ma (Department of Electrical and Computer Engineering, University of Florida), Shuo Han (Department of Electrical and Computer Engineering, University of Illinois Chicago), Ahmed Hemida (DEVCOM Army Research Laboratory), Charles Kamhoua (DEVCOM Army Research Laboratory), Jie Fu (Department of Electrical and Computer Engineering, University of Florida)
Pseudocode | Yes | Algorithm 1 describes the gradient-ascent method. Algorithm 2 describes an adaptive incentive design without knowing P2's reward function.
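Algorithm 1 itself is given in the paper; the overall shape of such a method — projected gradient ascent on the side-payment vector x — can be sketched as follows, assuming a hypothetical sample-based estimator `estimate_gradient` for ∇J (the names, projection, and loop structure here are illustrative, not taken from the released code):

```python
import numpy as np

def gradient_ascent(estimate_gradient, x0, alpha=0.1, iters=100):
    """Projected gradient ascent on the side-payment vector x.

    estimate_gradient: callable returning an estimate of dJ/dx
    (a hypothetical stand-in for the paper's trajectory-based estimator).
    Side-payments are kept non-negative by projecting onto x >= 0.
    """
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        g = estimate_gradient(x)
        x = np.maximum(x + alpha * g, 0.0)  # ascent step, then project
    return x
```

With a step size of 0.1 (matching the α reported in the experiment setup), the iterate contracts toward a stationary point of J for well-behaved objectives.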
Open Source Code | Yes | Code: https://github.com/alexalvis/Incentive Follower
Open Datasets | Yes | The effectiveness of the proposed algorithm is demonstrated using a small probabilistic transition system and a stochastic gridworld. We first consider a small MDP in which the follower has four actions {a, b, c, d}. For clarity, the graph in Fig. 2 only shows the transitions given action a, where a thick (resp. thin) arrow represents a high (resp. low) transition probability; self-loops are omitted. For example, P(0, a) = {1: 0.7, 2: 0.1, 3: 0.1, 4: 0.1}. In the second example, we consider a robot motion-planning problem in a stochastic gridworld. Consider the six-by-six gridworld in Figure 4.
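The quoted transition P(0, a) = {1: 0.7, 2: 0.1, 3: 0.1, 4: 0.1} can be encoded as a nested dictionary, a common representation for small tabular MDPs (the structure below is illustrative; it is not taken from the released repository):

```python
# Transition model for the small MDP: P[state][action] maps successor
# states to probabilities. Only action "a" from state 0 is quoted in
# the text; the remaining entries would be filled in analogously.
P = {
    0: {
        "a": {1: 0.7, 2: 0.1, 3: 0.1, 4: 0.1},
    },
}

def successors(P, s, a):
    """Return (next_state, probability) pairs for state-action (s, a)."""
    return list(P[s][a].items())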
Dataset Splits | No | The paper analyzes results on custom-generated environments (a small MDP and a stochastic gridworld) rather than pre-existing datasets that would typically require train/test/validation splits. It mentions "trajectory sample sizes" for gradient estimation, but this refers to the number of samples collected during interaction with the environment, not a split of a static dataset.
Hardware Specification | Yes | For all case studies, the workstation used is powered by an Intel i7-11700K and 32 GB RAM.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers, such as programming languages, libraries, or frameworks. While code is provided, this information is not present in the main text.
Experiment Setup | Yes | The discount factor γ is 0.95 in this example. We consider a side-payment cost function of the form h(x) = c‖x‖₁. We conduct the experiment for c ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. In our experiments, the step size α is selected to be 0.1. Figure 3b shows that the leader's total payoff converges over iterations, given different trajectory sample sizes used in the gradient estimates.
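The cost function h(x) = c‖x‖₁ from the quoted setup is a weighted L1 norm of the side-payment vector, and the sweep over c mirrors the values reported; a minimal sketch (the payment vector below is hypothetical, chosen only for illustration):

```python
import numpy as np

def side_payment_cost(x, c):
    """Side-payment cost h(x) = c * ||x||_1."""
    return c * np.sum(np.abs(x))

# Sweep over the cost coefficients used in the experiments.
x = np.array([0.2, 0.0, 0.5])  # hypothetical side-payment vector
costs = {c: side_payment_cost(x, c) for c in (0.1, 0.2, 0.3, 0.4, 0.5)}
```

Because side-payments are non-negative in this setting, ‖x‖₁ reduces to the sum of the payments, so the cost scales linearly in both c and the total payment offered.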