Adaptive Incentive Design for Markov Decision Processes with Unknown Rewards
Authors: Haoxiang Ma, Shuo Han, Ahmed Hemida, Charles A. Kamhoua, Jie Fu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate the proposed methods with two sets of examples: one is a probabilistic graph-based MDP and the other is a stochastic gridworld. For all case studies, the workstation used is powered by an Intel i7-11700K and 32GB RAM. In experiments, we constrain the leader to allocate side-payments on a subset of state-action pairs. This constraint reduces the computation for the gradient calculation (recall the complexity analysis in Section 3.1) and can be removed if more powerful GPUs are used. Figure 3a shows the leader's total payoff (J(ν, x, π)), the leader's expected value (V1(π)), the follower's expected value (V2(π)) given the follower's best-response policy and the reward R2(·, x), and the value of ∥x∥₁. When the leader does not offer any side-payment, the leader's expected value is 0.152, and the follower's expected value is 0.531. |
| Researcher Affiliation | Collaboration | Haoxiang Ma EMAIL, Department of Electrical and Computer Engineering, University of Florida; Shuo Han EMAIL, Department of Electrical and Computer Engineering, University of Illinois Chicago; Ahmed Hemida EMAIL, DEVCOM Army Research Laboratory; Charles Kamhoua EMAIL, DEVCOM Army Research Laboratory; Jie Fu EMAIL, Department of Electrical and Computer Engineering, University of Florida |
| Pseudocode | Yes | Algorithm 1 describes the gradient ascent method. Algorithm 2 describes adaptive incentive design without knowing P2's reward function. |
| Open Source Code | Yes | 1Code: https://github.com/alexalvis/Incentive Follower |
| Open Datasets | Yes | The effectiveness of the proposed algorithm is demonstrated using a small probabilistic transition system and a stochastic gridworld. We first consider a small MDP in which the follower has four actions {a, b, c, d}. For clarity, the graph in Fig. 2 only shows the transitions given action a, where a thick (resp. thin) arrow represents a high (resp. low) transition probability; self-loops are omitted. For example, P(0, a) = {1: 0.7, 2: 0.1, 3: 0.1, 4: 0.1}. In the second example, we consider a robot motion-planning problem in a stochastic gridworld. Consider the six-by-six gridworld in Figure 4. |
| Dataset Splits | No | The paper analyzes results on custom-generated environments (a small MDP and a stochastic gridworld) rather than pre-existing datasets that would typically require train/test/validation splits. It mentions "trajectory sample sizes" for gradient estimation, but this refers to the number of samples collected during interaction with the environment, not a split of a static dataset. |
| Hardware Specification | Yes | For all case studies, the workstation used is powered by Intel i7-11700K and 32GB RAM. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers, such as programming languages, libraries, or frameworks. While code is provided, this information is not present in the main text. |
| Experiment Setup | Yes | The discounting factor γ is 0.95 in this example. We consider the cost function of making side-payments in the form h(x) = c∥x∥₁. We conduct the experiment for c ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. In our experiments, α is selected to be 0.1. Figure 3b shows that the value of the leader's total payoff converges over iterations, given different trajectory sample sizes used in the gradient estimates. |
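The reported setup (gradient ascent on the side-payment vector x, payment cost h(x) = c∥x∥₁, step size α = 0.1) can be illustrated with a minimal sketch. This is not the paper's implementation: the objective `leader_payoff` below is a hypothetical stand-in for J(ν, x, π), and the finite-difference gradient replaces the paper's trajectory-sampled gradient estimates; only the cost form h(x) = c∥x∥₁, the step size α, and the nonnegativity of side-payments are taken from the source.

```python
import numpy as np

ALPHA = 0.1    # step size alpha from the experiment setup
C = 0.3        # payment cost coefficient, one value from {0.1, ..., 0.5}
GAMMA = 0.95   # discounting factor (unused in this toy objective)

def leader_payoff(x, c=C):
    """Toy stand-in for J(nu, x, pi): a placeholder leader value minus the
    payment cost h(x) = c * ||x||_1.  A real implementation would compute the
    follower's best response to x and evaluate the leader's value V1 under it."""
    v1 = -np.sum((x - 0.5) ** 2)           # hypothetical leader value term
    return v1 - c * np.sum(np.abs(x))      # subtract h(x) = c * ||x||_1

def grad_estimate(f, x, eps=1e-4):
    """Central finite-difference gradient; the paper instead estimates
    gradients from sampled trajectories."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# Projected gradient ascent: side-payments on a subset of (s, a) pairs,
# kept nonnegative after each update.
x = np.zeros(4)
for _ in range(200):
    x = np.maximum(x + ALPHA * grad_estimate(leader_payoff, x), 0.0)
```

For this quadratic stand-in the iterates converge to x_i = 0.5 - c/2 per coordinate; with the paper's actual objective the loop structure is the same but each gradient evaluation requires re-solving the follower's best response.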