Adaptive Incentive Design for Markov Decision Processes with Unknown Rewards

Authors: Haoxiang Ma, Shuo Han, Ahmed Hemida, Charles A. Kamhoua, Jie Fu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate the proposed methods with two sets of examples: one is a probabilistic graph-based MDP and the other is a stochastic gridworld. For all case studies, the workstation used is powered by an Intel i7-11700K and 32 GB RAM. In the experiments, we constrain the leader to allocate side-payments to a subset of state-action pairs. This constraint reduces the computation for gradient calculation (recall the complexity analysis in Section 3.1) and can be removed if more powerful GPUs are used. Figure 3a shows the leader's total payoff J(ν, x, π), the leader's expected value V1(π), the follower's expected value V2(π) given the follower's best-response policy and the reward R2(·, x), and the value of ‖x‖₁. When the leader does not offer any side-payment, the leader's expected value is 0.152 and the follower's expected value is 0.531.
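The no-payment baseline quoted above (leader's value 0.152) is easy to sanity-check if one assumes, as is common in incentive-design formulations, that the leader's total payoff is its expected value minus the cost of side-payments; the paper's exact definition of J(ν, x, π) may differ, so this is only an illustrative sketch:

```python
def total_payoff(leader_value, payment_cost):
    """Assumed decomposition: total payoff = leader's value minus payment cost.

    This is a hypothetical helper for illustration, not the paper's code;
    the exact form of J(nu, x, pi) is defined in the paper itself.
    """
    return leader_value - payment_cost

# With no side-payment offered, h(x) = 0, so J reduces to V1 = 0.152.
baseline = total_payoff(0.152, 0.0)
```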
Researcher Affiliation | Collaboration | Haoxiang Ma (Department of Electrical and Computer Engineering, University of Florida), Shuo Han (Department of Electrical and Computer Engineering, University of Illinois Chicago), Ahmed Hemida (DEVCOM Army Research Laboratory), Charles Kamhoua (DEVCOM Army Research Laboratory), Jie Fu (Department of Electrical and Computer Engineering, University of Florida)
Pseudocode | Yes | Algorithm 1 describes the gradient-ascent method. Algorithm 2 describes an adaptive incentive design without knowing P2's reward function.
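Algorithm 1 itself is given in the paper; the overall shape of such a method — projected gradient ascent on the side-payment vector x — can be sketched as follows, assuming a hypothetical sample-based estimator `estimate_gradient` for ∇J (the names, projection, and loop structure here are illustrative, not taken from the released code):

```python
import numpy as np

def gradient_ascent(estimate_gradient, x0, alpha=0.1, iters=100):
    """Projected gradient ascent on the side-payment vector x.

    estimate_gradient: callable returning an estimate of dJ/dx
    (a hypothetical stand-in for the paper's trajectory-based estimator).
    Side-payments are kept non-negative by projecting onto x >= 0.
    """
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        g = estimate_gradient(x)
        x = np.maximum(x + alpha * g, 0.0)  # ascent step, then project
    return x
```

With a step size of 0.1 (matching the α reported in the experiment setup), the iterate contracts toward a stationary point of J for well-behaved objectives.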
Open Source Code | Yes | Code: https://github.com/alexalvis/Incentive Follower
Open Datasets | Yes | The effectiveness of the proposed algorithm is demonstrated using a small probabilistic transition system and a stochastic gridworld. We first consider a small MDP in which the follower has four actions {a, b, c, d}. For clarity, the graph in Fig. 2 only shows the transitions given action a, where a thick (resp. thin) arrow represents a high (resp. low) transition probability; self-loops are omitted. For example, P(0, a) = {1: 0.7, 2: 0.1, 3: 0.1, 4: 0.1}. In the second example, we consider a robot motion-planning problem in a stochastic gridworld. Consider the six-by-six gridworld in Figure 4.
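The quoted transition P(0, a) = {1: 0.7, 2: 0.1, 3: 0.1, 4: 0.1} can be encoded as a nested dictionary, a common representation for small tabular MDPs (the structure below is illustrative; it is not taken from the released repository):

```python
# Transition model for the small MDP: P[state][action] maps successor
# states to probabilities. Only action "a" from state 0 is quoted in
# the text; the remaining entries would be filled in analogously.
P = {
    0: {
        "a": {1: 0.7, 2: 0.1, 3: 0.1, 4: 0.1},
    },
}

def successors(P, s, a):
    """Return (next_state, probability) pairs for state-action (s, a)."""
    return list(P[s][a].items())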
Dataset Splits | No | The paper analyzes results on custom-generated environments (a small MDP and a stochastic gridworld) rather than pre-existing datasets that would typically require train/test/validation splits. It mentions "trajectory sample sizes" for gradient estimation, but this refers to the number of samples collected during interaction with the environment, not a split of a static dataset.
Hardware Specification | Yes | For all case studies, the workstation used is powered by an Intel i7-11700K and 32 GB RAM.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers, such as programming languages, libraries, or frameworks. While code is provided, this information is not present in the main text.
Experiment Setup | Yes | The discount factor γ is 0.95 in this example. We consider a side-payment cost function of the form h(x) = c‖x‖₁. We conduct the experiment for c ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. In our experiments, the step size α is selected to be 0.1. Figure 3b shows that the leader's total payoff converges over iterations, given different trajectory sample sizes used in the gradient estimates.
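The cost function h(x) = c‖x‖₁ from the quoted setup is a weighted L1 norm of the side-payment vector, and the sweep over c mirrors the values reported; a minimal sketch (the payment vector below is hypothetical, chosen only for illustration):

```python
import numpy as np

def side_payment_cost(x, c):
    """Side-payment cost h(x) = c * ||x||_1."""
    return c * np.sum(np.abs(x))

# Sweep over the cost coefficients used in the experiments.
x = np.array([0.2, 0.0, 0.5])  # hypothetical side-payment vector
costs = {c: side_payment_cost(x, c) for c in (0.1, 0.2, 0.3, 0.4, 0.5)}
```

Because side-payments are non-negative in this setting, ‖x‖₁ reduces to the sum of the payments, so the cost scales linearly in both c and the total payment offered.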