Learning to Steer Markovian Agents under Model Uncertainty

Authors: Jiawei Huang, Vinzenz Thoma, Zebang Shen, Heinrich Nax, Niao He

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical Validation: "In Sec. 6, we evaluate our algorithms in various representative environments, and demonstrate their effectiveness under model uncertainty."
Researcher Affiliation | Academia | Jiawei Huang, Zebang Shen, Niao He: Department of Computer Science, ETH Zurich; Vinzenz Thoma: ETH AI Center; Heinrich H. Nax: University of Zurich
Pseudocode | Yes | Procedure 1: The Steering Procedure when |F| is Small; Procedure 2: The Steering Procedure when |F| is Large (The FETE Framework); Algorithm 3: Learning with Known Steering Dynamics; Algorithm 4: Solving Obj. (1) by Learning Belief State-Dependent Strategy
Open Source Code | Yes | Reproducibility Statement: "The code of all the experiments in this paper and the instructions for running can be found in https://github.com/jiaweihhuang/Steering_Markovian_Agents."
Open Datasets | No | The paper describes experiments on the Normal-Form Stag Hunt Game, the Grid World Stag Hunt Game, and Matching Pennies, which are game environments or theoretical setups. It does not mention using any external, publicly available datasets with concrete access information such as links or citations to specific datasets.
Dataset Splits | No | The paper does not provide dataset split information (e.g., percentages or sample counts for train/test/validation sets). It mentions initial policy generation for evaluation (e.g., "averaged over 5x5 uniformly distributed grids as initializations of π1"), but this pertains to initial conditions for simulations rather than data splits.
Hardware Specification | Yes | G.5 A Summary of the Compute Resources by Experiments in This Paper: For the experiments on two-player normal-form games, Stag Hunt and Matching Pennies (illustrated in Fig. 1, 5, 6), we only use CPUs (AMD EPYC 7742 64-Core Processor); training takes less than 5 hours. For the experiments on the grid-world version of Stag Hunt (illustrated in Fig. 2), we use one RTX 3090 and fewer than 5 CPUs (AMD EPYC 7742 64-Core Processor).
Software Dependencies | No | The paper mentions software such as the "PPO implementation of Stable Baseline3 (Raffin et al., 2021)" but does not specify version numbers for these software components or libraries, which a reproducible description requires.
Experiment Setup | Yes | Both agents follow the exact NPG (Def. 4.1) with a fixed learning rate α = 0.01. For the steering setup, we choose the total utility as η_goal, and use PPO to train the steering strategy [...]. The maximal steering reward U_max is set to 10, and we choose β = 25. [...] The agents adopt a CNN, and utilize PPO to optimize the CNN parameters with learning rate 0.005. [...] We choose β = 25 and learning rate 0.001. [...] We set U_max = 1.0.
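The setup above can be illustrated with a minimal sketch of exact NPG dynamics in a 2x2 Stag Hunt under a bounded steering reward. Assumptions not taken from the paper: the payoff values, the number of update steps, and the use of a constant per-action Stag bonus as a stand-in for the PPO-trained steering strategy; only α = 0.01 and the U_max cap come from the quoted setup.

```python
import numpy as np

# Hypothetical symmetric Stag Hunt payoffs (row: own action 0=Stag, 1=Hare;
# column: opponent action). Values chosen so that Hare is risk-dominant.
PAYOFF = np.array([[5.0, 0.0],
                   [4.0, 2.0]])
ALPHA = 0.01   # NPG learning rate from the paper's setup
U_MAX = 10.0   # cap on the per-step steering reward (paper: U_max = 10)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run(steer_stag=0.0, steps=20000):
    """Exact NPG for softmax policies in a symmetric 2x2 game: the natural
    gradient step adds ALPHA * advantage directly to the logits.
    `steer_stag` is an extra reward paid for playing Stag, a constant
    stand-in for the learned steering strategy, clipped at U_MAX."""
    bonus = np.array([min(steer_stag, U_MAX), 0.0])
    la, lb = np.zeros(2), np.zeros(2)
    for _ in range(steps):
        pa, pb = softmax(la), softmax(lb)
        qa = PAYOFF @ pb + bonus       # steered expected payoff per action
        qb = PAYOFF @ pa + bonus       # symmetric game: same matrix for B
        la = la + ALPHA * (qa - pa @ qa)   # advantage = Q - V
        lb = lb + ALPHA * (qb - pb @ qb)
    return softmax(la)

print(run(steer_stag=0.0))  # drifts to the risk-dominant Hare equilibrium
print(run(steer_stag=3.0))  # a modest Stag bonus flips convergence to Stag
```

This illustrates why steering is needed at all in Stag Hunt: unsteered NPG agents starting from the uniform policy settle on the risk-dominant (Hare) outcome, while a small bounded incentive redirects the learning dynamics to the payoff-dominant (Stag) equilibrium.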