Learning to Steer Markovian Agents under Model Uncertainty

Authors: Jiawei Huang, Vinzenz Thoma, Zebang Shen, Heinrich Nax, Niao He

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical Validation: "In Sec. 6, we evaluate our algorithms in various representative environments, and demonstrate their effectiveness under model uncertainty."
Researcher Affiliation | Academia | Jiawei Huang, Zebang Shen, Niao He: Department of Computer Science, ETH Zurich; Vinzenz Thoma: ETH AI Center; Heinrich H. Nax: University of Zurich
Pseudocode | Yes | Procedure 1: The Steering Procedure when |F| is Small; Procedure 2: The Steering Procedure when |F| is Large (The FETE Framework); Algorithm 3: Learning with Known Steering Dynamics; Algorithm 4: Solving Obj. (1) by Learning Belief State-Dependent Strategy
Open Source Code | Yes | Reproducibility Statement: "The code of all the experiments in this paper and the instructions for running can be found in https://github.com/jiaweihhuang/Steering_Markovian_Agents."
Open Datasets | No | The paper describes experiments on the Normal-Form Stag Hunt Game, the Grid World Stag Hunt Game, and Matching Pennies, which are game environments or theoretical setups. It does not mention using any external, publicly available datasets with concrete access information such as links or citations to specific datasets.
Dataset Splits | No | The paper does not provide dataset split information (e.g., percentages or sample counts for train/test/validation sets). It mentions initial policy generation for evaluation (e.g., "averaged over 5x5 uniformly distributed grids as initializations of π1"), but this pertains to initial conditions for simulations rather than data splits.
Hardware Specification | Yes | G.5 A Summary of the Compute Resources by Experiments in This Paper: For the experiments on two-player normal-form games, Stag Hunt and Matching Pennies (illustrated in Fig. 1, 5, 6), we only use CPUs (AMD EPYC 7742 64-Core Processor); training takes less than 5 hours. For the experiments on the grid-world version of Stag Hunt (illustrated in Fig. 2), we use one RTX 3090 and fewer than 5 CPUs (AMD EPYC 7742 64-Core Processor).
Software Dependencies | No | The paper mentions software such as the "PPO implementation of Stable Baseline3 (Raffin et al., 2021)" but does not specify version numbers for these software components or libraries, which a reproducible description requires.
Experiment Setup | Yes | Both agents follow the exact NPG (Def. 4.1) with a fixed learning rate α = 0.01. For the steering setup, we choose the total utility as η_goal, and use PPO to train the steering strategy [...]. The maximal steering reward U_max is set to 10, and we choose β = 25. [...] The agents adopt a CNN, and utilize PPO to optimize the CNN parameters with learning rate 0.005. [...] We choose β = 25 and learning rate 0.001. [...] We set U_max = 1.0.
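The setup above can be illustrated with a minimal sketch of exact NPG dynamics in a 2x2 Stag Hunt under a bounded steering reward. Assumptions not taken from the paper: the payoff values, the number of update steps, and the use of a constant per-action Stag bonus as a stand-in for the PPO-trained steering strategy; only α = 0.01 and the U_max cap come from the quoted setup.

```python
import numpy as np

# Hypothetical symmetric Stag Hunt payoffs (row: own action 0=Stag, 1=Hare;
# column: opponent action). Values chosen so that Hare is risk-dominant.
PAYOFF = np.array([[5.0, 0.0],
                   [4.0, 2.0]])
ALPHA = 0.01   # NPG learning rate from the paper's setup
U_MAX = 10.0   # cap on the per-step steering reward (paper: U_max = 10)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run(steer_stag=0.0, steps=20000):
    """Exact NPG for softmax policies in a symmetric 2x2 game: the natural
    gradient step adds ALPHA * advantage directly to the logits.
    `steer_stag` is an extra reward paid for playing Stag, a constant
    stand-in for the learned steering strategy, clipped at U_MAX."""
    bonus = np.array([min(steer_stag, U_MAX), 0.0])
    la, lb = np.zeros(2), np.zeros(2)
    for _ in range(steps):
        pa, pb = softmax(la), softmax(lb)
        qa = PAYOFF @ pb + bonus       # steered expected payoff per action
        qb = PAYOFF @ pa + bonus       # symmetric game: same matrix for B
        la = la + ALPHA * (qa - pa @ qa)   # advantage = Q - V
        lb = lb + ALPHA * (qb - pb @ qb)
    return softmax(la)

print(run(steer_stag=0.0))  # drifts to the risk-dominant Hare equilibrium
print(run(steer_stag=3.0))  # a modest Stag bonus flips convergence to Stag
```

This illustrates why steering is needed at all in Stag Hunt: unsteered NPG agents starting from the uniform policy settle on the risk-dominant (Hare) outcome, while a small bounded incentive redirects the learning dynamics to the payoff-dominant (Stag) equilibrium.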