Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Steady-State Planning in Expected Reward Multichain MDPs
Authors: George K. Atia, Andre Beckus, Ismail Alkhouri, Alvaro Velasquez
JAIR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify the theoretical findings of our work using a comprehensive set of numerical experiments performed in various environments. The results demonstrate the correctness of the proposed LPs in yielding policies with provably correct behavior and the scalability of the proposed solutions to large problem sizes. |
| Researcher Affiliation | Collaboration | George K. Atia EMAIL Department of Electrical and Computer Engineering Department of Computer Science University of Central Florida, FL 32816, USA Andre Beckus EMAIL Air Force Research Laboratory, NY 13441, USA Ismail Alkhouri EMAIL Department of Electrical and Computer Engineering University of Central Florida, FL 32816, USA Alvaro Velasquez EMAIL Air Force Research Laboratory, NY 13441, USA |
| Pseudocode | Yes | Algorithm 1: Generation of a policy π ∈ Π_CPU. Input: LMDP M with specifications Φ_L. Output: Stationary policy π ∈ Π_CPU which satisfies Φ_L. Determine the TSCCs R_k(M), k ∈ [m], of M; isSConnected = False, C = {}, A = {}; while isSConnected = False do: solve LP (22) with constraint (23) to get optimal values x*_sa, y*_sa, (s, a) ∈ S × A(s); compute the support E+_k(x*) of each TSCC corresponding to x* (see Theorem 5); if the digraph (V+_k(x*), E+_k(x*)) forms an SCC for every k ∈ [m], then compute π using (14) and set isSConnected = True; else find a cut and update C and A; end if; end while. |
| Open Source Code | No | No explicit statement or link for open-source code for the methodology described in this paper was found. |
| Open Datasets | Yes | This Frozen Island scenario is motivated by that found in OpenAI Gym's Frozen Lake environment (Brockman et al., 2016). We also experiment with random MDPs constructed from n-node directed Gaussian partition graphs generated using the NetworkX toolbox (Hagberg et al., 2008). |
| Dataset Splits | No | The paper uses Markov Decision Process (MDP) environments (Frozen Island, Toll Collector, Gaussian partition graphs) for numerical experiments, which are simulation-based and do not typically involve explicit training/test/validation dataset splits in the conventional machine learning sense. Instead, they involve simulating policies within these environments and measuring performance metrics. |
| Hardware Specification | No | The first set of experiments are performed on a standard desktop with 16GB of RAM using the Matlab CVX package for convex optimization (Grant & Boyd, 2014, 2008). We also perform a second set of experiments on a standard desktop with 128GB of RAM using the commercial CPLEX Optimizer, which provides a higher-precision mathematical solver for large-scale linear programming. |
| Software Dependencies | Yes | The first set of experiments are performed on a standard desktop with 16GB of RAM using the Matlab CVX package for convex optimization (Grant & Boyd, 2014, 2008). We also perform a second set of experiments on a standard desktop with 128GB of RAM using the commercial CPLEX Optimizer, which provides a higher-precision mathematical solver for large-scale linear programming. We run simulations of the LP in (22), LP1 (with the positivity constraints) and LP2 (with the flow constraints) for random instances of the Frozen Islands problem. The runtime results are reported in Table 5. For a 64 × 64 and a 128 × 128 grid, LP2 (the most complex) is solved in about 20 seconds and 15 minutes, respectively, demonstrating the effectiveness of the developed formulations even for MDPs with over ten thousand states. ... using the CPLEX 12.8 solver. |
| Experiment Setup | Yes | For the first small island we have steady-state specifications (L_log1, [0.25, 1]), (L_log2, [0.25, 1]), (L_canoe1, [0.05, 1.0]) and reward R(·, ·, L_fish1) = R(·, ·, L_fish2) = 1. Likewise, the second small island has steady-state specifications (L_canoe2, [0.05, 1.0]), (L_fish1, [0.1, 1.0]), (L_fish2, [0.1, 1.0]) and reward R(·, ·, S \ (L_fish1 ∪ L_fish2)) = 0. Because the islands are covered in ice, the agent has a chance of slipping in three possible directions whenever it moves. Specifically, if the agent attempts to go right (left), it has a 90% chance of transitioning to the right (left), and there is a 5% chance of transitioning instead to either of the states above or below it. Similarly, if the agent tries to go up (down), it moves to the states above (below) it with 90% chance, and to the states to the right and left of it with chance 5% each. The constant N_tr is used to control the time of entry into the recurrent sets. Therefore, in our experiments we have set ϵ = 10^-4. |
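The Pseudocode row quotes Algorithm 1, an iterative cut-generation loop: repeatedly solve an LP, check whether the supports of the candidate solution form strongly connected components, and either extract a policy or add a cut. A minimal, structure-only sketch in Python follows; every callable (`solve_lp`, `supports_form_sccs`, `policy_from`, `find_cut`) is a hypothetical stand-in for, respectively, LP (22) with constraint (23), the Theorem 5 support check, policy construction via (14), and the cut-finding step. This is not the authors' implementation.

```python
def synthesize_policy(solve_lp, supports_form_sccs, policy_from, find_cut):
    """Sketch of Algorithm 1's outer loop (all callables are hypothetical).

    solve_lp(cuts)          -> candidate LP solution given accumulated cuts
    supports_form_sccs(x)   -> True if every TSCC support digraph is an SCC
    policy_from(x)          -> stationary policy computed from the solution
    find_cut(x)             -> a new cut to add when connectivity fails
    """
    cuts = []  # plays the role of the sets C and A accumulated in the paper
    while True:
        x = solve_lp(cuts)            # solve the LP under current cuts
        if supports_form_sccs(x):     # connectivity check on the supports
            return policy_from(x)     # done: extract the policy
        cuts.append(find_cut(x))      # otherwise tighten and re-solve
```

The loop mirrors the paper's `isSConnected` flag implicitly: the function returns as soon as the connectivity check passes.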
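The slip dynamics quoted in the Experiment Setup row (90% chance of moving in the intended direction, 5% for each of the two perpendicular directions) can be sketched as a small transition model. The grid representation, boundary clipping, and all function names below are assumptions for illustration, not the authors' code.

```python
import numpy as np

# Grid moves as (row, col) offsets; each action's two perpendicular slips.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def slip_distribution(action):
    """Return {effective_move: probability} under the 90/5/5 slip model."""
    a, b = PERP[action]
    return {action: 0.90, a: 0.05, b: 0.05}

def step(state, action, grid_shape, rng):
    """Sample the next state on a grid, clipping moves at the boundary."""
    dist = slip_distribution(action)
    moves = list(dist)
    idx = rng.choice(len(moves), p=[dist[m] for m in moves])
    dr, dc = ACTIONS[moves[idx]]
    r = min(max(state[0] + dr, 0), grid_shape[0] - 1)
    c = min(max(state[1] + dc, 0), grid_shape[1] - 1)
    return (r, c)
```

For example, attempting "down" from the corner (0, 0) of a 4 × 4 grid can only reach (1, 0) (intended), (0, 1) (slip right), or stay at (0, 0) (slip left, clipped at the boundary).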