Learning a Fast Mixing Exogenous Block MDP using a Single Trajectory

Authors: Alexander Levine, Peter Stone, Amy Zhang

ICLR 2025

Reproducibility assessment: Variable | Result | LLM Response
Research Type | Experimental | "We prove that STEEL is correct and sample-efficient, and demonstrate STEEL on two toy problems. Code is available at: https://github.com/midi-lab/steel. ... We test the STEEL algorithm on two toy problems: an infinite-horizon environment inspired by the combination lock environment from Efroni et al. (2022b), and a version of the multi-maze environment from Lamb et al. (2023). ... Results are shown in Table 2."
Researcher Affiliation | Collaboration | Alexander Levine (The University of Texas at Austin); Peter Stone (The University of Texas at Austin, Sony AI); Amy Zhang (The University of Texas at Austin)
Pseudocode | Yes | "The STEEL algorithm is presented here in full as Algorithm 1, with a major subroutine, Cycle Find, split out as Algorithm 2." (Appendix A, "Full Algorithm")
Open Source Code | Yes | "Code is available at: https://github.com/midi-lab/steel."
Open Datasets | No | "We test the STEEL algorithm on two toy problems: an infinite-horizon environment inspired by the combination lock environment from Efroni et al. (2022b), and a version of the multi-maze environment from Lamb et al. (2023)."
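Since the toy environments are only described by reference above, a minimal sketch of a combination-lock-style Ex-BMDP may help make the setup concrete. All details here (reset-on-wrong-action dynamics, the number of appended noise bits, function names) are assumptions for illustration, not the paper's exact environment:

```python
import numpy as np

def make_combination_lock(L=512, noise_bits=4, seed=0):
    """Hypothetical combination-lock-style environment sketch.

    Latent state s is in {0, ..., L-1}; actions are {0, 1}. Taking the
    hidden 'correct' action at state s advances the lock to s+1 (mod L);
    any other action resets it to state 0. Observations append i.i.d.
    exogenous noise bits to the latent state.
    """
    rng = np.random.default_rng(seed)
    correct = rng.integers(0, 2, size=L)  # hidden correct action per latent state

    def step(s, a):
        # Deterministic endogenous dynamics: advance on correct action, else reset.
        s_next = (s + 1) % L if a == correct[s] else 0
        # Exogenous i.i.d. noise appended to the observation.
        noise = rng.integers(0, 2, size=noise_bits)
        obs = np.concatenate(([s_next], noise))
        return s_next, obs

    return step, correct
```

The point of the construction is that the endogenous latent dynamics are deterministic and small, while the observation carries exogenous noise the learner must ignore.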
Dataset Splits | No | "When assessing the minimum-range minimal loss encoder as in Equation 140, Levine et al. (2024) include an empirical fudge factor: their protocol returns the minimum-range encoder that achieves a loss within 0.1% of the true minimum loss over all possible encoders. We found that even without this fudge factor, the multistep-inverse methods were still heavily biased towards returning ϕSR, when applied either to Single Prime(p, q) or to Double Prime(p, q) with too-small K or too-few samples. ... To better match the spirit of learning an Ex-BMDP from a single trajectory, we collect the fitting and optimization sets from one single trajectory, one after the other, with each corresponding to half of the trajectory: we regard the total sample complexity as the length of the entire trajectory."
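The single-trajectory split protocol quoted above is simple enough to sketch directly. The function name and the convention that the fitting set comes first are assumptions; the excerpt specifies only that the two sets are consecutive halves of one trajectory and that the full length counts toward sample complexity:

```python
def split_single_trajectory(trajectory):
    """Split one trajectory into consecutive fitting and optimization halves.

    Both halves come from the same single trajectory, one after the other,
    and the total sample complexity is the length of the entire trajectory.
    """
    n = len(trajectory)
    fitting = trajectory[: n // 2]       # first half: fit the encoder
    optimization = trajectory[n // 2 :]  # second half: select among encoders
    sample_complexity = n                # the whole trajectory is "spent"
    return fitting, optimization, sample_complexity
```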
Hardware Specification | No | The paper describes simulation experiments but does not provide specific hardware details such as GPU/CPU models or other computing infrastructure used for running these simulations.
Software Dependencies | No | The paper does not mention specific software dependencies with version numbers, such as programming languages or libraries (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | "For all experiments, we set δ = ϵ = 0.05. For the combination lock experiments, we set L = 512, and use the (intentionally loose) upper bounds N = ˆD = K + 10 (= |S| + 10) and ˆtmix = 40. For the multi-maze environment, we use N = ˆD = 80 (> |S| = 68) and ˆtmix = 300. See Appendix D for how we chose the (loose) bounds ˆtmix ≥ tmix. ... For STEEL, we set N = ˆD = 2q (to match the maximum number of states in Double Prime(p, q) or Single Prime(p, q)); ˆtmix = 1 (because neither Double Prime(p, q) nor Single Prime(p, q) has time-correlated noise); ϵ = 0.49 (because, due to the simple emission distributions Q of Single Prime and Double Prime, a 51% encoder accuracy on each latent state implies a 100% encoder accuracy); and δ = 0.05."
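The hyperparameter settings quoted above can be collected into plain configuration dictionaries for reference. The key names below are hypothetical; only the values and the parenthetical justifications come from the excerpt:

```python
# Settings from the quoted experiment setup; key names are illustrative.
COMBINATION_LOCK_CONFIG = {
    "delta": 0.05,
    "epsilon": 0.05,
    "L": 512,
    "N_offset": 10,    # N = D_hat = K + 10 (= |S| + 10), intentionally loose
    "t_mix_hat": 40,
}

MULTI_MAZE_CONFIG = {
    "delta": 0.05,
    "epsilon": 0.05,
    "N": 80,           # N = D_hat = 80 > |S| = 68
    "D_hat": 80,
    "t_mix_hat": 300,
}

PRIME_ENV_STEEL_CONFIG = {
    "delta": 0.05,
    "epsilon": 0.49,   # 51% per-state encoder accuracy implies 100% accuracy here
    "t_mix_hat": 1,    # neither Prime environment has time-correlated noise
    # N = D_hat = 2q, matching the maximum number of latent states
    # in Double Prime(p, q) or Single Prime(p, q).
}
```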