Learning a Fast Mixing Exogenous Block MDP using a Single Trajectory
Authors: Alexander Levine, Peter Stone, Amy Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that STEEL is correct and sample-efficient, and demonstrate STEEL on two toy problems. Code is available at: https://github.com/midi-lab/steel. SIMULATION EXPERIMENTS We test the STEEL algorithm on two toy problems: an infinite-horizon environment inspired by the combination lock environment from Efroni et al. (2022b), and a version of the multi-maze environment from Lamb et al. (2023). ... Results are shown in Table 2. |
| Researcher Affiliation | Collaboration | Alexander Levine (The University of Texas at Austin); Peter Stone (The University of Texas at Austin; Sony AI); Amy Zhang (The University of Texas at Austin) |
| Pseudocode | Yes | A FULL ALGORITHM The STEEL algorithm is presented here in full as Algorithm 1, with a major subroutine, Cycle Find, split out as Algorithm 2. |
| Open Source Code | Yes | Code is available at: https://github.com/midi-lab/steel. |
| Open Datasets | No | We test the STEEL algorithm on two toy problems: an infinite-horizon environment inspired by the combination lock environment from Efroni et al. (2022b), and a version of the multi-maze environment from Lamb et al. (2023). |
| Dataset Splits | No | When assessing the minimum-range minimal loss encoder as in Equation 140, Levine et al. (2024) include an empirical fudge factor: their protocol returns the minimum-range encoder that achieves a loss within 0.1% of the true minimum loss over all possible encoders. We found that even without this fudge factor, the multistep-inverse methods were still heavily biased towards returning ϕSR, when applied either to Single Prime(p, q); or to Double Prime(p, q) with too-small K or too-few samples. ... To better match the spirit of learning an Ex-BMDP from a single trajectory, we collect the fitting and optimization sets from one single trajectory, one after the other, with each corresponding to half of the trajectory: we regard the total sample complexity as the length of the entire trajectory. |
| Hardware Specification | No | The paper describes simulation experiments but does not provide specific hardware details such as GPU/CPU models or other computing infrastructure used for running these simulations. |
| Software Dependencies | No | The paper does not mention specific software dependencies with version numbers, such as programming languages or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For all experiments, we set δ = ϵ = 0.05. For the combination lock experiments, we set L = 512, and use the (intentionally loose) upper bounds N = D̂ = K + 10 (= \|S\| + 10) and t̂mix = 40. For the multi-maze environment, we use N = D̂ = 80 (> \|S\| = 68), and t̂mix = 300. See Appendix D for how we chose the (loose) bound t̂mix ≥ tmix. For STEEL, we set N = D̂ = 2q (to match the maximum number of states in Double Prime(p, q) or Single Prime(p, q)); t̂mix = 1 (because neither Double Prime(p, q) nor Single Prime(p, q) have time-correlated noise); ϵ = 0.49 (because, due to the simple emission distributions Q of Single Prime and Double Prime, a 51% encoder accuracy on each latent state implies a 100% encoder accuracy); and δ = 0.05. |
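
The per-environment hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustrative reconstruction only: the function and variable names (`steel_config`, `D_hat`, `t_mix_hat`) are hypothetical and not taken from the authors' released code; the numeric values are those quoted from the paper.

```python
def steel_config(env: str, num_states: int) -> dict:
    """Hypothetical builder for the quoted STEEL experiment settings.

    `num_states` is |S|, the number of latent endogenous states.
    Names are illustrative; values follow the paper's reported setup.
    """
    # Shared across the two toy-problem experiments: delta = epsilon = 0.05.
    cfg = {"delta": 0.05, "epsilon": 0.05}
    if env == "combination_lock":
        # Intentionally loose upper bounds: N = D_hat = |S| + 10, t_mix_hat = 40;
        # the paper also sets L = 512 for these experiments.
        cfg.update(N=num_states + 10, D_hat=num_states + 10, t_mix_hat=40, L=512)
    elif env == "multi_maze":
        # N = D_hat = 80 (strictly greater than |S| = 68), t_mix_hat = 300.
        cfg.update(N=80, D_hat=80, t_mix_hat=300)
    else:
        raise ValueError(f"unknown environment: {env}")
    return cfg

maze_cfg = steel_config("multi_maze", num_states=68)
assert maze_cfg["N"] > 68  # the bound exceeds the true latent state count
```

A config for the Single Prime(p, q) / Double Prime(p, q) comparison would differ as quoted in the table (N = D̂ = 2q, t̂mix = 1, ϵ = 0.49, δ = 0.05), since those environments have no time-correlated noise.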