Stable Offline Value Function Learning with Bisimulation-based Representations
Authors: Brahma S Pavse, Yudong Chen, Qiaomin Xie, Josiah P. Hanna
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate that KROPE representations improve the stability and accuracy of offline value function learning algorithms on 10/13 offline datasets and against 7 baselines (Section 4). Our empirical analysis also shows that KROPE is robust to hyperparameter tuning, which is a critical property for practical offline policy evaluation. We empirically analyze the sensitivity of the KROPE learning procedure under the deadly triad. These experiments shed light on when representation learning may be easier than value function learning (Section 4.4). |
| Researcher Affiliation | Academia | University of Wisconsin–Madison. Correspondence to: Brahma S. Pavse <EMAIL>. |
| Pseudocode | Yes | In Appendix A, we include the pseudocode for LSPE. The KROPE learning algorithm uses an encoder ϕω : S × A → Rd, which is parameterized by weights ω of a function approximator. We include the pseudocode of KROPE in Appendix A. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its own source code, nor does it provide a direct link to a repository containing the implementation of KROPE or the experiments described. Mentions of GitHub links refer to third-party policies or environments used for dataset generation. |
| Open Datasets | Yes | Domains We conduct our evaluation on a variety of domains: 1) Garnet MDPs, which are a class of tabular stochastic MDPs that are randomly generated given a fixed number of states and actions (Archibald et al., 1995); 2) 4 DM Control environments: Cart Pole Swing Up, Cheetah Run, Finger Easy, Walker Stand (Tassa et al., 2018); and 3) 9 D4RL datasets (Fu et al., 2020; 2021). |
| Dataset Splits | No | The paper describes using a "fixed dataset of m transition tuples D" for offline policy evaluation. It details how custom datasets were generated and the size (e.g., "100K transitions"), but it does not specify explicit training, validation, or test splits from these datasets for the models being learned. The evaluation is typically performed on the dataset D itself, comparing estimated value functions to true value functions. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only mentions general training details for neural networks. |
| Software Dependencies | No | The paper mentions software components like the "Adam optimizer" and "gym" environments, but it does not specify version numbers for any of these components or other libraries/frameworks (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | General Training Details. In all the continuous state-action experiments, we use a neural network with 1 layer and 1024 neurons using the ReLU activation function and LayerNorm to represent the encoder ϕ : X → Rd (Gallici et al., 2025). We use mini-batch gradient descent to train the network with mini-batch sizes of 2048 and for 500 epochs, where a single epoch is a pass over the full dataset. We use the Adam optimizer with learning rate ∈ {1e-5, 2e-5, 5e-5} and weight decay 1e-2. The target network is updated with a hard update after every epoch. The output dimension d ∈ {\|X\|/4, \|X\|/2, 3\|X\|/4}, where \|X\| is the dimension of the original state-action space of the environment. All our results involve analyzing this learned ϕ. Since FQE outputs a scalar, we add a linear layer on top of the d-dimensional vector to output a scalar. The entire network is then trained end-to-end. The discount factor is γ = 0.99. The auxiliary task weight with FQE for all representation learning algorithms is α = 0.1. When using LSPE for OPE, we invert the covariance matrix by computing the pseudoinverse. In the tabular environments, we use a similar setup as above. The only changes are that we use a linear network with a bias component but no activation function and fix the learning rate to be 1e-3. For the experiment in Section 4.4, α = 0.8. |
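The encoder architecture quoted above (one 1024-unit hidden layer with LayerNorm and ReLU, a linear projection to a d-dimensional representation, and a scalar linear head for FQE) can be sketched as follows. This is a minimal numpy illustration of the described shapes, not the authors' implementation; the weight initialization, the LayerNorm/ReLU ordering, and the names `make_encoder` / `make_fqe_head` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(h, eps=1e-5):
    # Per-sample normalization over the feature axis (no learned affine terms here).
    return (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)

def make_encoder(in_dim, d, hidden=1024):
    # One hidden layer of 1024 units with LayerNorm and ReLU, then a linear
    # projection to the d-dimensional representation phi(x).
    W1 = rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, d))
    def phi(x):
        return np.maximum(layer_norm(x @ W1 + b1), 0.0) @ W2
    return phi

def make_fqe_head(d):
    # FQE head: a single linear layer mapping the d-dim representation to a scalar.
    w = rng.normal(0.0, 1.0 / np.sqrt(d), d)
    return lambda z: z @ w

in_dim = 24
d = in_dim // 2                     # e.g. d = |X|/2, one of the paper's choices
phi = make_encoder(in_dim, d)
q_hat = make_fqe_head(d)
batch = rng.normal(size=(2048, in_dim))  # mini-batch size 2048, as in the paper
z = phi(batch)                           # shape (2048, 12)
v = q_hat(z)                             # shape (2048,)
```

In the paper's setup this whole stack is trained end-to-end with Adam; here the forward pass is shown only to make the layer shapes concrete.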
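The setup also states that when LSPE is used for OPE, the covariance matrix is inverted via the pseudoinverse. A minimal sketch of least-squares policy evaluation on fixed features, assuming the standard LSPE iteration (repeated least-squares regression onto one-step Bellman targets); the function name and the sample-averaged covariance convention are assumptions, not taken from the paper:

```python
import numpy as np

def lspe(Phi, Phi_next, r, gamma=0.99, iters=200):
    """Least-squares policy evaluation on fixed features.

    Phi:      (m, d) features of the sampled state-action pairs
    Phi_next: (m, d) features of the successor pairs under the target policy
    r:        (m,)   observed rewards
    The covariance Phi^T Phi is inverted with the Moore-Penrose
    pseudoinverse, so rank-deficient feature matrices are handled.
    """
    m, d = Phi.shape
    cov_pinv = np.linalg.pinv(Phi.T @ Phi / m)
    w = np.zeros(d)
    for _ in range(iters):
        # Regress onto the one-step Bellman targets r + gamma * Phi' w.
        target = r + gamma * (Phi_next @ w)
        w = cov_pinv @ (Phi.T @ target / m)
    return w
```

With tabular (one-hot) features this iteration reduces to repeated application of the Bellman operator, so it converges to the true value function of the evaluated policy.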