Stable Offline Value Function Learning with Bisimulation-based Representations
Authors: Brahma S Pavse, Yudong Chen, Qiaomin Xie, Josiah P. Hanna
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate that KROPE representations improve the stability and accuracy of offline value function learning algorithms on 10/13 offline datasets and against 7 baselines (Section 4). Our empirical analysis also shows that KROPE is robust to hyperparameter tuning, which is a critical property for practical offline policy evaluation. We empirically analyze the sensitivity of the KROPE learning procedure under the deadly triad. These experiments shed light on when representation learning may be easier than value function learning (Section 4.4). |
| Researcher Affiliation | Academia | University of Wisconsin–Madison. Correspondence to: Brahma S. Pavse <EMAIL>. |
| Pseudocode | Yes | In Appendix A, we include the pseudocode for LSPE. The KROPE learning algorithm uses an encoder ϕω : S × A → Rd, which is parameterized by weights ω of a function approximator. We include the pseudocode of KROPE in Appendix A. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its own source code, nor does it provide a direct link to a repository containing the implementation of KROPE or the experiments described. Mentions of GitHub links refer to third-party policies or environments used for dataset generation. |
| Open Datasets | Yes | Domains We conduct our evaluation on a variety of domains: 1) Garnet MDPs, which are a class of tabular stochastic MDPs that are randomly generated given a fixed number of states and actions (Archibald et al., 1995); 2) 4 DM Control environments: Cart Pole Swing Up, Cheetah Run, Finger Easy, Walker Stand (Tassa et al., 2018); and 3) 9 D4RL datasets (Fu et al., 2020; 2021). |
| Dataset Splits | No | The paper describes using a "fixed dataset of m transition tuples D" for offline policy evaluation. It details how custom datasets were generated and the size (e.g., "100K transitions"), but it does not specify explicit training, validation, or test splits from these datasets for the models being learned. The evaluation is typically performed on the dataset D itself, comparing estimated value functions to true value functions. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only mentions general training details for neural networks. |
| Software Dependencies | No | The paper mentions software components like the "Adam optimizer" and "gym" environments, but it does not specify version numbers for any of these components or other libraries/frameworks (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | General Training Details. In all the continuous state-action experiments, we use a neural network with 1 layer and 1024 neurons using the ReLU activation function and LayerNorm to represent the encoder ϕ : X → Rd (Gallici et al., 2025). We use mini-batch gradient descent to train the network with mini-batch sizes of 2048 and for 500 epochs, where a single epoch is a pass over the full dataset. We use the Adam optimizer with learning rate ∈ {1e-5, 2e-5, 5e-5} and weight decay 1e-2. The target network is updated with a hard update after every epoch. The output dimension d ∈ {\|X\|/4, \|X\|/2, 3\|X\|/4}, where \|X\| is the dimension of the original state-action space of the environment. All our results involve analyzing this learned ϕ. Since FQE outputs a scalar, we add a linear layer on top of the d-dimensional vector to output a scalar. The entire network is then trained end-to-end. The discount factor is γ = 0.99. The auxiliary task weight with FQE for all representation learning algorithms is α = 0.1. When using LSPE for OPE, we invert the covariance matrix by computing the pseudoinverse. In the tabular environments, we use a similar setup as above. The only changes are that we use a linear network with a bias component but no activation function and fix the learning rate to be 1e-3. For the experiment in Section 4.4, α = 0.8. |
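The encoder architecture quoted above (one 1024-unit hidden layer with LayerNorm and ReLU, a linear projection to a d-dimensional representation, and a scalar linear head for FQE) can be sketched as follows. This is a minimal numpy illustration of the described shapes, not the authors' implementation; the weight initialization, the LayerNorm/ReLU ordering, and the names `make_encoder` / `make_fqe_head` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(h, eps=1e-5):
    # Per-sample normalization over the feature axis (no learned affine terms here).
    return (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)

def make_encoder(in_dim, d, hidden=1024):
    # One hidden layer of 1024 units with LayerNorm and ReLU, then a linear
    # projection to the d-dimensional representation phi(x).
    W1 = rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, d))
    def phi(x):
        return np.maximum(layer_norm(x @ W1 + b1), 0.0) @ W2
    return phi

def make_fqe_head(d):
    # FQE head: a single linear layer mapping the d-dim representation to a scalar.
    w = rng.normal(0.0, 1.0 / np.sqrt(d), d)
    return lambda z: z @ w

in_dim = 24
d = in_dim // 2                     # e.g. d = |X|/2, one of the paper's choices
phi = make_encoder(in_dim, d)
q_hat = make_fqe_head(d)
batch = rng.normal(size=(2048, in_dim))  # mini-batch size 2048, as in the paper
z = phi(batch)                           # shape (2048, 12)
v = q_hat(z)                             # shape (2048,)
```

In the paper's setup this whole stack is trained end-to-end with Adam; here the forward pass is shown only to make the layer shapes concrete.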
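The setup also states that when LSPE is used for OPE, the covariance matrix is inverted via the pseudoinverse. A minimal sketch of least-squares policy evaluation on fixed features, assuming the standard LSPE iteration (repeated least-squares regression onto one-step Bellman targets); the function name and the sample-averaged covariance convention are assumptions, not taken from the paper:

```python
import numpy as np

def lspe(Phi, Phi_next, r, gamma=0.99, iters=200):
    """Least-squares policy evaluation on fixed features.

    Phi:      (m, d) features of the sampled state-action pairs
    Phi_next: (m, d) features of the successor pairs under the target policy
    r:        (m,)   observed rewards
    The covariance Phi^T Phi is inverted with the Moore-Penrose
    pseudoinverse, so rank-deficient feature matrices are handled.
    """
    m, d = Phi.shape
    cov_pinv = np.linalg.pinv(Phi.T @ Phi / m)
    w = np.zeros(d)
    for _ in range(iters):
        # Regress onto the one-step Bellman targets r + gamma * Phi' w.
        target = r + gamma * (Phi_next @ w)
        w = cov_pinv @ (Phi.T @ target / m)
    return w
```

With tabular (one-hot) features this iteration reduces to repeated application of the Bellman operator, so it converges to the true value function of the evaluated policy.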