Recurrent networks, hidden states and beliefs in partially observable environments
Authors: Gaspard Lambrechts, Adrien Bolland, Damien Ernst
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we show empirically that recurrent neural networks trained to approximate such value functions internally filter the posterior probability distribution of the current state given the history, called the belief. More precisely, we show that, as a recurrent neural network learns the Q-function, its hidden states become more and more correlated with the beliefs of state variables that are relevant to optimal control. We conduct this investigation by studying the performance of the different agents with regard to the mutual information (MI) between their hidden states and the belief. Section 4 displays the main results obtained for the previously mentioned POMDPs. |
| Researcher Affiliation | Academia | Gaspard Lambrechts EMAIL Montefiore Institute, University of Liège; Adrien Bolland EMAIL Montefiore Institute, University of Liège; Damien Ernst EMAIL Montefiore Institute, University of Liège and LTCI, Telecom Paris, Institut Polytechnique de Paris. All listed institutions are academic. |
| Pseudocode | Yes | The DRQN training procedure is detailed in Algorithm 1. This process, illustrated in Algorithm 2, guarantees that the successive sets S_0, …, S_H have (weighted) samples following the probability distributions b_0, …, b_H defined by equation (8). The MINE algorithm proposes to maximise i_ϕ(X; Y) by stochastic gradient ascent over batches from the two sets of samples, as detailed in Algorithm 3. |
| Open Source Code | No | The information is not present in the paper. There are no explicit statements about releasing source code or links to a code repository for the methodology described. |
| Open Datasets | Yes | We focus on POMDPs for which the models are known. The benchmark problems chosen are the T-Maze environments (Bakker, 2001) and the Mountain Hike environments (Igl et al., 2018). These are standard and well-cited environments. |
| Dataset Splits | No | The paper describes reinforcement learning environments (T-Maze and Mountain Hike) where data is generated through interaction, rather than using fixed datasets with predefined train/test/validation splits. Therefore, the concept of specific dataset splits in the traditional supervised learning sense is not applicable or provided. |
| Hardware Specification | No | Computational resources have been provided by the Consortium des Équipements de Calcul Intensif (CÉCI), funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under Grant No. 2.5020.11 and by the Walloon Region. This statement mentions the computing facility but does not provide specific hardware details like GPU/CPU models, processors, or memory specifications. |
| Software Dependencies | No | The parameters θ are updated with the Adam algorithm (Kingma & Ba, 2014). The paper mentions the Adam optimizer but does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in the implementation. |
| Experiment Setup | Yes | The hyperparameters of the DRQN algorithm are given in Table 1 and the hyperparameters of the MINE algorithm are given in Table 2. These tables specify values for RNN layers, hidden state size, replay buffer capacity, target update period, exploration rate, batch size, Adam learning rate, number of epochs, etc. |
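The belief that the paper's Algorithm 2 tracks with weighted samples is the exact Bayes filter of a POMDP: b'(s') ∝ O(o | s') Σ_s T(s' | s, a) b(s). A minimal sketch of that exact update for a discrete toy POMDP (the transition and observation matrices below are hypothetical illustrative numbers, not the paper's T-Maze or Mountain Hike models):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Exact Bayes filter for a discrete POMDP:
    b'(s') ∝ O(o | s') * sum_s T(s' | s, a) * b(s)."""
    predicted = T[a].T @ b          # prediction step: sum_s T(s'|s,a) b(s)
    unnorm = O[:, o] * predicted    # correction step: weight by P(o | s')
    return unnorm / unnorm.sum()    # renormalise to a distribution

# Hypothetical 2-state, 1-action, 2-observation POMDP (toy numbers)
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])       # T[a][s][s']
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])         # O[s'][o]

b0 = np.array([0.5, 0.5])          # uniform prior over states
b1 = belief_update(b0, a=0, o=1, T=T, O=O)
```

Recurrent value networks receive only (a, o) pairs, so the claim studied in the paper is that a trained RNN's hidden state comes to encode the same information as `b1` computed above.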
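The MINE estimator used to measure the hidden-state/belief correlation maximises the Donsker-Varadhan lower bound I(X; Y) ≥ E_{p(x,y)}[T(x, y)] − log E_{p(x)p(y)}[exp(T(x, y))] over a learned statistics network T_ϕ. A minimal numpy sketch of that bound with a hand-picked (not learned) critic on a correlated Gaussian pair, chosen because the true MI is known in closed form; the critic and all constants below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def dv_bound(T, x_joint, y_joint, x_marg, y_marg):
    """Donsker-Varadhan lower bound that MINE maximises:
    I(X; Y) >= E_p(x,y)[T(x, y)] - log E_p(x)p(y)[exp(T(x, y))]."""
    joint_term = np.mean(T(x_joint, y_joint))
    marg_term = np.log(np.mean(np.exp(T(x_marg, y_marg))))
    return joint_term - marg_term

rng = np.random.default_rng(0)
n, rho = 5000, 0.8
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
y_shuffled = rng.permutation(y)  # shuffling breaks the dependence -> marginal samples

# Hand-picked quadratic critic (hypothetical); MINE learns T_phi by gradient ascent
T = lambda a, b: 0.3 * a * b

mi_lower = dv_bound(T, x, y, x, y_shuffled)
mi_true = -0.5 * np.log(1 - rho**2)  # closed-form MI of a Gaussian pair
```

With a learned critic the bound tightens toward `mi_true`; with this fixed critic it is only a loose lower estimate, which is enough to illustrate why the paper optimises T_ϕ by stochastic gradient ascent (Algorithm 3).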