Recurrent networks, hidden states and beliefs in partially observable environments

Authors: Gaspard Lambrechts, Adrien Bolland, Damien Ernst

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we show empirically that recurrent neural networks trained to approximate such value functions internally filter the posterior probability distribution of the current state given the history, called the belief. More precisely, we show that, as a recurrent neural network learns the Q-function, its hidden states become increasingly correlated with the beliefs of the state variables that are relevant to optimal control. This is investigated by studying the performance of the different agents with regard to the mutual information (MI) between their hidden states and the belief. Section 4 presents the main results obtained for the aforementioned POMDPs.
Researcher Affiliation | Academia | Gaspard Lambrechts EMAIL, Montefiore Institute, University of Liège; Adrien Bolland EMAIL, Montefiore Institute, University of Liège; Damien Ernst EMAIL, Montefiore Institute, University of Liège and LTCI, Telecom Paris, Institut Polytechnique de Paris. All listed institutions are academic.
Pseudocode | Yes | The DRQN training procedure is detailed in Algorithm 1. This process, illustrated in Algorithm 2, guarantees that the successive sets S_0, …, S_H have (weighted) samples following the probability distributions b_0, …, b_H defined by equation (8). The MINE algorithm proposes to maximise i_φ(X; Y) by stochastic gradient ascent over batches from the two sets of samples, as detailed in Algorithm 3.
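The quantity that MINE maximises is the Donsker–Varadhan lower bound on the mutual information. The following is a minimal NumPy sketch of that bound, using a fixed, hand-picked critic T(x, y) = 0.5·x·y on correlated Gaussian data purely for illustration; the paper's Algorithm 3 instead trains the critic by stochastic gradient ascent over batches, and the data, critic, and sample size here are all assumptions, not the authors' setup.

```python
import numpy as np

def dv_bound(critic, x, y, rng):
    """Donsker-Varadhan lower bound on I(X; Y):
    E_p(x,y)[T(x, y)] - log E_p(x)p(y)[exp(T(x, y'))].
    Samples from the product of marginals are obtained by shuffling y."""
    joint_term = critic(x, y).mean()
    y_marginal = y[rng.permutation(len(y))]
    marginal_term = np.log(np.exp(critic(x, y_marginal)).mean())
    return joint_term - marginal_term

rng = np.random.default_rng(0)
n, rho = 50_000, 0.9

# Correlated Gaussian pair (X, Y) with correlation rho.
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Illustrative fixed critic; MINE would learn T as a neural network.
critic = lambda a, b: 0.5 * a * b

bound_corr = dv_bound(critic, x, y, rng)
bound_indep = dv_bound(critic, x, rng.standard_normal(n), rng)
```

With a fixed critic the bound is loose (the true MI here is −0.5·log(1 − rho²) ≈ 0.83 nats), but it is still positive for the correlated pair and drops for independent data, which is the signal MINE exploits when comparing hidden states and beliefs.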
Open Source Code | No | The information is not present in the paper. There are no explicit statements about releasing source code or links to a code repository for the methodology described.
Open Datasets | Yes | We focus on POMDPs for which the models are known. The benchmark problems chosen are the T-Maze environments (Bakker, 2001) and the Mountain Hike environments (Igl et al., 2018). These are standard and well-cited environments.
Dataset Splits | No | The paper uses reinforcement learning environments (T-Maze and Mountain Hike) in which data is generated through interaction, rather than fixed datasets with predefined train/validation/test splits, so dataset splits in the traditional supervised learning sense are not applicable.
Hardware Specification | No | Computational resources have been provided by the Consortium des Équipements de Calcul Intensif (CÉCI), funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under Grant No. 2.5020.11 and by the Walloon Region. This statement names the computing facility but does not provide specific hardware details such as GPU/CPU models, processors, or memory.
Software Dependencies | No | The parameters θ are updated with the Adam algorithm (Kingma & Ba, 2014). The paper mentions the Adam optimizer but does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in the implementation.
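For reference, the Adam update cited above can be sketched in a few lines of NumPy; this is a generic rendering of Kingma & Ba's update rule with its standard default hyperparameters, not the authors' implementation, and the toy objective is an assumption for illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2014): exponential moving averages
    of the gradient (m) and squared gradient (v), with bias correction."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)   # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimise f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.1)
```

Note that `t` starts at 1 so the bias-correction denominators are nonzero on the first step.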
Experiment Setup | Yes | The hyperparameters of the DRQN algorithm are given in Table 1 and the hyperparameters of the MINE algorithm are given in Table 2. These tables specify values for the number of RNN layers, hidden state size, replay buffer capacity, target update period, exploration rate, batch size, Adam learning rate, number of epochs, etc.