Composition and Zero-Shot Transfer with Lattice Structures in Reinforcement Learning

Authors: Geraud Nangue Tasse, Steven James, Benjamin Rosman

JAIR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify our approach in high-dimensional domains including a video game environment and continuous-control task where an agent first learns to solve a set of base tasks, and then composes these solutions to solve a super-exponential number of new tasks. We illustrate our approach in gridworld domains, where an agent first learns to navigate to particular regions, after which it can then optimally solve any task specified as their logical combination. We then demonstrate composition in a high-dimensional video game environment, where an agent first learns to collect different objects, and then compose these abilities to solve complex tasks immediately. We also apply our approach to a high-dimensional, continuous control task, demonstrating applicability to domains with low-level continuous actions. Our results show that, even when function approximation is required, an agent can leverage its existing skills to solve new tasks without further learning.
Researcher Affiliation | Academia | University of the Witwatersrand, Machine Intelligence and Neural Discovery (MIND) Institute, 1 Jan Smuts Avenue, Johannesburg, 2000
Pseudocode | Yes | Algorithm 1: Q-learning for WVFs... Algorithm 2: Goal-oriented learning
Open Source Code | No | The paper does not explicitly state that code for the described methodology is open-source or provide a link to a repository.
Open Datasets | Yes | Consider the Four Rooms domain [Sutton et al., 1999], where an agent must navigate a gridworld to particular rooms. We use the same video game environment as van Niekerk et al. [2019], where the observations are images of the 2D game world and the agent must navigate to collect objects of different shapes and colours. We consider a continuous 3D Four Rooms environment where the ant robot of Duan et al. [2016] must navigate to the center of specific rooms. The environment is simulated in MuJoCo [Todorov et al., 2012].
Dataset Splits | No | The paper mentions evaluating policies over a number of episodes (e.g., "averaging results over 1000 episodes"), which is common in RL, but does not specify traditional training/test/validation dataset splits for a fixed dataset.
Hardware Specification | No | Computations were performed using the High Performance Computing Infrastructure provided by the Mathematical Sciences Support unit at the University of the Witwatersrand.
Software Dependencies | No | The paper mentions using specific optimizers (ADAM) and algorithms (soft actor-critic with automated entropy adjustment), but does not provide version numbers for any software libraries (e.g., PyTorch, TensorFlow) or environments (MuJoCo) used for implementation.
Experiment Setup | Yes | To train the world value functions in the 2D video game environment, we use a neural network with the following architecture: 1. Three convolutional layers: (a) Layer 1 has 6 input channels, 32 output channels, a kernel size of 8 and a stride of 4. (b) Layer 2 has 32 input channels, 64 output channels, a kernel size of 4 and a stride of 2. (c) Layer 3 has 64 input channels, 64 output channels, a kernel size of 3 and a stride of 1. 2. Two fully-connected linear layers: (a) Layer 4 has input size 3136 and output size 512 and uses a ReLU activation function. (b) Layer 5 has input size 512 and output size 5 with no activation function. We use the ADAM optimiser with mini-batch size 32 and a learning rate of 10^-4. We train every 4 timesteps and update the target Q-network every 1000 steps. Finally, we use ϵ-greedy exploration, annealing ϵ from 1 to 0.01 over 100,000 timesteps.
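The flattened input size of 3136 for Layer 4 pins down the spatial arithmetic of the three convolutions: it equals 64 channels × 7 × 7, which is what an 84×84 input image produces under the stated kernels and strides. The 84×84 resolution is an inference, not something the excerpt states. A minimal sanity check of the layer arithmetic, assuming valid (unpadded) convolutions:

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

# (kernel, stride, out_channels) for the three convolutional layers in the excerpt
layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

size = 84  # assumed input resolution (not stated in the paper excerpt)
for kernel, stride, _ in layers:
    size = conv_out(size, kernel, stride)  # 84 -> 20 -> 9 -> 7

channels = layers[-1][2]
flat = channels * size * size
print(flat)  # 64 * 7 * 7 = 3136, matching the stated input size of Layer 4
```

Any other common input resolution (e.g. 64×64 or 96×96) would not yield 3136, which is why 84×84 is the consistent reading.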