Language Agents Meet Causality -- Bridging LLMs and Causal World Models
Authors: John Gkountouras, Matthias Lindemann, Phillip Lippe, Efstratios Gavves, Ivan Titov
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the framework on causal inference and planning tasks across temporal scales and environmental complexities. Our experiments demonstrate the effectiveness of the approach, with the causally-aware method outperforming LLM-based reasoners, especially for longer planning horizons. |
| Researcher Affiliation | Academia | ¹Institute for Logic, Language and Computation (ILLC), University of Amsterdam; ²Institute for Language, Cognition and Computation (ILCC), University of Edinburgh; ³QUVA Lab, University of Amsterdam; ⁴Archimedes/Athena RC, Greece. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Inference with the Causal World Model ... Algorithm 2 Causally-Aware MCTS |
| Open Source Code | Yes | For reproducibility, we publish the code and models to integrate Causal Representation Learning (CRL) with Language Models (LLMs), as well as the scripts to generate data sets used in our experiments, on our code repository: https://github.com/j0hngou/LLWCM/. |
| Open Datasets | Yes | For reproducibility, we publish the code and models to integrate Causal Representation Learning (CRL) with Language Models (LLMs), as well as the scripts to generate data sets used in our experiments, on our code repository: https://github.com/j0hngou/LLWCM/. |
| Dataset Splits | Yes | For each environment, we generated multiple datasets as shown in Table 5. Table 5: Dataset specifications for each environment: Training: 10,000 trajectories of 100 steps (model training); Validation: 1,000 episodes of 100 steps (model validation); Test: 1,000 episodes of 100 steps (final evaluation); ICL: 100 episodes of 100 steps (in-context learning); N-step evaluation: 100 episodes of 100 steps per N value (N-step experiments) |
| Hardware Specification | Yes | In terms of computational resources, all experiments were performed on NVIDIA A100 GPUs. |
| Software Dependencies | No | All models were implemented using PyTorch (Paszke et al., 2019) and PyTorch Lightning (Falcon & The PyTorch Lightning team, 2019). For the Gridworld environment, we implement an autoencoder with 40 latent dimensions and 64 hidden channels. Both the encoder and decoder consist of 2 residual blocks with SiLU activation functions. We incorporate the CoordConv operator (Liu et al., 2018) to better capture coordinate information from images. For the iTHOR environment, we employ the autoencoder architecture from BISCUIT (Lippe et al., 2023). For both the normalizing flow and transition model, we use the same architectures and hyperparameters as in BISCUIT (Lippe et al., 2023) as it has demonstrated strong performance in identifying causal variables from high-dimensional observations. The text encoder for the Gridworld environment is based on a pretrained Sentence Transformer (Reimers & Gurevych, 2019), specifically the all-MiniLM-L6-v2 model, augmented with a 2-layer MLP head with 64 hidden dimensions. For iTHOR, we use a pretrained SigLIP model (Zhai et al., 2023) with a similar 2-layer MLP head. |
| Experiment Setup | Yes | For Gridworld, we use a learning rate of 3×10⁻³ for the main model and 3×10⁻³ for the text MLP, batch size of 384, and train for 300 epochs. For iTHOR, we use a learning rate of 1×10⁻³ for the main model and 3×10⁻³ for the text MLP, batch size of 64, and train for 100 epochs. Both environments employ a warmup period of 100 steps and a sequence length of 2 for training. |
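The experiment-setup row above can be collected into a minimal sketch. The dictionary keys and environment names (`gridworld`, `ithor`, `lr_main`, etc.) are my own labels, and the linear warmup schedule is an assumption (the paper only states a 100-step warmup period, not its shape); the numeric values are the ones quoted in the table.

```python
# Hedged sketch: per-environment training hyperparameters as quoted in the
# reproducibility table. Key names are illustrative, not from the repository.
TRAIN_CONFIGS = {
    "gridworld": {
        "lr_main": 3e-3,       # main model learning rate
        "lr_text_mlp": 3e-3,   # text MLP learning rate
        "batch_size": 384,
        "epochs": 300,
        "warmup_steps": 100,
        "sequence_length": 2,
    },
    "ithor": {
        "lr_main": 1e-3,
        "lr_text_mlp": 3e-3,
        "batch_size": 64,
        "epochs": 100,
        "warmup_steps": 100,
        "sequence_length": 2,
    },
}


def linear_warmup_lr(base_lr: float, step: int, warmup_steps: int) -> float:
    """Scale the learning rate linearly during warmup (assumed schedule),
    then hold it at the base value."""
    if step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps
```

This is only a reading aid for the quoted setup; the authors' actual optimizer and scheduler configuration lives in the published repository.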