Language Agents Meet Causality -- Bridging LLMs and Causal World Models
Authors: John Gkountouras, Matthias Lindemann, Phillip Lippe, Efstratios Gavves, Ivan Titov
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the framework on causal inference and planning tasks across temporal scales and environmental complexities. Our experiments demonstrate the effectiveness of the approach, with the causally-aware method outperforming LLM-based reasoners, especially for longer planning horizons. |
| Researcher Affiliation | Academia | ¹Institute for Logic, Language and Computation (ILLC), University of Amsterdam; ²Institute for Language, Cognition and Computation (ILCC), University of Edinburgh; ³QUVA Lab, University of Amsterdam; ⁴Archimedes/Athena RC, Greece. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Inference with the Causal World Model ... Algorithm 2 Causally-Aware MCTS |
| Open Source Code | Yes | For reproducibility, we publish the code and models to integrate Causal Representation Learning (CRL) with Language Models (LLMs), as well as the scripts to generate data sets used in our experiments, on our code repository: https://github.com/j0hngou/LLWCM/. |
| Open Datasets | Yes | For reproducibility, we publish the code and models to integrate Causal Representation Learning (CRL) with Language Models (LLMs), as well as the scripts to generate data sets used in our experiments, on our code repository: https://github.com/j0hngou/LLWCM/. |
| Dataset Splits | Yes | For each environment, we generated multiple datasets as shown in Table 5. Table 5: Dataset specifications for each environment: Training: 10,000 trajectories of 100 steps (model training); Validation: 1,000 episodes of 100 steps (model validation); Test: 1,000 episodes of 100 steps (final evaluation); ICL: 100 episodes of 100 steps (in-context learning); N-step evaluation: 100 episodes of 100 steps per N value (N-step experiments) |
| Hardware Specification | Yes | In terms of computational resources, all experiments were performed on NVIDIA A100 GPUs. |
| Software Dependencies | No | All models were implemented using PyTorch (Paszke et al., 2019) and PyTorch Lightning (Falcon & The PyTorch Lightning team, 2019). For the Gridworld environment, we implement an autoencoder with 40 latent dimensions and 64 hidden channels. Both the encoder and decoder consist of 2 residual blocks with SiLU activation functions. We incorporate the CoordConv operator (Liu et al., 2018) to better capture coordinate information from images. For the iTHOR environment, we employ the autoencoder architecture from BISCUIT (Lippe et al., 2023). For both the normalizing flow and transition model, we use the same architectures and hyperparameters as in BISCUIT (Lippe et al., 2023) as it has demonstrated strong performance in identifying causal variables from high-dimensional observations. The text encoder for the Gridworld environment is based on a pretrained Sentence Transformer (Reimers & Gurevych, 2019), specifically the all-MiniLM-L6-v2 model, augmented with a 2-layer MLP head with 64 hidden dimensions. For iTHOR, we use a pretrained SigLIP model (Zhai et al., 2023) with a similar 2-layer MLP head. |
| Experiment Setup | Yes | For Gridworld, we use a learning rate of 3×10⁻³ for the main model and 3×10⁻³ for the text MLP, batch size of 384, and train for 300 epochs. For iTHOR, we use a learning rate of 1×10⁻³ for the main model and 3×10⁻³ for the text MLP, batch size of 64, and train for 100 epochs. Both environments employ a warmup period of 100 steps and a sequence length of 2 for training. |
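The experiment-setup row above can be collected into a minimal sketch. The dictionary keys and environment names (`gridworld`, `ithor`, `lr_main`, etc.) are my own labels, and the linear warmup schedule is an assumption (the paper only states a 100-step warmup period, not its shape); the numeric values are the ones quoted in the table.

```python
# Hedged sketch: per-environment training hyperparameters as quoted in the
# reproducibility table. Key names are illustrative, not from the repository.
TRAIN_CONFIGS = {
    "gridworld": {
        "lr_main": 3e-3,       # main model learning rate
        "lr_text_mlp": 3e-3,   # text MLP learning rate
        "batch_size": 384,
        "epochs": 300,
        "warmup_steps": 100,
        "sequence_length": 2,
    },
    "ithor": {
        "lr_main": 1e-3,
        "lr_text_mlp": 3e-3,
        "batch_size": 64,
        "epochs": 100,
        "warmup_steps": 100,
        "sequence_length": 2,
    },
}


def linear_warmup_lr(base_lr: float, step: int, warmup_steps: int) -> float:
    """Scale the learning rate linearly during warmup (assumed schedule),
    then hold it at the base value."""
    if step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps
```

This is only a reading aid for the quoted setup; the authors' actual optimizer and scheduler configuration lives in the published repository.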