Causal-aware Large Language Models: Enhancing Decision-Making Through Learning, Adapting and Acting
Authors: Wei Chen, Jiahao Zhang, Haipeng Zhu, Boyan Xu, Zhifeng Hao, Keli Zhang, Junjian Ye, Ruichu Cai
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across 22 diverse tasks within the open-world game Crafter validate the effectiveness of our proposed method. (Abstract) |
| Researcher Affiliation | Collaboration | 1 School of Computer Science, Guangdong University of Technology, Guangzhou, China; 2 College of Mathematics and Computer Science, Shantou University, Shantou, China; 3 Huawei Noah's Ark Lab, Huawei, Paris, France; 4 Peng Cheng Laboratory, Shenzhen, China. The authors are affiliated with academic institutions (Guangdong University of Technology, Shantou University, Peng Cheng Laboratory) and an industry lab (Huawei Noah's Ark Lab), indicating a collaboration. |
| Pseudocode | Yes | Algorithm 1: Causal-aware LLMs Framework<br>Require: Observations o. Ensure: Policy π.<br>1: for i in train epoch do<br>2: Learning Stage: learn a causal graph of the current environment information using an LLM from o.<br>3: Adapting Stage: update causal graph G using causal interventions.<br>4: Acting Stage: use the updated G to assist in guiding the RL agent's policy learning.<br>5: end for |
| Open Source Code | Yes | The code is available at: https://github.com/DMIRLAB-Group/Causal-aware LLMs (footnote 1 in the paper) |
| Open Datasets | Yes | We use the Crafter environment [Hafner, 2022] and Meta-Llama3-8B-Instruct as our base LLM to evaluate the performance of our framework. |
| Dataset Splits | No | The paper uses the interactive Crafter environment for evaluation across 22 tasks. It does not refer to a static dataset with explicit training, validation, or test splits in the traditional sense, but rather evaluates an agent's performance within this environment. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA GeForce 4090 with 24G |
| Software Dependencies | No | We use the Crafter environment [Hafner, 2022] and Meta-Llama3-8B-Instruct as our base LLM to evaluate the performance of our framework. While a specific LLM model is named, the paper does not provide version numbers for any other software libraries, frameworks, or the Crafter environment itself. |
| Experiment Setup | No | The paper describes the overall framework and stages but does not specify concrete hyperparameters like learning rate, batch size, specific optimizers, or detailed training schedules beyond mentioning evaluation at 1M and 5M steps. |
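The three-stage loop in Algorithm 1 (Learning, Adapting, Acting) can be sketched as follows. This is a minimal toy sketch, not the authors' implementation: the function names (`learn_causal_graph`, `intervene_and_update`, `act_with_graph`) and the edge-set representation of the causal graph are hypothetical stand-ins for the LLM-based graph learning, intervention-based updating, and RL policy guidance described in the paper.

```python
def learn_causal_graph(observations):
    """Learning stage: in the paper, an LLM proposes a causal graph from
    environment observations; here we just collect toy (cause, effect) edges."""
    return {(cause, effect) for cause, effect in observations}


def intervene_and_update(graph):
    """Adapting stage: causal interventions would verify and prune edges;
    here, dropping self-loops stands in for rejecting spurious edges."""
    return {(c, e) for c, e in graph if c != e}


def act_with_graph(graph):
    """Acting stage: the updated graph guides the RL agent's policy learning;
    here the 'policy' is simply the sorted list of edges it was given."""
    return sorted(graph)


def causal_aware_loop(observations, train_epochs=3):
    """One pass per training epoch through the Learning/Adapting/Acting stages."""
    policy = []
    for _ in range(train_epochs):
        graph = learn_causal_graph(observations)   # Learning
        graph = intervene_and_update(graph)        # Adapting
        policy = act_with_graph(graph)             # Acting
    return policy
```

For example, with Crafter-style toy observations such as `[("wood", "table"), ("table", "wood_pickaxe"), ("stone", "stone")]`, the self-loop edge is pruned during the adapting stage and the remaining edges guide the toy policy.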