Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards Empowerment Gain through Causal Structure Learning in Model-Based Reinforcement Learning

Authors: Hongye Cao, Fan Feng, Meng Fang, Shaokang Dong, Tianpei Yang, Jing Huo, Yang Gao

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate ECL combined with 3 causal discovery methods across 6 environments including both state-based and pixel-based tasks, demonstrating its performance gain compared to other causal MBRL methods, in terms of causal structure discovery, sample efficiency, and asymptotic performance in policy learning.
Researcher Affiliation Academia 1National Key Laboratory for Novel Software Technology, Nanjing University 2University of California, San Diego 3MBZUAI 4University of Liverpool 5School of Intelligence Science and Technology, Nanjing University
Pseudocode Yes Algorithm 1 lists the full pipeline of ECL below.
Open Source Code Yes We provide the source code of ECL in the supplementary material.
Open Datasets Yes Environments. We select 3 different environments for basic experimental evaluation. Chemical (Ke et al., 2021): The task is to discover the causal relationship (Chain, Collider & Full) of chemical ... Manipulation (Wang et al., 2022c): The task is to prove dynamics and policy for difficult settings ... Physical (Ke et al., 2021): a dense mode Physical environment. Furthermore, we also include 3 pixel-based environments of Modified Cartpole (Liu et al., 2024), RoboDesk (Wang et al., 2022a) and DeepMind Control (DMC) (Wang et al., 2022a) for evaluation in latent state environments.
Dataset Splits No The paper mentions generating OOD states: 'To create OOD states, we change object positions in the chemical environment and marker positions in the manipulation environment to unseen values, followed (Wang et al., 2022c).' However, it does not explicitly state specific training, validation, or test dataset splits in terms of percentages or sample counts for reproduction.
Hardware Specification Yes All experiments of this approach are implemented on 2 Intel(R) Xeon(R) Gold 6444Y and 4 NVIDIA RTX A6000 GPUs.
Software Dependencies No In our code, we have utilized the following libraries, each covered by its respective license agreements: PyTorch (BSD 3-Clause New or Revised License), NumPy (BSD 3-Clause New or Revised License), TensorFlow (Apache License 2.0), Robosuite (MIT License), Causal MBRL (MIT License), OpenAI Gym (MIT License), RoboDesk (Apache License 2.0), DeepMind Control (Apache License 2.0). Specific version numbers for these software dependencies are not provided.
Experiment Setup Yes We present the architectures of the proposed method across all environments in Table 3. For all activation functions, the Rectified Linear Unit (ReLU) is employed. Additionally, we summarize the hyperparameters for causal mask learning used in all environments for ECL-Con and ECL-Sco in Table 4. Regarding the other parameter settings, we adhered to the parameter configurations established in CDL (Wang et al., 2022c) and ASR (Huang et al., 2022). Moreover, the policy π_collect is trained with a reward function r = \tanh\big(\sum_{j=1}^{d_S} \log \frac{p(s^j_{t+1} \mid s_t, a_t)}{p(s^j_{t+1} \mid \mathrm{PA}_{s^j})}\big). ... We list the downstream task learning architectures of the proposed method across all environments in Table 5. We outline the parameter configurations for the reward predictor, as well as the settings employed for the cross-entropy method that is applied.
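The collection-policy reward quoted above (a tanh-squashed sum, over state dimensions, of the log-probability ratio between the full-model prediction and the causal-parent prediction) can be sketched in a few lines. This is an illustrative reading of the formula only, not the authors' implementation; the function name and the per-dimension log-probability inputs are assumptions.

```python
import numpy as np

def collect_reward(log_p_full, log_p_causal):
    """Hedged sketch of r = tanh(sum_j log [p(s^j_{t+1}|s_t,a_t) / p(s^j_{t+1}|PA_{s^j})]).

    log_p_full[j]   : log p(s^j_{t+1} | s_t, a_t), full-conditioning prediction
    log_p_causal[j] : log p(s^j_{t+1} | PA_{s^j}), causal-parent prediction
    The reward is large where the two models disagree, steering data
    collection toward transitions that are informative about the causal mask.
    """
    log_ratio = np.asarray(log_p_full) - np.asarray(log_p_causal)
    return float(np.tanh(log_ratio.sum()))
```

For example, per-dimension log-probs of [-1.0, -2.0] under the full model and [-1.5, -2.5] under the causal-parent model give a summed log-ratio of 1.0, so the reward is tanh(1.0).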