Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Empowerment Gain through Causal Structure Learning in Model-Based Reinforcement Learning
Authors: Hongye Cao, Fan Feng, Meng Fang, Shaokang Dong, Tianpei Yang, Jing Huo, Yang Gao
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ECL combined with 3 causal discovery methods across 6 environments including both state-based and pixel-based tasks, demonstrating its performance gain compared to other causal MBRL methods, in terms of causal structure discovery, sample efficiency, and asymptotic performance in policy learning. |
| Researcher Affiliation | Academia | (1) National Key Laboratory for Novel Software Technology, Nanjing University; (2) University of California, San Diego; (3) MBZUAI; (4) University of Liverpool; (5) School of Intelligence Science and Technology, Nanjing University |
| Pseudocode | Yes | Algorithm 1 lists the full pipeline of ECL below. |
| Open Source Code | Yes | We provide the source code of ECL in the supplementary material. |
| Open Datasets | Yes | Environments. We select 3 different environments for basic experimental evaluation. Chemical (Ke et al., 2021): The task is to discover the causal relationship (Chain, Collider & Full) of chemical ... Manipulation (Wang et al., 2022c): The task is to improve dynamics and policy for difficult settings ... Physical (Ke et al., 2021): a dense mode Physical environment. Furthermore, we also include 3 pixel-based environments of Modified Cartpole (Liu et al., 2024), RoboDesk (Wang et al., 2022a) and DeepMind Control (DMC) (Wang et al., 2022a) for evaluation in latent state environments. |
| Dataset Splits | No | The paper mentions generating OOD states: 'To create OOD states, we change object positions in the chemical environment and marker positions in the manipulation environment to unseen values, followed (Wang et al., 2022c).' However, it does not explicitly state specific training, validation, or test dataset splits in terms of percentages or sample counts for reproduction. |
| Hardware Specification | Yes | All experiments of this approach are implemented on 2 Intel(R) Xeon(R) Gold 6444Y and 4 NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | In our code, we have utilized the following libraries, each covered by its respective license agreements: PyTorch (BSD 3-Clause New or Revised License), NumPy (BSD 3-Clause New or Revised License), TensorFlow (Apache License 2.0), Robosuite (MIT License), Causal MBRL (MIT License), OpenAI Gym (MIT License), RoboDesk (Apache License 2.0), DeepMind Control (Apache License 2.0). Specific version numbers for these software dependencies are not provided. |
| Experiment Setup | Yes | We present the architectures of the proposed method across all environments in Table 3. For all activation functions, the Rectified Linear Unit (ReLU) is employed. Additionally, we summarize the hyperparameters for causal mask learning used in all environments for ECL-Con and ECL-Sco in Table 4. Regarding the other parameter settings, we adhered to the parameter configurations established in CDL (Wang et al., 2022c) and ASR (Huang et al., 2022). Moreover, the policy π_collect is trained with a reward function r = tanh(Σ_{j=1}^{d_S} log [p(s^j_{t+1} \| s_t, a_t) / p(s^j_{t+1} \| PA_{s^j})]). ... We list the downstream task learning architectures of the proposed method across all environments in Table 5. We outline the parameter configurations for the reward predictor, as well as the settings employed for the cross-entropy method that is applied. |
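The collection-policy reward quoted in the Experiment Setup row sums, over state dimensions, the log-likelihood ratio between a full dynamics model and a causal-parents-only model, then squashes with tanh. A minimal sketch of that computation (function name and inputs are hypothetical; the paper's models produce the log-likelihoods):

```python
import numpy as np

def collect_reward(logp_full, logp_parents):
    """Sketch of the ECL collection reward (hypothetical helper, not the authors' code).

    logp_full[j]    : log p(s^j_{t+1} | s_t, a_t)   from the full dynamics model
    logp_parents[j] : log p(s^j_{t+1} | PA_{s^j})   from the causal-parents-only model
    Returns tanh of the summed per-dimension log-likelihood ratios.
    """
    logp_full = np.asarray(logp_full, dtype=float)
    logp_parents = np.asarray(logp_parents, dtype=float)
    # log(a/b) = log a - log b, summed over the d_S state dimensions
    return float(np.tanh(np.sum(logp_full - logp_parents)))
```

When both models agree the reward is 0; the more the full model outperforms the parents-only model, the closer the reward gets to 1, which rewards visiting transitions where the learned causal structure is still incomplete.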