Causal Concept Graph Models: Beyond Causal Opacity in Deep Learning
Authors: Gabriele Dominici, Pietro Barbiero, Mateo Espinosa Zarlenga, Alberto Termine, Martin Gjoreski, Giuseppe Marra, Marc Langheinrich
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that Causal CGMs can: (i) match the generalisation performance of causally opaque models, (ii) enable human-in-the-loop corrections to mispredicted intermediate reasoning steps, boosting not just downstream accuracy after corrections but also the reliability of the explanations provided for specific instances, and (iii) support the analysis of interventional and counterfactual scenarios, thereby improving the model's causal interpretability and supporting the effective verification of its reliability and fairness. |
| Researcher Affiliation | Collaboration | Gabriele Dominici, Università della Svizzera italiana, EMAIL; Pietro Barbiero, IBM Research, EMAIL; Mateo Espinosa Zarlenga, University of Cambridge, EMAIL; Alberto Termine, IDSIA, EMAIL; Martin Gjoreski, Università della Svizzera italiana, EMAIL; Giuseppe Marra, KU Leuven, EMAIL; Marc Langheinrich, Università della Svizzera italiana, EMAIL |
| Pseudocode | No | The paper describes methods and processes in paragraph form and through equations but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The code related to this paper is publicly available: https://github.com/gabriele-dominici/CausalCGM |
| Open Datasets | Yes | To answer these questions, we use four datasets: (i) Checkmark, a synthetic dataset composed of four endogenous variables; (ii) dSprites, where endogenous variables correspond to object types together with their position, colour, and shape; (iii) CelebA, a facial recognition dataset where endogenous variables represent facial attributes; (iv) CIFAR10, an animal classification dataset where the endogenous variables are extracted automatically following Oikarinen et al. (2023). ... dSprites (Matthey et al., 2017) ... CelebA (Liu et al., 2015) ... CIFAR-10 (Krizhevsky, 2009) |
| Dataset Splits | No | The paper mentions using a 'validation set' to determine the optimal epoch for training, but it does not specify the exact percentages or counts for training, validation, or test splits. For example, in Section G.4: 'The optimal epoch for each was determined based on label accuracy on the validation set.' |
| Hardware Specification | Yes | All the experiments except the CIFAR10 ones were performed on a device equipped with an M3 Max and 36GB of RAM, without the use of a GPU. The CIFAR10 experiments were conducted on a workstation equipped with an NVIDIA RTX A6000 GPU, two AMD EPYC 7513 32-core processors, and 512 GB of RAM. |
| Software Dependencies | Yes | For our experiments, we implement all baselines and methods in Python 3.9 and relied upon open-source libraries such as PyTorch 2.0 (Paszke et al., 2019) (BSD license), PyTorch Lightning v2.1.2 (Apache License 2.0), and scikit-learn 1.2 (Pedregosa et al., 2011) (BSD license). In addition, we used Matplotlib 3.7 (Hunter, 2007) (BSD license) to produce the plots shown in this paper. |
| Experiment Setup | Yes | Hyperparameters: All baseline and proposed models were trained for varying epochs across different datasets: 500 for Checkmark, 200 for dSprites, 30 for CelebA, and 25 for CIFAR10. The optimal epoch for each was determined based on label accuracy on the validation set. A uniform learning rate of 0.01 was applied across all models and datasets. For the CBM and CEM models, both concept and task losses were equally weighted at 1. This weighting scheme was also applied to the loss terms for endogenous copies prediction, endogenous variables prediction (λ1), and graph priors (λ2). The weight assigned to the loss terms in our models to maximise CaCE is 0.05. Additionally, γ was treated as a learnable parameter, initialised at 0.1, and β was set to 1. All experiments were conducted using five different seeds (1, 2, 3, 4, 5). |
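The hyperparameters quoted above can be collected into a single configuration, which makes the reported setup easier to audit at a glance. This is a minimal illustrative sketch, not the authors' code: the `HPARAMS` dict and the `total_loss` helper are hypothetical names, and only the numeric values come from the paper's excerpt.

```python
# Hypothetical configuration mirroring the hyperparameters reported in the
# Experiment Setup row; structure and names are illustrative, values are quoted.
HPARAMS = {
    "epochs": {"Checkmark": 500, "dSprites": 200, "CelebA": 30, "CIFAR10": 25},
    "learning_rate": 0.01,          # uniform across all models and datasets
    "loss_weights": {
        "concept": 1.0,             # CBM/CEM concept loss
        "task": 1.0,                # downstream task loss
        "lambda_1": 1.0,            # endogenous variables prediction
        "lambda_2": 1.0,            # graph priors
        "cace": 0.05,               # term maximising CaCE
    },
    "gamma_init": 0.1,              # gamma is learnable, initialised at 0.1
    "beta": 1.0,
    "seeds": [1, 2, 3, 4, 5],       # five seeds per experiment
}

def total_loss(losses: dict) -> float:
    """Weighted sum of the individual loss terms (illustrative helper)."""
    return sum(HPARAMS["loss_weights"][name] * value
               for name, value in losses.items())
```

Grouping the weights this way also makes the equal weighting of the concept/task terms and the much smaller CaCE weight (0.05) immediately visible.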