Continuous Bayesian Model Selection for Multivariate Causal Discovery
Authors: Anish Dhir, Ruby Sedgwick, Avinash Kori, Ben Glocker, Mark Van Der Wilk
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments address two critical questions: (1) How much performance is lost compared to enumerating every causal graph? (2) How does our scalable Bayesian model selection compare to existing multivariate causal discovery methods in terms of performance? Our results show that Bayesian model selection outperforms methods enforcing strict identifiability even at larger scales. We compare our approach to various baselines that perform multivariate causal discovery to recover a DAG. We test our method on synthetic data generated from our model (Section 6.1), and data not generated from our model (Section 6.2). Then we test our model on a common semi-synthetic benchmark (Section 6.3). Additional experiments and results are in Appendix I. |
| Researcher Affiliation | Collaboration | 1Department of Computing, Imperial College London, London, UK 2Xyme, Oxford, UK 3Department of Computer Science, University of Oxford, Oxford, UK. |
| Pseudocode | Yes | Algorithm 1: Optimisation procedure for the Causal GP-CDE. Input: data X, number of random restarts Nr, acyclic penalty weighting update ρ, threshold τ, initial acyclic penalty weighting γ₀ = 0, initial hyperparameters Λ = {θ, σ}. Result: most likely adjacency matrix A. |
| Open Source Code | Yes | Code and data to replicate the experiments in this paper can be found at https://github.com/Anish144/ContinuousBMSStructureLearning. |
| Open Datasets | Yes | We use the data generated by Lachapelle et al. (2019), which was produced with the Syntren data generator (Van den Bulcke et al., 2006). Syntren is a gene-regulatory network simulator that generates gene expression data from real biological graphs. There are 10 datasets of 20 nodes with 500 samples each. The results on Syntren can be seen in Figure 2b; the CGP-CDE outperforms other methods, especially in terms of the SID and F1 scores. |
| Dataset Splits | No | The paper describes generating datasets with a certain number of samples (e.g., "We generate five datasets of 1000 samples for each of the six graphs" in Section 6.1, "For each graph, we sample 1000 data points" in Appendix F, and "10 datasets of 500 samples" for Syntren in Section 6.3). However, it does not explicitly provide details about how these generated datasets were split into training, validation, or test sets for the experiments conducted within the paper. There are no mentions of specific percentages (e.g., 80/10/10 split), absolute counts for splits, or references to predefined standard splits for evaluation. |
| Hardware Specification | Yes | The experiments in this paper were run on A100 and RTX 4090 GPUs. |
| Software Dependencies | No | The paper mentions several software tools and libraries used for baselines and its own implementation, such as Adam for optimization (Kingma & Ba, 2014), and the dodiscover, GraN-DAG, and sdcd implementations from GitHub. However, it does not provide version numbers for any of these software components (e.g., "Adam" is mentioned but not a specific release, and the other libraries are named or linked by URL without associated versions). |
| Experiment Setup | Yes | We use 400 inducing points for all datasets as we found this a reasonable trade-off between computation time and accuracy. We initialise the hyperparameters θ ∼ Uniform(0.01, 1), except θ_lin = 0.25. All kernel variances σ are initialised to 1... and likelihood variance initialised as ϕ² = 1/κ², where κ ∼ Uniform(50, 100)... We use a minibatch size of 256 to calculate the loss. We take 50 Monte Carlo samples... The weighting γ_t of the acyclic penalty term h(A) is increased linearly each epoch by ρ = 50... We use 50 power iterations... For the 3, 20 and 50 variable datasets we scale ρ with the number of variables, such that ρ = 5D... First, we optimise the parameters... We linearly increase the natural gradient step size from 0.0001 to 0.1 over the first five iterations, and then use a step size of 0.1. Second, we optimise the Gaussian process hyperparameters using Adam with a learning rate of 0.05... Warm-up phase: T₀ = 25,000 iterations... Acyclic constraint phase: Tₐ = 50,000 iterations... Cool-down phase: T_f = 25,000 cool-down iterations. |
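The quoted Algorithm 1 only lists its inputs and output. A minimal sketch of the kind of loop it describes — penalised optimisation with random restarts, a linearly increasing acyclic-penalty weight, and a final threshold — might look like the following. All function names here (`loss_and_grad`, `fit_causal_graph`) and the placeholder gradient step are hypothetical; the paper's actual GP objective and natural-gradient updates are not reproduced.

```python
import numpy as np

def fit_causal_graph(X, n_restarts, rho, tau, n_epochs, loss_and_grad, h):
    """Sketch of an Algorithm-1-style procedure (hypothetical names).

    `loss_and_grad` stands in for the model's loss and gradient w.r.t.
    the adjacency-like parameters A; `h` is a differentiable acyclicity
    penalty with h(A) = 0 iff A encodes a DAG.
    """
    best_A, best_loss = None, np.inf
    D = X.shape[1]
    for _ in range(n_restarts):
        A = np.random.uniform(0.0, 0.1, size=(D, D))  # restart initialisation
        gamma = 0.0                                   # acyclic-penalty weight
        loss = np.inf
        for epoch in range(n_epochs):
            loss, grad = loss_and_grad(X, A, gamma, h)
            A = A - 0.05 * grad       # placeholder gradient step
            gamma += rho              # increased linearly each epoch
        if loss < best_loss:
            best_A, best_loss = A, loss
    # Threshold the learned weights at tau to get a binary adjacency matrix.
    return (np.abs(best_A) > tau).astype(int)
```

The restart that achieves the lowest penalised loss is kept, matching the "most likely adjacency matrix" output named in the algorithm.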
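Two details in the setup excerpt lend themselves to short sketches: the "50 power iterations" for the acyclic penalty, and the linear ramp of the natural-gradient step size. The exact penalty is not quoted, so the spectral-radius formulation below (ρ(A∘A) = 0 iff A is a DAG adjacency, estimated by power iteration) is an assumption about the general technique, not the paper's implementation; `natgrad_step_size` likewise only encodes the quoted 0.0001 → 0.1 ramp over five iterations.

```python
import numpy as np

def spectral_radius_penalty(A, n_iter=50, eps=1e-12):
    """Estimate rho(A*A) by power iteration (assumed penalty form).

    For a DAG adjacency, A*A is nilpotent, so repeated multiplication
    drives the iterates to zero and the penalty is 0; any cycle keeps
    the spectral radius strictly positive.
    """
    B = A * A                      # elementwise square: nonnegative weights
    D = B.shape[0]
    u = np.ones(D) / np.sqrt(D)    # left iterate
    v = np.ones(D) / np.sqrt(D)    # right iterate
    for _ in range(n_iter):
        u_new, v_new = B.T @ u, B @ v
        nu, nv = np.linalg.norm(u_new), np.linalg.norm(v_new)
        if nu < eps or nv < eps:
            return 0.0             # powers vanished: graph is acyclic
        u, v = u_new / nu, v_new / nv
    # Rayleigh-quotient-style estimate of the dominant eigenvalue
    return float(u @ B @ v / max(u @ v, eps))

def natgrad_step_size(it, start=1e-4, end=0.1, warmup=5):
    """Linear ramp of the natural-gradient step size over `warmup` iterations."""
    if it >= warmup:
        return end
    return start + (end - start) * it / warmup
```

For example, a two-cycle adjacency gives a penalty near 1, while any upper-triangular (acyclic) matrix gives 0.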