Continuous Bayesian Model Selection for Multivariate Causal Discovery
Authors: Anish Dhir, Ruby Sedgwick, Avinash Kori, Ben Glocker, Mark Van Der Wilk
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments address two critical questions: (1) How much performance is lost compared to enumerating every causal graph? (2) How does our scalable Bayesian model selection compare to existing multivariate causal discovery methods in terms of performance? Our results show that Bayesian model selection outperforms methods enforcing strict identifiability even at larger scales. We compare our approach to various baselines that perform multivariate causal discovery to recover a DAG. We test our method on synthetic data generated from our model (Section 6.1), and data not generated from our model (Section 6.2). Then we test our model on a common semi-synthetic benchmark (Section 6.3). Additional experiments and results are in Appendix I. |
| Researcher Affiliation | Collaboration | 1Department of Computing, Imperial College London, London, UK 2Xyme, Oxford, UK 3Department of Computer Science, University of Oxford, Oxford, UK. |
| Pseudocode | Yes | Algorithm 1: Optimisation procedure for the Causal GP-CDE. Input: data X, number of random restarts Nr, acyclic penalty weighting update ρ, threshold τ, initial acyclic penalty weighting γ₀ = 0, initial hyperparameters Λ = {θ, σ}. Result: most likely adjacency matrix A. |
| Open Source Code | Yes | Code and data to replicate the experiments in this paper can be found at https://github.com/Anish144/ContinuousBMSStructureLearning. |
| Open Datasets | Yes | We use the data generated by Lachapelle et al. (2019), which was produced with the Syntren data generator (Van den Bulcke et al., 2006). Syntren is a gene-regulatory network simulator that generates gene expression data from real biological graphs. There are 10 datasets of 20 nodes with 500 samples each. The results on Syntren can be seen in Figure 2b; the CGP-CDE outperforms other methods, especially in terms of the SID and F1 scores. |
| Dataset Splits | No | The paper describes generating datasets with a certain number of samples (e.g., "We generate five datasets of 1000 samples for each of the six graphs" in Section 6.1, "For each graph, we sample 1000 data points" in Appendix F, and "10 datasets of 500 samples" for Syntren in Section 6.3). However, it does not explicitly provide details about how these generated datasets were split into training, validation, or test sets for the experiments conducted within the paper. There are no mentions of specific percentages (e.g., 80/10/10 split), absolute counts for splits, or references to predefined standard splits for evaluation. |
| Hardware Specification | Yes | The experiments in this paper were run on A100 and RTX 4090 GPUs. |
| Software Dependencies | No | The paper mentions several software tools and libraries used for baselines and its own implementation, such as Adam for optimization (Kingma & Ba, 2014), and the dodiscover, GraN-DAG, and sdcd implementations from GitHub. However, it does not provide version numbers for any of these software components (e.g., "Adam" is mentioned but not a specific release, and the other libraries are named or linked by URL without associated versions). |
| Experiment Setup | Yes | We use 400 inducing points for all datasets as we found this a reasonable trade-off between computation time and accuracy. We initialise the hyperparameters θ ∼ Uniform(0.01, 1), except θ_lin = 0.25. All kernel variances σ are initialised to 1... and likelihood variance initialised as ϕ² = 1/κ², where κ ∼ Uniform(50, 100)... We use a minibatch size of 256 to calculate the loss. We take 50 Monte Carlo samples... The weighting γ_t of the acyclic penalty term h(A) is increased linearly each epoch by ρ = 50... We use 50 power iterations... For the 3, 20 and 50 variable datasets we scale ρ with the number of variables, such that ρ = 5D... First, we optimise the parameters... We linearly increase the natural gradient step size from 0.0001 to 0.1 over the first five iterations, and then use a step size of 0.1. Second, we optimise the Gaussian process hyperparameters using Adam with a learning rate of 0.05... Warm-up phase: T₀ = 25,000 iterations... Acyclic constraint phase: Tₐ = 50,000 iterations... Cool-down phase: T_f = 25,000 cool-down iterations. |
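The quoted Algorithm 1 only lists its inputs and output. A minimal sketch of the kind of loop it describes — penalised optimisation with random restarts, a linearly increasing acyclic-penalty weight, and a final threshold — might look like the following. All function names here (`loss_and_grad`, `fit_causal_graph`) and the placeholder gradient step are hypothetical; the paper's actual GP objective and natural-gradient updates are not reproduced.

```python
import numpy as np

def fit_causal_graph(X, n_restarts, rho, tau, n_epochs, loss_and_grad, h):
    """Sketch of an Algorithm-1-style procedure (hypothetical names).

    `loss_and_grad` stands in for the model's loss and gradient w.r.t.
    the adjacency-like parameters A; `h` is a differentiable acyclicity
    penalty with h(A) = 0 iff A encodes a DAG.
    """
    best_A, best_loss = None, np.inf
    D = X.shape[1]
    for _ in range(n_restarts):
        A = np.random.uniform(0.0, 0.1, size=(D, D))  # restart initialisation
        gamma = 0.0                                   # acyclic-penalty weight
        loss = np.inf
        for epoch in range(n_epochs):
            loss, grad = loss_and_grad(X, A, gamma, h)
            A = A - 0.05 * grad       # placeholder gradient step
            gamma += rho              # increased linearly each epoch
        if loss < best_loss:
            best_A, best_loss = A, loss
    # Threshold the learned weights at tau to get a binary adjacency matrix.
    return (np.abs(best_A) > tau).astype(int)
```

The restart that achieves the lowest penalised loss is kept, matching the "most likely adjacency matrix" output named in the algorithm.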
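Two details in the setup excerpt lend themselves to short sketches: the "50 power iterations" for the acyclic penalty, and the linear ramp of the natural-gradient step size. The exact penalty is not quoted, so the spectral-radius formulation below (ρ(A∘A) = 0 iff A is a DAG adjacency, estimated by power iteration) is an assumption about the general technique, not the paper's implementation; `natgrad_step_size` likewise only encodes the quoted 0.0001 → 0.1 ramp over five iterations.

```python
import numpy as np

def spectral_radius_penalty(A, n_iter=50, eps=1e-12):
    """Estimate rho(A*A) by power iteration (assumed penalty form).

    For a DAG adjacency, A*A is nilpotent, so repeated multiplication
    drives the iterates to zero and the penalty is 0; any cycle keeps
    the spectral radius strictly positive.
    """
    B = A * A                      # elementwise square: nonnegative weights
    D = B.shape[0]
    u = np.ones(D) / np.sqrt(D)    # left iterate
    v = np.ones(D) / np.sqrt(D)    # right iterate
    for _ in range(n_iter):
        u_new, v_new = B.T @ u, B @ v
        nu, nv = np.linalg.norm(u_new), np.linalg.norm(v_new)
        if nu < eps or nv < eps:
            return 0.0             # powers vanished: graph is acyclic
        u, v = u_new / nu, v_new / nv
    # Rayleigh-quotient-style estimate of the dominant eigenvalue
    return float(u @ B @ v / max(u @ v, eps))

def natgrad_step_size(it, start=1e-4, end=0.1, warmup=5):
    """Linear ramp of the natural-gradient step size over `warmup` iterations."""
    if it >= warmup:
        return end
    return start + (end - start) * it / warmup
```

For example, a two-cycle adjacency gives a penalty near 1, while any upper-triangular (acyclic) matrix gives 0.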