A Meta-Learning Approach to Bayesian Causal Discovery

Authors: Anish Dhir, Matthew Ashman, James Requeima, Mark van der Wilk

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare our meta-Bayesian causal discovery against existing Bayesian causal discovery methods, demonstrating the advantages of directly learning a posterior over causal structure. We show that the BCNP model generates accurate posterior samples compared to previous Bayesian meta-learning approaches (section 4.1). We also demonstrate that the BCNP model outperforms explicit Bayesian models, as well as other meta-learning models, when the model is trained on the correct data distribution (section 4.2), and also when the data distribution of a dataset is unknown (section 4.3).
Researcher Affiliation | Academia | Anish Dhir, Imperial College London (EMAIL); Matthew Ashman, University of Cambridge; James Requeima, University of Toronto; Mark van der Wilk, University of Oxford
Pseudocode | No | The paper describes the architecture of the encoder and decoder with text and a computational graph (Figure 2), but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | We provide code for our method: Causal Structure Neural Process.
Open Datasets | Yes | We generate graphs from Erdős-Rényi (ER) distributions (Erdős et al., 1960) with varying densities, with expected edge counts of 20, 40, and 60. A commonly used simulator is the SynTReN generator (Van den Bulcke et al., 2006), which generates gene expression data that matches known experimental data.
Dataset Splits | No | The paper describes the generation of datasets for training and testing (e.g., "We generate 200,000 datasets in total with 1,000 samples each" and "For training the Bayesian meta-learning models, we generate 500,000 datasets. Each test set contains 25 datasets."), but does not specify train/validation/test splits within a single dataset, nor does it provide file names or specific instructions for splitting a fixed dataset.
Hardware Specification | No | The paper mentions "bfloat16" for memory reduction, which hints at hardware capabilities, but no specific GPU or CPU models, or other hardware specifications, are explicitly described for running the experiments.
Software Dependencies | No | The paper mentions optimizers like Adam (Kingma & Ba, 2014) and inference algorithms like SG-MCMC (Ma et al., 2015) and Stein variational gradient descent (Liu & Wang, 2016). However, it does not provide specific version numbers for software libraries or programming languages (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | For all the models, we use Adam (Kingma & Ba, 2014) with a learning rate of 10^-4 and a batch size of 64. We train for 2 epochs and use a linear warmup for 10% of the total iterations. All models used 4 layers in their encoder, while BCNP and CSIvA used 4 decoder layers. For BCNP, we used 100 permutation samples to approximate the loss, and a maximum of 1,000 Sinkhorn iterations. For AVICI and BCNP, we use a width of 512 for the attention layers and a width of 1024 for the feedforward layers. Due to memory constraints of autoregressive generation, for CSIvA we use a width of 256 for the attention and 512 for the feedforward layers. We use 8 attention heads for each model. We use Adam (Kingma & Ba, 2014) with a learning rate of 10^-4 with a linear warmup of 10% of the total iterations. We use a batch size of 32 for AVICI and BCNP, and a batch size of 8 for CSIvA. Table 5 and Table 6 provide detailed hyperparameters for the DiBS and BayesDAG models, respectively.
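The learning-rate schedule quoted above (Adam at 10^-4 with a linear warmup over the first 10% of total iterations) can be sketched as a per-step function. This is a minimal illustration, not the authors' code: the function name and the assumption that the rate stays constant after warmup are ours; only the base rate and warmup fraction come from the paper.

```python
def linear_warmup_lr(step, total_steps, base_lr=1e-4, warmup_frac=0.10):
    """Learning rate at a given 0-indexed step.

    Ramps linearly from ~0 up to `base_lr` over the first
    `warmup_frac` of `total_steps`, then holds `base_lr`
    (the constant tail is an assumption; the paper only
    states the warmup).
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp: reaches base_lr exactly at the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    return base_lr


# Example: 1,000 total steps -> 100 warmup steps.
print(linear_warmup_lr(0, 1000))    # first step: 1% of base_lr
print(linear_warmup_lr(99, 1000))   # end of warmup: base_lr
print(linear_warmup_lr(500, 1000))  # after warmup: base_lr
```

In a training loop this value would be assigned to the optimizer's learning rate before each update (e.g., via a scheduler hook in whichever framework is used).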