Expressiveness of Parametrized Distributions over DAGs for Causal Discovery
Authors: Simon Rittel, Sebastian Tschiatschek
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we focus on the expressiveness of parametrized distributions over DAGs in the context of causal structure learning and show several limitations of candidate models in a theoretical analysis and validate them empirically in relevant supervised settings. |
| Researcher Affiliation | Academia | Simon Rittel (EMAIL): Department of Statistics, LMU Munich, Germany; Munich Center for Machine Learning, Germany; UniVie Doctoral School Computer Science, Austria. Sebastian Tschiatschek (EMAIL): Faculty of Computer Science, University of Vienna, Austria; Research Network Data Science, University of Vienna, Austria. |
| Pseudocode | No | The paper describes generative models and outlines steps using mathematical notation and figures, but does not include a distinct "Pseudocode" or "Algorithm" block. |
| Open Source Code | Yes | ARCO-DAG: The autoregressive neural network of ARCO-DAG consists of a simple two-layer perceptron with HN = 30 hidden neurons and ReLU activations and follows the official implementation on https://github.com/chritoth/bci-arco-gp/. GFlowNet-DAG: The transformer architecture for the GFlowNet-DAG model follows the official implementation provided on https://github.com/tristandeleu/jax-dag-gflownet |
| Open Datasets | No | The target distribution is either derived from the MEC in Example 1, by the coupling of edges from Example 2, or a synthetically generated distribution that arises from concentrating the probability mass around a target graph based on the structural Hamming distance (SHD). In the absence of an analytic posterior that motivates such similarity, we generate a synthetic target distribution around the assumed maximum-a-posteriori (MAP) graph G depicted in Figure 5 that has positive support for all 543 possible DAGs with 4 nodes. |
| Dataset Splits | No | In the supervised setting, we minimize the forward KL divergence between the target distribution and the model distribution using the Adam Optimizer with decoupled weight decay (Loshchilov & Hutter, 2019) over 1000 optimization steps. For training of the parameters ϕ with gradient descent, we take the forward Kullback-Leibler (KL) divergence between the target distribution p_G and the candidate distribution q_G as the loss function and approximate it using samples from the target distribution. |
| Hardware Specification | Yes | The computations were conducted on a 11th Gen. Intel(R) Core i7-1165G7 processor with 2.80 GHz, 4 cores and 8 logical processing units paired with 32 GB of DDR SDRAM. |
| Software Dependencies | No | The paper mentions the "Adam Optimizer" and specific model architectures (e.g., "multilayer perceptron", "transformer architecture") but does not provide specific version numbers for any software libraries or frameworks. |
| Experiment Setup | Yes | In the supervised setting, we minimize the forward KL divergence between the target distribution and the model distribution using the Adam Optimizer with decoupled weight decay (Loshchilov & Hutter, 2019) over 1000 optimization steps. Further details and the used hyperparameters for each model are provided in section D.2. Table 5: Hyperparameters, (a) for the experiments in sections 5.1 and 5.2: RPM-DAG (learning rate 0.5, 25 forward-KL samples, 1 IS sample for training, 100 for evaluation); ARCO-DAG (learning rate 0.5, 25, 1, 100); GFlowNet-DAG (learning rate 0.001, 25, 10, 100). |
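The supervised setup quoted above (minimizing the forward KL between a target distribution over DAGs and a parametrized candidate distribution, approximated from samples, over 1000 steps) can be sketched in a minimal form. This is an illustration only: the five-element graph set, the target probabilities, and the plain-SGD update are hypothetical stand-ins (the paper enumerates all 543 DAGs on 4 nodes and uses Adam with decoupled weight decay); the sample size of 25 per step, the 1000 steps, and the 0.5 learning rate echo the quoted hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an enumerated set of DAGs (the paper uses all
# 543 DAGs on 4 nodes); p plays the role of the target p_G.
n_graphs = 5
p = np.array([0.5, 0.2, 0.15, 0.1, 0.05])  # hypothetical target

# Softmax-parametrized candidate distribution q_G with logits phi.
phi = np.zeros(n_graphs)

def q(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

lr = 0.5          # learning rate from Table 5 (RPM-DAG / ARCO-DAG)
for step in range(1000):  # 1000 optimization steps, as quoted
    # Approximate forward KL(p || q) with 25 samples from the target;
    # minimizing it equals maximizing E_p[log q_phi].
    samples = rng.choice(n_graphs, size=25, p=p)
    p_hat = np.bincount(samples, minlength=n_graphs) / 25
    # Gradient of -E_p_hat[log q_phi] w.r.t. softmax logits.
    grad = q(phi) - p_hat
    phi -= lr * grad

kl = np.sum(p * np.log(p / q(phi)))  # remaining forward KL
```

After training, `q(phi)` concentrates near the target `p` and the residual forward KL is small (nonzero only because of the 25-sample gradient noise); the paper's models replace the free softmax logits with structured DAG parametrizations, which is exactly where the expressiveness limitations it studies arise.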