Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Distinguishing Cause from Effect with Causal Velocity Models
Authors: Johnny Xi, Hugh Dance, Peter Orbanz, Benjamin Bloem-Reddy
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method empirically on new synthetic datasets as well as on existing benchmarks from the literature (Mooij et al., 2016; Immer et al., 2023). We study the relative performance of different velocity model classes, including velocity-based parametrizations of known classes. In Section 7.3, we show empirically that the performance of our method hinges on the accuracy of the score estimators. When the true score is given, our method achieves perfect causal discovery with as few as n = 100 samples (Figure 4). |
| Researcher Affiliation | Academia | 1Department of Statistics, University of British Columbia 2Gatsby Unit, University College London. Correspondence to: Johnny Xi <EMAIL>. |
| Pseudocode | No | The paper describes methods in prose and mathematical formulations. There are no explicit blocks labeled "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | Code supporting our experiments can be found on Github at https://github.com/xijohnny/causal-velocity. |
| Open Datasets | Yes | We evaluate our method empirically on new synthetic datasets as well as on existing benchmarks from the literature (Mooij et al., 2016; Immer et al., 2023). ... we also evaluate our method on the SIM-series of simulated benchmarks and the Tübingen Cause-Effect pairs of Mooij et al. (2016). |
| Dataset Splits | Yes | All synthetic benchmarks can be described as generating from an SCM Y = fθ(X, ϵy). In each case 100 samples are drawn from θ ∼ N(0, σθ²) to generate the 100 datasets. ... All benchmarks have a sample size of n = 1000 besides the Tübingen dataset. To minimize hyperparameter search, all datasets are sub-sampled, or re-sampled if necessary, to a uniform size of n = 1000. |
| Hardware Specification | Yes | Experiments are written in JAX (Bradbury et al., 2018) and carried out either on a M1 Mac or NVIDIA V100 GPU. |
| Software Dependencies | No | Experiments are written in JAX (Bradbury et al., 2018) and carried out either on a M1 Mac or NVIDIA V100 GPU. The paper mentions JAX but does not specify a version number for the software used. |
| Experiment Setup | Yes | Given an estimate of the score, we minimize (18) using the Adam optimizer (Kingma & Ba, 2015) to estimate the velocity. ... we penalize the complexity of the model via higher order derivative terms dᵏy(x)/dxᵏ ... In practice we use k = 2. ... Adam optimizer with a base learning rate of 0.1, scaled by a factor of 1/log(# of parameters). ... 3 layer fully connected MLPs with a hidden size of 64, and tanh activation functions. ... For the Stein score estimate, we use the suggested regularization parameter λ = 0.1 (Li & Turner, 2018). For the KDE, we use a regularization parameter of ϵ = n⁻² as suggested in (Wibisono et al., 2024). |
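The learning-rate rule quoted above (base rate 0.1 scaled by 1/log(# of parameters)) is concrete enough to sketch. The snippet below is an illustration of that rule, not the authors' code: the layer sizes assume three hidden layers of width 64 with scalar input and output (reasonable for a bivariate cause-effect pair, but not stated explicitly in the excerpt).

```python
import math

def mlp_param_count(sizes):
    """Total weights + biases of a fully connected MLP with layer sizes `sizes`."""
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

def scaled_lr(base_lr, n_params):
    """Adam base learning rate scaled by 1/log(# of parameters), per the paper."""
    return base_lr / math.log(n_params)

# Assumed architecture: scalar in/out, three hidden layers of width 64 (tanh).
sizes = [1, 64, 64, 64, 1]
n_params = mlp_param_count(sizes)
lr = scaled_lr(0.1, n_params)
print(n_params, lr)  # parameter count and effective Adam learning rate
```

Under these assumptions the network has 8,513 parameters, so the effective learning rate is roughly 0.1/9.05 ≈ 0.011.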