Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Distinguishing Cause from Effect with Causal Velocity Models
Authors: Johnny Xi, Hugh Dance, Peter Orbanz, Benjamin Bloem-Reddy
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method empirically on new synthetic datasets as well as on existing benchmarks from the literature (Mooij et al., 2016; Immer et al., 2023). We study the relative performance of different velocity model classes, including velocity-based parametrizations of known classes. In Section 7.3, we show empirically that the performance of our method hinges on the accuracy of the score estimators. When the true score is given, our method achieves perfect causal discovery with as few as n = 100 samples (Figure 4). |
| Researcher Affiliation | Academia | 1Department of Statistics, University of British Columbia 2Gatsby Unit, University College London. Correspondence to: Johnny Xi <EMAIL>. |
| Pseudocode | No | The paper describes methods in prose and mathematical formulations. There are no explicit blocks labeled "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | Code supporting our experiments can be found on Github at https://github.com/xijohnny/causal-velocity. |
| Open Datasets | Yes | We evaluate our method empirically on new synthetic datasets as well as on existing benchmarks from the literature (Mooij et al., 2016; Immer et al., 2023). ... we also evaluate our method on the SIM-series of simulated benchmarks and the Tübingen Cause-Effect pairs of Mooij et al. (2016). |
| Dataset Splits | Yes | All synthetic benchmarks can be described as generating from an SCM Y = fθ(X, ϵy). In each case 100 samples are drawn from θ ∼ N(0, σθ²) to generate the 100 datasets. ... All benchmarks have a sample size of n = 1000 besides the Tübingen dataset. To minimize hyperparameter search, all datasets are sub-sampled, or re-sampled if necessary, to a uniform size of n = 1000. |
| Hardware Specification | Yes | Experiments are written in JAX (Bradbury et al., 2018) and carried out either on a M1 Mac or NVIDIA V100 GPU. |
| Software Dependencies | No | Experiments are written in JAX (Bradbury et al., 2018) and carried out either on a M1 Mac or NVIDIA V100 GPU. The paper mentions JAX but does not specify a version number for the software used. |
| Experiment Setup | Yes | Given an estimate of the score, we minimize (18) using the Adam optimizer (Kingma & Ba, 2015) to estimate the velocity. ... we penalize the complexity of the model via higher order derivative terms dᵏy(x)/dxᵏ ... In practice we use k = 2. ... Adam optimizer with a base learning rate of 0.1, scaled by a factor of 1/log(# of parameters). ... 3 layer fully connected MLPs with a hidden size of 64, and tanh activation functions. ... For the Stein score estimate, we use the suggested regularization parameter λ = 0.1 (Li & Turner, 2018). For the KDE, we use a regularization parameter of ϵ = n⁻² as suggested in (Wibisono et al., 2024). |
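The learning-rate rule quoted above (base rate 0.1 scaled by 1/log(# of parameters)) is concrete enough to sketch. The snippet below is an illustration of that rule, not the authors' code: the layer sizes assume three hidden layers of width 64 with scalar input and output (reasonable for a bivariate cause-effect pair, but not stated explicitly in the excerpt).

```python
import math

def mlp_param_count(sizes):
    """Total weights + biases of a fully connected MLP with layer sizes `sizes`."""
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

def scaled_lr(base_lr, n_params):
    """Adam base learning rate scaled by 1/log(# of parameters), per the paper."""
    return base_lr / math.log(n_params)

# Assumed architecture: scalar in/out, three hidden layers of width 64 (tanh).
sizes = [1, 64, 64, 64, 1]
n_params = mlp_param_count(sizes)
lr = scaled_lr(0.1, n_params)
print(n_params, lr)  # parameter count and effective Adam learning rate
```

Under these assumptions the network has 8,513 parameters, so the effective learning rate is roughly 0.1/9.05 ≈ 0.011.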