Causal Ordering for Structure Learning from Time Series
Authors: Pedro Sanchez, Damian Machlanski, Steven McDonagh, Sotirios A. Tsaftaris
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate the approach. Empirical evaluations on synthetic and real-world datasets demonstrate that DOTS outperforms state-of-the-art baselines, offering a scalable and robust approach to temporal causal discovery. On synthetic benchmarks spanning d = 3–6 variables, T = 200–5,000 samples, and up to three lags, DOTS improves mean window-graph F1 from 0.63 (best baseline) to 0.81. On the CausalTime real-world benchmark (Medical, AQI, Traffic; d = 20–36), while baselines remain the best on individual datasets, DOTS attains the highest average summary-graph F1 while halving runtime relative to graph-optimisation methods. |
| Researcher Affiliation | Academia | Pedro P. Sanchez EMAIL School of Engineering, University of Edinburgh, UK; Damian Machlanski EMAIL School of Engineering, University of Edinburgh, UK, Causality in Healthcare AI Hub (CHAI), UK; Steven McDonagh EMAIL School of Engineering, University of Edinburgh, UK, Causality in Healthcare AI Hub (CHAI), UK; Sotirios A. Tsaftaris EMAIL School of Engineering, University of Edinburgh, UK, Causality in Healthcare AI Hub (CHAI), UK |
| Pseudocode | Yes | Algorithm 1 Estimating Multi-Scale Causal Orderings. |
| Open Source Code | Yes | Code is available at https://github.com/CHAI-UK/DOTS. |
| Open Datasets | Yes | We also perform experiments on datasets closer to real-life complexities. To achieve this, we incorporate CausalTime (Cheng et al., 2024), a realistic benchmark for time series causal discovery. CausalTime provides three datasets: Air Quality Index (AQI), Traffic, and Medical. ... Medical: N=20 vital-sign and chart-event channels extracted from 1000 MIMIC-IV ICU stays, resampled to 2-h resolution (T=600 on average). |
| Dataset Splits | No | The paper describes how synthetic data is generated and how real-world datasets are combined and pre-processed for use (e.g., combining 480 samples of length T=40 into a single dataset of length T=19679). However, it does not explicitly provide information on dataset splits such as training, validation, or test sets with specific percentages, counts, or references to predefined splits for model learning. The evaluation compares predicted edges to ground truth, implying the entire dataset is used for analysis without explicit splits for learning. |
| Hardware Specification | No | The paper includes experimental results and runtime analysis (e.g., Figure 8 showing average runtime on synthetic data), but it does not provide specific details about the hardware used to run these experiments (e.g., GPU models, CPU types, or memory configurations). |
| Software Dependencies | No | Table 4 provides a "Summary of source code used to run the methods in the experiments", listing the methods and their respective GitHub repositories. However, it does not specify version numbers for any of these libraries, underlying programming languages (e.g., Python), or other essential software dependencies required for reproducibility. |
| Experiment Setup | Yes | Table 3: Summary of hyperparameters of all methods used in the experiments. This table lists specific hyperparameters and their values for various methods, including CAM (alpha = 0.05), SCORE (α = 0.05, ηG = 0.001, ηH = 0.001), TCDF (epochs = 5000, layers = 2, lr = 0.01, kernel_size = 4, dilation = 4, significance = 0.8), DiffAN (steps = 100, nn_depth = 3, batch_size = 1024, early_stop = 300, lr = 0.001), and DOTS (steps = 100, nn_depth = 3, batch_size = 1024, early_stop = 300, lr = 0.001, n_ord = 10). |
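For anyone scripting a rerun, the hyperparameters quoted from Table 3 can be collected into a plain configuration mapping. This is an illustrative sketch only: the dictionary structure, key names, and the `config_for` helper are assumptions made here for convenience, not code from the paper's repository; only the values are taken from the table above.

```python
# Hyperparameter values as reported in Table 3 of the paper.
# The mapping layout and key spellings are assumptions for scripting
# convenience; consult https://github.com/CHAI-UK/DOTS for the real config.
CONFIGS = {
    "CAM": {"alpha": 0.05},
    "SCORE": {"alpha": 0.05, "eta_G": 0.001, "eta_H": 0.001},
    "TCDF": {"epochs": 5000, "layers": 2, "lr": 0.01,
             "kernel_size": 4, "dilation": 4, "significance": 0.8},
    "DiffAN": {"steps": 100, "nn_depth": 3, "batch_size": 1024,
               "early_stop": 300, "lr": 0.001},
    "DOTS": {"steps": 100, "nn_depth": 3, "batch_size": 1024,
             "early_stop": 300, "lr": 0.001, "n_ord": 10},
}

def config_for(method: str) -> dict:
    """Return a copy of the recorded hyperparameters for a method."""
    return dict(CONFIGS[method])
```

Note that DOTS shares the DiffAN settings plus an extra `n_ord = 10`, consistent with the paper's description of DOTS as estimating multiple causal orderings.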