CoDy: Counterfactual Explainers for Dynamic Graphs

Authors: Zhan Qu, Daniel Gomm, Michael Färber

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments against state-of-the-art factual and counterfactual baselines demonstrate CoDy's effectiveness, with improvements of 16% in AUFSC+ over the strongest baseline. Our code is available at: https://github.com/daniel-gomm/CoDy
Researcher Affiliation | Collaboration | *Equal contribution. 1TU Dresden, Dresden, Germany; 2ScaDS.AI, Dresden, Germany; 3Karlsruhe Institute of Technology, Karlsruhe, Germany; 4University of Amsterdam, Amsterdam, Netherlands; 5Centrum Wiskunde en Informatica, Amsterdam, Netherlands. Correspondence to: Daniel Gomm <EMAIL>, Zhan Qu <EMAIL>.
Pseudocode | Yes | Algorithm 1: Search algorithm of CoDy. Input: TGNN model f, input graph G, explained event ε_i, selection policy δ, max iterations it_max. Output: best explanation found. p_orig ← f(G(t_i), ε_i); n_root ← ( , null, null, , 0, null, 1); it ← 0; while it < it_max and n_root is selectable do: n_selected ← select(n_root, δ); simulate(n_selected, f, G, ε_i); expand(n_selected, p_orig); backpropagate(parent_selected); it ← it + 1; end; n_best ← select_best(n_root); return s_best
Open Source Code | Yes | Our code is available at: https://github.com/daniel-gomm/CoDy
Open Datasets | Yes | We evaluate on three datasets: Wikipedia (Kumar et al., 2019), UCI-Messages (Kunegis, 2013), and UCI-Forums (Kunegis, 2013).
Dataset Splits | No | The paper mentions performance for Transductive/Inductive settings and evaluates instances where the TGNN makes correct versus incorrect predictions, but it does not specify explicit training/validation/test split percentages, sample counts, or refer to standard predefined splits with specific details (e.g., "80/10/10 split" or "standard splits from X benchmark").
Hardware Specification | Yes | For replicability, we run the experiments on a high-performance computing cluster with an Intel Xeon Gold 6230 CPU, 16GB of RAM, and an NVIDIA Tesla V100 SXM2 GPU with 32GB of VRAM.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We train the TGN model using the TGN-attn configuration from the original paper (Rossi et al., 2020). [...] We configure GreeDy with a candidate event limit of 64, sampling up to 10 events per iteration. For CoDy, we also limit the search space to 64 events, with a maximum of 300 iterations and α = 2/3 to emphasize exploration over exploitation. Appendix G shows a sensitivity analysis on these parameters.
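The quoted pseudocode follows a Monte-Carlo-style tree search over subsets of past events. The sketch below illustrates that loop structure (select, simulate, expand, backpropagate) under stated assumptions: `model_score(events)` stands in for evaluating the TGNN f on the input graph with `events` removed, `select_node` is a simplified stand-in for the paper's selection policy δ, and the real CoDy node state and stopping condition are richer than shown here.

```python
class Node:
    """Node in the counterfactual search tree; `events` is the subset of
    past events whose removal this node represents."""
    def __init__(self, events=frozenset(), parent=None):
        self.events = events
        self.parent = parent
        self.children = []
        self.expanded = False
        self.visits = 0
        self.score = 0.0  # best counterfactual impact seen below this node


def select_node(root):
    """Illustrative selection policy (a stand-in for delta): descend to the
    highest-scoring child, mildly discounting frequently visited ones."""
    node = root
    while node.expanded and node.children:
        node = max(node.children, key=lambda c: c.score - 0.01 * c.visits)
    return node


def search(model_score, candidate_events, select=select_node, max_iters=300):
    """Sketch of Algorithm 1: tree search for a subset of events whose
    removal moves the model's prediction as far as possible from p_orig."""
    p_orig = model_score(frozenset())       # prediction on the unmodified graph
    root = Node()
    best_impact, best_events = 0.0, frozenset()
    for _ in range(max_iters):
        node = select(root)                 # selection
        p_new = model_score(node.events)    # simulation: re-run the model
        for e in candidate_events - node.events:   # expansion
            node.children.append(Node(node.events | {e}, parent=node))
        node.expanded = True
        impact = p_orig - p_new             # how far the prediction moved
        if impact > best_impact:            # track best explanation found
            best_impact, best_events = impact, node.events
        n = node                            # backpropagation
        while n is not None:
            n.visits += 1
            n.score = max(n.score, impact)
            n = n.parent
    return best_events
```

With a toy scoring function in which removing a single event flips the prediction, the search converges on exactly that event, mirroring how the real algorithm seeks a minimal counterfactual explanation.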
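The hyperparameters quoted in the setup row can be collected in a small configuration sketch. The key names here are hypothetical; only the values (64, 10, 300, 2/3) come from the paper.

```python
# Hypothetical key names; the values are those quoted in the experiment setup.
GREEDY_CONFIG = {
    "candidate_event_limit": 64,        # candidate events considered by GreeDy
    "events_sampled_per_iteration": 10, # events sampled per greedy step
}

CODY_CONFIG = {
    "candidate_event_limit": 64,  # same restricted search space as GreeDy
    "max_iterations": 300,        # search budget it_max in Algorithm 1
    "alpha": 2 / 3,               # biases selection toward exploration
}
```

Keeping both methods to the same 64-event search space makes their runtime and explanation quality directly comparable.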