Node-Level Data Valuation on Graphs

Authors: Simone Antonelli, Aleksandar Bojchevski

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a comprehensive study of data valuation approaches applied to graph-structured models such as graph neural networks in a semi-supervised transductive setting. ... Overall, our results show that approaches accounting for subgraphs instead of single-node contributions yield more accurate data values. We demonstrate their usefulness for downstream applications: i) finding highly influential nodes...; ii) spotting brittle predictions...; iii) detecting poisoned (mislabeled) data; iv) estimating counterfactuals...; and v) visualizations...
Researcher Affiliation | Academia | Simone Antonelli (EMAIL), CISPA Helmholtz Center for Information Security; Aleksandar Bojchevski (EMAIL), University of Cologne.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is provided at https://github.com/siantonelli/graph_valuation.
Open Datasets | Yes | We evaluate methods on the largest connected component (LCC) of different citation graphs (Citeseer, Cora-ML, PubMed, and Co-Physics) and co-purchase graphs (Photo and Computers).
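The LCC preprocessing the quote refers to can be sketched without any graph library. Below is a minimal, stdlib-only illustration on a plain undirected edge list; the function name and interface are illustrative, not taken from the authors' repository:

```python
from collections import defaultdict, deque

def largest_connected_component(num_nodes, edges):
    """Return the node set of the largest connected component.

    `edges` is an iterable of (u, v) pairs; the graph is treated as
    undirected, matching the usual preprocessing of citation graphs.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    seen, best = set(), set()
    for start in range(num_nodes):
        if start in seen:
            continue
        # BFS to collect one full component starting from `start`.
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    comp.add(nb)
                    queue.append(nb)
        if len(comp) > len(best):
            best = comp
    return best

# Toy graph: nodes 0-2 form one component, 3-4 another, 5 is isolated.
lcc = largest_connected_component(6, [(0, 1), (1, 2), (3, 4)])
print(sorted(lcc))  # → [0, 1, 2]
```

In practice one would compute this once per dataset and re-index the surviving nodes before building feature and label tensors.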
Dataset Splits | Yes | Following Shchur et al. (2018), we employ stratified sampling to select the training nodes (20 per class) along with an equal number of validation nodes. The remaining nodes serve as the test set.
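The split protocol (20 training and 20 validation nodes per class, remainder as test) is simple to reproduce. A minimal sketch, assuming integer node labels; the function name and seed handling are illustrative, not the authors' implementation:

```python
import random
from collections import defaultdict

def stratified_split(labels, per_class=20, seed=0):
    """Sample `per_class` training and `per_class` validation nodes
    from every class; all remaining nodes form the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, label in enumerate(labels):
        by_class[label].append(node)

    train, val = [], []
    for nodes in by_class.values():
        rng.shuffle(nodes)
        train.extend(nodes[:per_class])
        val.extend(nodes[per_class:2 * per_class])

    held_out = set(train) | set(val)
    test = [n for n in range(len(labels)) if n not in held_out]
    return train, val, test

# Two classes of 100 nodes each → 40 train, 40 val, 120 test.
labels = [0] * 100 + [1] * 100
train, val, test = stratified_split(labels, per_class=20)
print(len(train), len(val), len(test))  # → 40 40 120
```

Averaging over runs (as the paper does) would correspond to repeating this with different seeds.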
Hardware Specification | Yes | Experiments are performed on a cluster of 136 nodes, each equipped with 2x AMD EPYC 9654 processors (96 cores, 2.4-3.7 GHz) and 768 GB of RAM. For larger datasets, such as Co-Physics, we switch to a bigger partition with 8 nodes (same CPU specifications) but 3 TB of RAM to accommodate higher memory requirements.
Software Dependencies | No | The paper mentions 'scikit-learn' and 'joblib' with citations, but does not provide specific version numbers for these libraries (e.g., 'scikit-learn 0.24' or 'joblib 1.0.1').
Experiment Setup | Yes | We train GCN and GAT using the Adam optimizer with a learning rate of 0.01 for 3000 epochs. Early stopping is based on the validation loss, with 50 epochs of patience. Results are averaged over 10 runs for each model, except for larger datasets where we use 5 runs.
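The stopping rule quoted above (up to 3000 epochs, patience of 50 epochs on the validation loss) can be sketched framework-agnostically. In this illustration `step_fn` is a hypothetical placeholder for one epoch of Adam-based GCN/GAT training that returns the validation loss; it is not the authors' code:

```python
def train_with_early_stopping(step_fn, max_epochs=3000, patience=50):
    """Run `step_fn(epoch) -> val_loss` for up to `max_epochs`,
    stopping once the validation loss has not improved for
    `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch in range(max_epochs):
        val_loss = step_fn(epoch)
        if val_loss < best_loss:
            # New best validation loss: reset the patience window.
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss

# Synthetic loss curve: improves until epoch 95, then plateaus,
# so training halts 50 epochs after the last improvement.
best_epoch, best_loss = train_with_early_stopping(
    lambda e: max(100 - e, 5), max_epochs=3000, patience=50)
print(best_epoch, best_loss)  # → 95 5
```

A real run would additionally restore the model weights saved at `best_epoch` before evaluating on the test set.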