Node-Level Data Valuation on Graphs

Authors: Simone Antonelli, Aleksandar Bojchevski

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a comprehensive study of data valuation approaches applied to graph-structured models such as graph neural networks in a semi-supervised transductive setting. ... Overall, our results show that approaches accounting for subgraphs instead of single-node contributions yield more accurate data values. We demonstrate their usefulness for downstream applications: i) finding highly influential nodes...; ii) spotting brittle predictions...; iii) detecting poisoned (mislabeled) data; iv) estimating counterfactuals...; and v) visualizations...
Researcher Affiliation | Academia | Simone Antonelli (EMAIL), CISPA Helmholtz Center for Information Security; Aleksandar Bojchevski (EMAIL), University of Cologne.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is provided at https://github.com/siantonelli/graph_valuation.
Open Datasets | Yes | We evaluate methods on the largest connected component (LCC) of different citation graphs (Citeseer, Cora-ML, PubMed, and Co-Physics) and co-purchase graphs (Photo and Computers).
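The LCC preprocessing the quote refers to can be sketched without any graph library. Below is a minimal, stdlib-only illustration on a plain undirected edge list; the function name and interface are illustrative, not taken from the authors' repository:

```python
from collections import defaultdict, deque

def largest_connected_component(num_nodes, edges):
    """Return the node set of the largest connected component.

    `edges` is an iterable of (u, v) pairs; the graph is treated as
    undirected, matching the usual preprocessing of citation graphs.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    seen, best = set(), set()
    for start in range(num_nodes):
        if start in seen:
            continue
        # BFS to collect one full component starting from `start`.
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    comp.add(nb)
                    queue.append(nb)
        if len(comp) > len(best):
            best = comp
    return best

# Toy graph: nodes 0-2 form one component, 3-4 another, 5 is isolated.
lcc = largest_connected_component(6, [(0, 1), (1, 2), (3, 4)])
print(sorted(lcc))  # → [0, 1, 2]
```

In practice one would compute this once per dataset and re-index the surviving nodes before building feature and label tensors.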
Dataset Splits | Yes | Following Shchur et al. (2018), we employ stratified sampling to select the training nodes (20 per class) along with an equal number of validation nodes. The remaining nodes serve as the test set.
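The split protocol (20 training and 20 validation nodes per class, remainder as test) is simple to reproduce. A minimal sketch, assuming integer node labels; the function name and seed handling are illustrative, not the authors' implementation:

```python
import random
from collections import defaultdict

def stratified_split(labels, per_class=20, seed=0):
    """Sample `per_class` training and `per_class` validation nodes
    from every class; all remaining nodes form the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, label in enumerate(labels):
        by_class[label].append(node)

    train, val = [], []
    for nodes in by_class.values():
        rng.shuffle(nodes)
        train.extend(nodes[:per_class])
        val.extend(nodes[per_class:2 * per_class])

    held_out = set(train) | set(val)
    test = [n for n in range(len(labels)) if n not in held_out]
    return train, val, test

# Two classes of 100 nodes each → 40 train, 40 val, 120 test.
labels = [0] * 100 + [1] * 100
train, val, test = stratified_split(labels, per_class=20)
print(len(train), len(val), len(test))  # → 40 40 120
```

Averaging over runs (as the paper does) would correspond to repeating this with different seeds.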
Hardware Specification | Yes | Experiments are performed on a cluster of 136 nodes, each equipped with 2x AMD EPYC 9654 processors (96 cores, 2.4-3.7 GHz) and 768 GB of RAM. For larger datasets, such as Co-Physics, we switch to a bigger partition with 8 nodes (same CPU specifications) but 3 TB of RAM to accommodate higher memory requirements.
Software Dependencies | No | The paper mentions 'scikit-learn' and 'joblib' with citations, but does not provide specific version numbers for these libraries (e.g., 'scikit-learn 0.24' or 'joblib 1.0.1').
Experiment Setup | Yes | We train GCN and GAT using the Adam optimizer with a learning rate of 0.01 for 3000 epochs. Early stopping is based on the validation loss, with 50 epochs of patience. Results are averaged over 10 runs for each model, except for larger datasets where we use 5 runs.
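The stopping rule quoted above (up to 3000 epochs, patience of 50 epochs on the validation loss) can be sketched framework-agnostically. In this illustration `step_fn` is a hypothetical placeholder for one epoch of Adam-based GCN/GAT training that returns the validation loss; it is not the authors' code:

```python
def train_with_early_stopping(step_fn, max_epochs=3000, patience=50):
    """Run `step_fn(epoch) -> val_loss` for up to `max_epochs`,
    stopping once the validation loss has not improved for
    `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch in range(max_epochs):
        val_loss = step_fn(epoch)
        if val_loss < best_loss:
            # New best validation loss: reset the patience window.
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss

# Synthetic loss curve: improves until epoch 95, then plateaus,
# so training halts 50 epochs after the last improvement.
best_epoch, best_loss = train_with_early_stopping(
    lambda e: max(100 - e, 5), max_epochs=3000, patience=50)
print(best_epoch, best_loss)  # → 95 5
```

A real run would additionally restore the model weights saved at `best_epoch` before evaluating on the test set.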