CEGA: A Cost-Effective Approach for Graph-Based Model Extraction and Acquisition

Authors: Zebin Wang, Menghan Lin, Bolin Shen, Ken Anderson, Molei Liu, Tianxi Cai, Yushun Dong

ICML 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "Extensive experiments on benchmark graph datasets demonstrate our superiority over comparable baselines on accuracy, fidelity, and F1 score under strict query-size constraints. These results highlight both the susceptibility of deployed GNNs to extraction attacks and the promise of ethical, efficient GNN acquisition methods to support low-resource research environments. Our implementation is publicly available at https://github.com/LabRAI/CEGA."
Researcher Affiliation: Academia. "(1) Department of Biostatistics, T. H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA; (2) Department of Statistics, Florida State University, Tallahassee, Florida, USA; (3) Department of Computer Science, Florida State University, Tallahassee, Florida, USA; (4) Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, New York, USA. Correspondence to: Tianxi Cai <EMAIL>, Yushun Dong <EMAIL>."
Pseudocode: Yes. "We summarize the algorithmic routine of CEGA in Algorithm 1." (Algorithm 1: The Proposed Framework of CEGA)
Open Source Code: Yes. "Our implementation is publicly available at https://github.com/LabRAI/CEGA."
Open Datasets: Yes. "Our experiments are conducted on 6 widely used benchmark datasets: (1) coauthorship networks, where nodes are authors and edges represent collaboration, including Coauthor-CS and Coauthor-Physics; (2) co-purchase graphs, with nodes as products and edges linking items frequently purchased together, including Amazon-Computer and Amazon-Photo; and (3) academic citation and collaboration networks, including Cora-Full and DBLP. These datasets vary in size, complexity, and node attributes, providing a comprehensive basis for evaluating CEGA's performance. The dataset statistics are provided in Appendix B.1."
Dataset Splits: Yes. "If training and test sets are not provided, we randomly select 60% of the nodes for training and use the remaining 40% for testing."
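The 60/40 split rule quoted above can be sketched as follows; the function name, seed, and NumPy-based implementation are illustrative assumptions, not the paper's code:

```python
import numpy as np

def split_nodes(num_nodes, train_frac=0.6, seed=0):
    """Randomly assign train_frac of the nodes to training and the
    rest to testing, mirroring the 60%/40% split described above.
    (Hypothetical helper; the seed is fixed only for reproducibility.)"""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)      # random ordering of node ids
    cut = int(train_frac * num_nodes)      # boundary between train and test
    return perm[:cut], perm[cut:]

train_idx, test_idx = split_nodes(1000)    # e.g. 600 train / 400 test nodes
```

Because the split is over node indices rather than edges, the full graph structure remains available during training; only the labels of test nodes are held out.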
Hardware Specification: Yes. "All experiments are conducted on two NVIDIA RTX 6000 Ada GPUs."
Software Dependencies: No. The paper mentions training GCN models and using active-learning techniques (AGE, GRAIN) but does not specify version numbers for any software libraries, frameworks (such as PyTorch or TensorFlow), or programming languages used in the implementation.
Experiment Setup: Yes. "Initially, we train a target model, f_T, for 1000 epochs with a learning rate of 1e-3... In the initialization step, we randomly select 2 nodes from each class across all the tested datasets, resulting in a total of 2C nodes... The total budget is capped at 20C. ...For our proposed method, in cycle γ, CEGA queries κ = 1 node and trains a 2-layer GCN model with {V_γ, G_a} for E = 1 epoch. In the analysis of node diversity, we set the weight ρ = 0.8... We set the initial weight coefficients as α_1 = α_2 = α_3 = 0.2, the initial weight gap between R_γ^1 and R_γ^2 as 0.6, and the curvature of the weight changes as λ = 0.3. After the node selection process, we train a 2-layer GCN with a hidden dimension of 16. The model is optimized with a learning rate of 1e-3 and trained for 1000 epochs. For AGE, we apply a warm-up period of 400 epochs."
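To make the "2-layer GCN with a hidden dimension of 16" concrete, here is a minimal NumPy sketch of the forward pass of a standard 2-layer GCN in the style of Kipf and Welling; the toy graph, random weights, and function names are placeholder assumptions, not the paper's trained extraction model:

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalize the adjacency with self-loops:
    D^{-1/2} (A + I) D^{-1/2}, as in a standard GCN."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_forward(A, X, W1, W2):
    """Forward pass of a 2-layer GCN: softmax(Â ReLU(Â X W1) W2).
    W1 maps features to the 16-dim hidden layer; W2 maps to class logits."""
    A_norm = normalize_adj(A)
    H = np.maximum(A_norm @ X @ W1, 0.0)        # hidden layer, ReLU
    Z = A_norm @ H @ W2                          # class logits
    Z = Z - Z.max(axis=1, keepdims=True)         # numerically stable softmax
    P = np.exp(Z)
    return P / P.sum(axis=1, keepdims=True)

# Toy usage: 3-node path graph, one-hot features, 16 hidden units, 4 classes.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.eye(3)
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 16))
W2 = rng.standard_normal((16, 4))
probs = gcn_forward(A, X, W1, W2)   # shape (3, 4), rows sum to 1
```

In the extraction setting described above, a surrogate of this form would be retrained each cycle γ on the queried node set V_γ; the optimizer, loss, and query-selection weights (α, ρ, λ) from the quoted setup are omitted here for brevity.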