CEGA: A Cost-Effective Approach for Graph-Based Model Extraction and Acquisition

Authors: Zebin Wang, Menghan Lin, Bolin Shen, Ken Anderson, Molei Liu, Tianxi Cai, Yushun Dong

ICML 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "Extensive experiments on benchmark graph datasets demonstrate our superiority over comparable baselines on accuracy, fidelity, and F1 score under strict query-size constraints. These results highlight both the susceptibility of deployed GNNs to extraction attacks and the promise of ethical, efficient GNN acquisition methods to support low-resource research environments. Our implementation is publicly available at https://github.com/LabRAI/CEGA."
Researcher Affiliation: Academia. "(1) Department of Biostatistics, T. H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA; (2) Department of Statistics, Florida State University, Tallahassee, Florida, USA; (3) Department of Computer Science, Florida State University, Tallahassee, Florida, USA; (4) Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, New York, USA. Correspondence to: Tianxi Cai <EMAIL>, Yushun Dong <EMAIL>."
Pseudocode: Yes. "We summarize the algorithmic routine of CEGA in Algorithm 1." (Algorithm 1: The Proposed Framework of CEGA)
Open Source Code: Yes. "Our implementation is publicly available at https://github.com/LabRAI/CEGA."
Open Datasets: Yes. "Our experiments are conducted on 6 widely used benchmark datasets: (1) coauthorship networks, where nodes are authors and edges represent collaboration, including Coauthor-CS and Coauthor-Physics; (2) co-purchase graphs, with nodes as products and edges linking items frequently purchased together, including Amazon-Computer and Amazon-Photo; and (3) academic citation and collaboration networks, including Cora-Full and DBLP. These datasets vary in size, complexity, and node attributes, providing a comprehensive basis for evaluating CEGA's performance. The dataset statistics are provided in Appendix B.1."
Dataset Splits: Yes. "If training and test sets are not provided, we randomly select 60% of the nodes for training and use the remaining 40% for testing."
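The 60/40 split rule quoted above can be sketched as follows; the function name, seed, and NumPy-based implementation are illustrative assumptions, not the paper's code:

```python
import numpy as np

def split_nodes(num_nodes, train_frac=0.6, seed=0):
    """Randomly assign train_frac of the nodes to training and the
    rest to testing, mirroring the 60%/40% split described above.
    (Hypothetical helper; the seed is fixed only for reproducibility.)"""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)      # random ordering of node ids
    cut = int(train_frac * num_nodes)      # boundary between train and test
    return perm[:cut], perm[cut:]

train_idx, test_idx = split_nodes(1000)    # e.g. 600 train / 400 test nodes
```

Because the split is over node indices rather than edges, the full graph structure remains available during training; only the labels of test nodes are held out.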
Hardware Specification: Yes. "All experiments are conducted on two NVIDIA RTX 6000 Ada GPUs."
Software Dependencies: No. The paper mentions training GCN models and using active-learning techniques (AGE, GRAIN) but does not specify version numbers for any software libraries, frameworks (such as PyTorch or TensorFlow), or programming languages used in the implementation.
Experiment Setup: Yes. "Initially, we train a target model, f_T, for 1000 epochs with a learning rate of 1e-3... In the initialization step, we randomly select 2 nodes from each class across all the tested datasets, resulting in a total of 2C nodes... The total budget is capped at 20C. ...For our proposed method, in cycle γ, CEGA queries κ = 1 node and trains a 2-layer GCN model with {V_γ, G_a} for E = 1 epoch. In the analysis of node diversity, we set the weight ρ = 0.8... We set the initial weight coefficients as α_1 = α_2 = α_3 = 0.2, the initial weight gap between R_γ^1 and R_γ^2 as 0.6, and the curvature of the weight changes as λ = 0.3. After the node selection process, we train a 2-layer GCN with a hidden dimension of 16. The model is optimized with a learning rate of 1e-3 and trained for 1000 epochs. For AGE, we apply a warm-up period of 400 epochs."
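To make the "2-layer GCN with a hidden dimension of 16" concrete, here is a minimal NumPy sketch of the forward pass of a standard 2-layer GCN in the style of Kipf and Welling; the toy graph, random weights, and function names are placeholder assumptions, not the paper's trained extraction model:

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalize the adjacency with self-loops:
    D^{-1/2} (A + I) D^{-1/2}, as in a standard GCN."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_forward(A, X, W1, W2):
    """Forward pass of a 2-layer GCN: softmax(Â ReLU(Â X W1) W2).
    W1 maps features to the 16-dim hidden layer; W2 maps to class logits."""
    A_norm = normalize_adj(A)
    H = np.maximum(A_norm @ X @ W1, 0.0)        # hidden layer, ReLU
    Z = A_norm @ H @ W2                          # class logits
    Z = Z - Z.max(axis=1, keepdims=True)         # numerically stable softmax
    P = np.exp(Z)
    return P / P.sum(axis=1, keepdims=True)

# Toy usage: 3-node path graph, one-hot features, 16 hidden units, 4 classes.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.eye(3)
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 16))
W2 = rng.standard_normal((16, 4))
probs = gcn_forward(A, X, W1, W2)   # shape (3, 4), rows sum to 1
```

In the extraction setting described above, a surrogate of this form would be retrained each cycle γ on the queried node set V_γ; the optimizer, loss, and query-selection weights (α, ρ, λ) from the quoted setup are omitted here for brevity.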