Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Implicit degree bias in the link prediction task

Authors: Rachith Aiyappa, Xin Wang, Munjung Kim, Ozgur Can Seckin, Yong-Yeol Ahn, Sadamori Kojaku

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the impact of degree bias using the Price graph with N = 10⁵ nodes and M = 10⁶ edges, which follows a power-law degree distribution p(k) ∝ k⁻³ (Fig. 1B). We uniformly sample β = 0.25 of the edges from E as positive edges, together with an equal number of unconnected node pairs sampled uniformly from V. The degree distributions for nodes in the positive and negative edges align with ppos(k) and pneg(k), respectively, confirming the sampling bias due to node degree.
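The Price graph quoted above is a preferential-attachment growth model. A minimal undirected sketch in the Barabási–Albert style, which produces the same p(k) ∝ k⁻³ degree distribution, is shown below; the helper name `price_graph` and the stdlib-only implementation are assumptions for illustration, not the authors' code.

```python
import random

def price_graph(n, m, seed=0):
    """Grow a preferential-attachment graph: each new node attaches m edges
    to existing nodes chosen proportionally to their current degree, which
    yields a power-law degree distribution p(k) ~ k^-3 for undirected growth."""
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))  # initial seed nodes
    degree_list = []          # node list weighted by current degree
    for v in range(m, n):
        edges.extend((v, t) for t in targets)
        degree_list.extend(targets)     # each target gained one edge endpoint
        degree_list.extend([v] * m)     # new node v gained m edge endpoints
        # Sample m distinct targets for the next node, degree-proportionally.
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(degree_list))
        targets = list(targets)
    return edges
```

At the paper's scale (N = 10⁵ nodes, M = 10⁶ edges) this would correspond to m = 10 edges per new node.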
Researcher Affiliation Academia ¹Center for Complex Networks and Systems Research, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA ²School of Systems Science and Industrial Engineering, Binghamton University, Binghamton, NY, USA. Correspondence to: Sadamori Kojaku <EMAIL>.
Pseudocode Yes Algorithm 1 Degree-corrected link prediction benchmark
Input: Graph G(V, E), sampling fraction β ∈ [0, 1] for positive edges
Output: Set of negative edges Eneg and set of positive edges Epos
Generate Epos by randomly sampling a β fraction of edges in E
Initialize Eneg ← ∅
Create a node list L where each node i ∈ V with degree kᵢ appears kᵢ times
while |Eneg| < |Epos| do
  Randomly select two nodes i, j from L with replacement
  if (i, j) ∉ E and (i, j) ∉ Eneg and i ≠ j then
    Eneg ← Eneg ∪ {(i, j)}
  end if
end while
return Eneg, Epos
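The algorithm in this cell can be sketched in plain Python as follows; this is a minimal stdlib illustration, not the authors' released package, and the helper name is an assumption.

```python
import random

def degree_corrected_benchmark(edges, beta, seed=0):
    """Split a graph into positive test edges and degree-matched negatives.

    edges: list of (i, j) tuples (undirected, no self-loops)
    beta:  fraction of edges sampled as positive test edges
    Negatives are drawn by picking endpoints from a list in which each node
    appears once per incident edge, i.e. proportionally to its degree, so the
    negative-edge degree distribution matches that of the positive edges.
    """
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}

    # Positive edges: a uniform beta-fraction of E.
    n_pos = int(beta * len(edges))
    E_pos = rng.sample(edges, n_pos)

    # Node list L where node i with degree k_i appears k_i times.
    L = [v for e in edges for v in e]

    E_neg = set()
    while len(E_neg) < n_pos:
        i, j = rng.choice(L), rng.choice(L)
        if i != j and frozenset((i, j)) not in edge_set \
                and frozenset((i, j)) not in E_neg:
            E_neg.add(frozenset((i, j)))
    return E_pos, [tuple(e) for e in E_neg]
```

The rejection loop mirrors the pseudocode's `while |Eneg| < |Epos|` condition: candidate pairs are redrawn until enough non-edges distinct from existing negatives are found.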
Open Source Code Yes A Python package for the degree-corrected benchmark will be made available on GitHub. The source data, code, and workflow for our experiments are available on GitHub at https://github.com/skojaku/degree-corrected-link-prediction-benchmark.
Open Datasets Yes We also considered networks from the OGB benchmark (Hu et al., 2020a). The largest networks in our corpus (number of nodes > 10⁵) are sourced from Netzschleuder (Peixoto, 2020), and the remaining networks are obtained from the authors of Ref. (Erkol et al., 2019).
Dataset Splits Yes First, a fraction β of edges is randomly sampled from the edge set E as positive edges. Second, an equal number of unconnected node pairs is randomly sampled with replacement from the node-set V as negative edges. We set the test edge fraction to β = 0.25 and repeat the experiment 5 times. We used held-out validation to tune the number of dimensions in each hidden layer (32, 64) and dropout rate (0.2, 0.5) with the validation set consisting of 10% of the edges.
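For contrast with the degree-corrected algorithm, the standard split described in this cell — uniform positive-edge sampling plus uniformly sampled unconnected pairs — can be sketched as follows (a hypothetical stdlib-only helper, not the authors' code):

```python
import random

def uniform_link_prediction_split(nodes, edges, beta, seed=0):
    """Standard benchmark split: a beta-fraction of edges as positives and an
    equal number of unconnected pairs, drawn uniformly from the node set V
    with replacement, as negatives (this is the split the paper shows to be
    degree-biased)."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    n_pos = int(beta * len(edges))
    E_pos = rng.sample(edges, n_pos)
    E_neg = []
    while len(E_neg) < n_pos:
        i, j = rng.choice(nodes), rng.choice(nodes)  # uniform over V
        if i != j and frozenset((i, j)) not in edge_set:
            E_neg.append((i, j))
    return E_pos, E_neg
```

Because endpoints are drawn uniformly over V rather than proportionally to degree, low-degree nodes are over-represented among negatives relative to positives — the sampling bias the paper's degree-corrected benchmark removes.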
Hardware Specification No We acknowledge NVIDIA Corporation for their GPU resources and express our gratitude to Kaleb Smith from the NVIDIA SAE-Higher Education Research team for his help with GPU optimization. No specific hardware models or detailed specifications were provided beyond a general mention of "GPU resources".
Software Dependencies Yes We tested a variety of graph neural network (GNN) architectures for link prediction, leveraging the PyTorch Geometric library (Fey & Lenssen, 2019). We fit a log-normal distribution to the degree distribution of the graphs by using the moment method implemented in scipy.stats.lognorm (Virtanen et al., 2020). We fit a power-law distribution to the degree distribution of the graphs by using the maximum likelihood method implemented in the powerlaw package (Alstott et al., 2014). We compute the RBO score by using the rbo package (rbo). We fit the SBMs using the graph-tool package (Gra). We use the scikit-learn package (Pedregosa et al., 2011) to perform the logistic regression...
Experiment Setup Yes We set the test edge fraction to β = 0.25 and repeat the experiment 5 times. For LRW and LPI, we set the hyperparameter ϵ = 0.001 as per previous studies (Lü et al., 2009; Liu & Lü, 2010). The MLP consists of two hidden layers coupled with a Leaky ReLU activation function, and we used held-out validation to tune the number of dimensions in each hidden layer (32, 64) and dropout rate (0.2, 0.5), with the validation set consisting of 10% of the edges. We used the Adam optimizer at a learning rate of 0.001. For all methods, we set the number of embedding dimensions to 128. For LINE, node2vec, and DeepWalk, we set the number of walkers to 40 and the walk length to 80, following Ref. (Kojaku et al., 2024). We used held-out validation to tune the number of hidden layers (1 or 2) and the number of dimensions in each hidden layer (64, 128, or 256), with the validation set consisting of 10% of the edges. We use ReLU activation and a dropout rate of 0.2. The node features are the 64 principal eigenvectors of the adjacency matrix, and we extend the feature vector by adding a 64-dimensional vector with each element generated from an independent Gaussian distribution with mean 0 and standard deviation 1, following Refs. (Sato et al., 2021; Abboud et al., 2020). We train GNNs on the link prediction task for 250 epochs with a dropout rate of 0.2, using the Adam optimizer at a learning rate of 0.01. We use the LinkNeighborLoader from PyTorch Geometric to generate training mini-batches. This loader samples both positive and negative edges, along with 30 immediate neighbors and 10 secondary neighbors sampled by random walks for each node involved in these edges (Hamilton et al., 2017). The batch size is set to 5000.
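The dense settings in this cell can be collected into a single configuration sketch for readability; the dictionary keys are illustrative labels, not the authors' actual configuration schema, and only the values come from the quoted text.

```python
# Reported experimental settings, grouped by component (key names are
# illustrative, not from the authors' code).
EXPERIMENT_CONFIG = {
    "test_edge_fraction": 0.25,       # beta
    "n_repeats": 5,
    "validation_fraction": 0.10,      # held-out edges for tuning
    "lrw_lpi_epsilon": 1e-3,
    "embedding_dim": 128,             # all embedding methods
    "random_walk": {                  # LINE, node2vec, DeepWalk
        "num_walkers": 40,
        "walk_length": 80,
    },
    "mlp": {
        "hidden_layers": 2,
        "hidden_dim_grid": (32, 64),
        "dropout_grid": (0.2, 0.5),
        "activation": "LeakyReLU",
        "optimizer": "Adam",
        "learning_rate": 1e-3,
    },
    "gnn": {
        "hidden_layers_grid": (1, 2),
        "hidden_dim_grid": (64, 128, 256),
        "activation": "ReLU",
        "dropout": 0.2,
        "epochs": 250,
        "optimizer": "Adam",
        "learning_rate": 1e-2,
        "batch_size": 5000,
        "neighbors_per_hop": (30, 10),  # LinkNeighborLoader sampling
        "node_features": "64 adjacency eigenvectors + 64 N(0, 1) dims",
    },
}
```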