Random Walk Diffusion for Efficient Large-Scale Graph Generation
Authors: Tobias Bernecker, Ghalia Rehawi, Francesco Paolo Casale, Janine Knauer-Arloth, Annalisa Marsico
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we compare ARROW-Diff to six baseline methods for graph generation that can scale to large graphs. These methods cover the different graph generation techniques explained in Section 1 and Section 2. To evaluate the performance of all methods, we use five different real-world citation graph datasets, each containing a single undirected graph; all are considered large graphs with varying numbers of nodes and edges. We also evaluate the performance of ARROW-Diff in generating more structured and non-scale-free graphs. |
| Researcher Affiliation | Academia | Tobias Bernecker (EMAIL): Computational Health Center, Helmholtz Munich; School of Computation, Information and Technology, Technical University of Munich. Ghalia Rehawi (EMAIL): Computational Health Center, Helmholtz Munich; Max Planck Institute of Psychiatry, Munich; School of Life Sciences, Technical University of Munich. Francesco Paolo Casale (EMAIL): Computational Health Center, Helmholtz Munich; Helmholtz Pioneer Campus, Helmholtz Munich; School of Computation, Information and Technology, Technical University of Munich. Janine Knauer-Arloth (EMAIL): Max Planck Institute of Psychiatry, Munich; Computational Health Center, Helmholtz Munich. Annalisa Marsico (EMAIL): Computational Health Center, Helmholtz Munich |
| Pseudocode | Yes | Algorithm 1 (Optimizing Random Walk OA-ARDMs): 1: Input: a random walk x ∈ V^D, the number of nodes N = \|V\|, and a network f. 2: Output: ELBO L_t. 3: Sample t ∼ U(1, …, D). 4: Sample σ ∼ U(S_D). 5: Compute m ← (σ < t). 6: Compute i ← m ⊙ x + (1 − m) ⊙ ((N + 1) · 1_D). 7: l ← (1 − m) ⊙ log C(x \| f(i, t)). 8: L_t ← 1/(D − t + 1) · sum(l). Algorithm 2 (ARROW-Diff Graph Generation): 1: Input: a trained OA-ARDM, a trained GNN, the node set V, features X, and degrees d_G of an original graph G with the same node ordering as used for training the OA-ARDM, and the number of steps L to generate the graph. 2: Output: a generated graph Ĝ = (V, Ê). 3: Start with an empty graph Ĝ = (V, Ê), where Ê = ∅. 4: Set the initial start nodes V_start to all nodes in the graph: V_start = V. 5: for l = 1 to L do 6: Sample one random walk for each start node n ∈ V_start using the OA-ARDM: R. 7: Compute edge proposals from R: Ê_proposals := {(n_i, n_j) ∈ R \| n_i, n_j ∈ V, i ≠ j}. 8: Run the GNN on G′ = (V, Ê ∪ Ê_proposals, X) to obtain probabilities for all edges in Ê ∪ Ê_proposals. 9: Sample valid edges Ê_valid from Ê ∪ Ê_proposals according to the edge probabilities. 10: Edge update: Ê ← Ê_valid. 11: if l < L then 12: Compute the node degrees d_Ĝ of Ĝ based on Ê. 13: Compute d := max(0, d_G − d_Ĝ). 14: Compute a probability for each node n ∈ V: p(n) = d_n / max(d). 15: Sample start nodes V_start from V according to p(n) using a Bernoulli distribution. 16: end if 17: end for |
| Open Source Code | Yes | The full list of hyper-parameters for training the OA-ARDMs and the GNN models can be found in our code repository https://github.com/marsico-lab/arrow-diff. |
| Open Datasets | Yes | To evaluate ARROW-Diff, we use five citation graph datasets: Cora-ML (McCallum et al., 2000), Cora (McCallum et al., 2000), CiteSeer (Giles et al., 1998), DBLP (Pan et al., 2016), and PubMed (Sen et al., 2008). For Cora-ML and Cora, we use the pre-processed versions from Bojchevski & Günnemann (2018). |
| Dataset Splits | Yes | Similar to Bojchevski et al. (2018), we split the edges of each graph into training, validation, and test parts, and use only the training edges to train the baseline methods, as well as the OA-ARDM and GNN for ARROW-Diff. |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) are provided. The paper only mentions 'GPU memory' in the context of reducing batch size. |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned. The paper refers to architectures like U-Net, ResNet, BERT-like, and GCN, but not the software libraries or their versions used for implementation. |
| Experiment Setup | Yes | Specifically, we use a U-Net (Ronneberger et al., 2015) architecture similar to Ho et al. (2020) with one ResNet block and two levels for the down- and up-sampling processes. Similar to Bojchevski et al. (2018) we set the random walk length, which equals the number of diffusion steps, to D = 16 and sample batches of random walks from the training split comprising edges from the original graph. Each node in the random walks, as well as each diffusion time step, is then represented using 64-dimensional learnable embeddings and passed as input to the U-Net. For the GNN component, we train a two-layer GCN (Kipf & Welling, 2016a) to predict the validity of edges based on perturbed versions of the training split of the input graph. ... The positional encodings are 64-dimensional for Cora-ML and CiteSeer, and 128-dimensional for Cora, DBLP, and PubMed. The GCN uses node embeddings of sizes 100 and 10 in the hidden and output layer, respectively. The full list of hyper-parameters for training the OA-ARDMs and the GNN models can be found in our code repository https://github.com/marsico-lab/arrow-diff. |
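The ELBO computation in Algorithm 1 can be sketched in plain NumPy. This is an illustrative reconstruction, not the authors' implementation: `f` is a stand-in for the trained network (here assumed to return per-position log-probabilities of shape `(D, N)`), and nodes are 0-indexed so index `N` plays the role of the absorbing mask token `(N + 1)` from the 1-indexed pseudocode.

```python
import numpy as np

def oa_ardm_elbo_step(x, N, f, rng=None):
    """One training-step ELBO term for a random walk x of length D over N
    nodes, following Algorithm 1. `f(i, t)` is a hypothetical network stub
    returning log-probabilities of shape (D, N)."""
    rng = np.random.default_rng(rng)
    D = len(x)
    t = int(rng.integers(1, D + 1))       # t ~ U(1, ..., D)
    sigma = rng.permutation(D) + 1        # sigma ~ U(S_D), a random order
    m = (sigma < t).astype(int)           # mask: positions revealed before step t
    mask_token = N                        # absorbing state (0-indexed nodes)
    i = m * x + (1 - m) * mask_token      # keep revealed positions, mask the rest
    log_probs = f(i, t)                   # network output: (D, N) log-probabilities
    log_cat = log_probs[np.arange(D), x]  # log C(x | f(i, t)) per position
    l = (1 - m) * log_cat                 # score only the masked positions
    return l.sum() / (D - t + 1)          # L_t = 1/(D - t + 1) * sum(l)
```

A quick sanity check of the normalization: with a uniform network, `D - t + 1` positions are masked and each contributes `log(1/N)`, so the term reduces to `log(1/N)` for every sampled `t`.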
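The generation loop of Algorithm 2 can likewise be sketched under stated assumptions: `sample_walk(n)` stands in for the trained OA-ARDM (returning a node sequence started at `n`) and `edge_prob(edges)` for the trained GNN (returning a keep-probability per candidate edge). Both callables and all names here are hypothetical placeholders for illustration.

```python
import numpy as np

def arrow_diff_generate(V, d_G, sample_walk, edge_prob, L, rng=None):
    """Sketch of ARROW-Diff generation (Algorithm 2). V is an iterable of
    node ids 0..N-1, d_G the degree vector of the original graph, L the
    number of refinement steps."""
    rng = np.random.default_rng(rng)
    N = len(V)
    E_hat = set()                      # start with an empty edge set
    V_start = list(V)                  # initially, walks start at every node
    for l in range(1, L + 1):
        proposals = set()
        for n in V_start:
            walk = sample_walk(n)
            # consecutive, non-self-loop pairs become edge proposals
            proposals |= {(a, b) for a, b in zip(walk, walk[1:]) if a != b}
        candidates = sorted(E_hat | proposals)
        p = edge_prob(candidates)      # GNN edge-validity probabilities
        keep = rng.random(len(candidates)) < p
        E_hat = {e for e, k in zip(candidates, keep) if k}
        if l < L:
            d_hat = np.zeros(N)
            for a, b in E_hat:         # degrees of the partial graph
                d_hat[a] += 1
                d_hat[b] += 1
            d = np.maximum(0, d_G - d_hat)   # remaining degree budget
            if d.max() == 0:
                break
            p_start = d / d.max()            # p(n) = d_n / max(d)
            V_start = [n for n in V if rng.random() < p_start[n]]
    return E_hat
```

The degree-guided resampling at the end of each step is what biases later walks toward nodes whose generated degree still falls short of their degree in the original graph.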