Random Walk Diffusion for Efficient Large-Scale Graph Generation
Authors: Tobias Bernecker, Ghalia Rehawi, Francesco Paolo Casale, Janine Knauer-Arloth, Annalisa Marsico
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we compare ARROW-Diff to six baseline methods for graph generation that can scale to large graphs. These methods cover the different graph generation techniques explained in Section 1 and Section 2. To evaluate the performance of all methods, we use five different real-world citation graph datasets, each containing a single undirected graph; all are considered large graphs with varying numbers of nodes and edges. We also evaluate the performance of ARROW-Diff in generating more structured and non-scale-free graphs. |
| Researcher Affiliation | Academia | Tobias Bernecker (EMAIL): Computational Health Center, Helmholtz Munich; School of Computation, Information and Technology, Technical University of Munich. Ghalia Rehawi (EMAIL): Computational Health Center, Helmholtz Munich; Max Planck Institute of Psychiatry, Munich; School of Life Sciences, Technical University of Munich. Francesco Paolo Casale (EMAIL): Computational Health Center, Helmholtz Munich; Helmholtz Pioneer Campus, Helmholtz Munich; School of Computation, Information and Technology, Technical University of Munich. Janine Knauer-Arloth (EMAIL): Max Planck Institute of Psychiatry, Munich; Computational Health Center, Helmholtz Munich. Annalisa Marsico (EMAIL): Computational Health Center, Helmholtz Munich |
| Pseudocode | Yes | Algorithm 1 (Optimizing Random Walk OA-ARDMs): 1: Input: a random walk x ∈ V^D, the number of nodes N = \|V\|, and a network f. 2: Output: ELBO L_t. 3: Sample t ∼ U(1, …, D). 4: Sample σ ∼ U(S_D). 5: Compute m ← (σ < t). 6: Compute i ← m ⊙ x + (1 − m) ⊙ ((N + 1) · 1_D). 7: l ← (1 − m) ⊙ log C(x \| f(i, t)). 8: L_t ← 1/(D − t + 1) · sum(l). Algorithm 2 (ARROW-Diff Graph Generation): 1: Input: a trained OA-ARDM, a trained GNN, the node set V, features X, and degrees d_G of an original graph G with the same node ordering as used for training the OA-ARDM, and the number of steps L to generate the graph. 2: Output: a generated graph Ĝ = (V, Ê). 3: Start with an empty graph Ĝ = (V, Ê), where Ê = ∅. 4: Set the initial start nodes V_start to all nodes in the graph: V_start = V. 5: for l = 1 to L do 6: Sample one random walk for each start node n ∈ V_start using the OA-ARDM: R. 7: Compute edge proposals from R: Ê_proposals := {(n_i, n_j) ∈ R \| n_i, n_j ∈ V, i ≠ j}. 8: Run the GNN on G′ = (V, Ê ∪ Ê_proposals, X) to obtain probabilities for all edges in Ê ∪ Ê_proposals. 9: Sample valid edges Ê_valid from Ê ∪ Ê_proposals according to the edge probabilities. 10: Edge update: Ê ← Ê_valid. 11: if l < L then 12: Compute the node degrees d_Ĝ of Ĝ based on Ê. 13: Compute d := max(0, d_G − d_Ĝ). 14: Compute a probability for each node n ∈ V: p(n) = d_n / max(d). 15: Sample start nodes V_start from V according to p(n) using a Bernoulli distribution. 16: end if 17: end for |
| Open Source Code | Yes | The full list of hyper-parameters for training the OA-ARDMs and the GNN models can be found in our code repository https://github.com/marsico-lab/arrow-diff. |
| Open Datasets | Yes | To evaluate ARROW-Diff, we use five citation graph datasets: Cora-ML (McCallum et al., 2000), Cora (McCallum et al., 2000), CiteSeer (Giles et al., 1998), DBLP (Pan et al., 2016), and PubMed (Sen et al., 2008). For Cora-ML and Cora, we use the pre-processed versions from Bojchevski & Günnemann (2018). |
| Dataset Splits | Yes | Similar to Bojchevski et al. (2018), we split the edges of each graph into training, validation, and test parts, and use only the training edges to train the baseline methods, as well as the OA-ARDM and GNN for ARROW-Diff. |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) are provided. The paper only mentions 'GPU memory' in the context of reducing batch size. |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned. The paper refers to architectures like U-Net, ResNet, BERT-like, and GCN, but not the software libraries or their versions used for implementation. |
| Experiment Setup | Yes | Specifically, we use a U-Net (Ronneberger et al., 2015) architecture similar to Ho et al. (2020) with one ResNet block and two levels for the down- and up-sampling processes. Similar to Bojchevski et al. (2018) we set the random walk length, which equals the number of diffusion steps, to D = 16 and sample batches of random walks from the training split comprising edges from the original graph. Each node in the random walks, as well as each diffusion time step, is then represented using 64-dimensional learnable embeddings and passed as input to the U-Net. For the GNN component, we train a two-layer GCN (Kipf & Welling, 2016a) to predict the validity of edges based on perturbed versions of the training split of the input graph. ... The positional encodings are 64-dimensional for Cora-ML and CiteSeer, and 128-dimensional for Cora, DBLP, and PubMed. The GCN uses node embeddings of sizes 100 and 10 in the hidden and output layer, respectively. The full list of hyper-parameters for training the OA-ARDMs and the GNN models can be found in our code repository https://github.com/marsico-lab/arrow-diff. |
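The ELBO computation in Algorithm 1 can be sketched in plain NumPy. This is an illustrative reconstruction, not the authors' implementation: `f` is a stand-in for the trained network (here assumed to return per-position log-probabilities of shape `(D, N)`), and nodes are 0-indexed so index `N` plays the role of the absorbing mask token `(N + 1)` from the 1-indexed pseudocode.

```python
import numpy as np

def oa_ardm_elbo_step(x, N, f, rng=None):
    """One training-step ELBO term for a random walk x of length D over N
    nodes, following Algorithm 1. `f(i, t)` is a hypothetical network stub
    returning log-probabilities of shape (D, N)."""
    rng = np.random.default_rng(rng)
    D = len(x)
    t = int(rng.integers(1, D + 1))       # t ~ U(1, ..., D)
    sigma = rng.permutation(D) + 1        # sigma ~ U(S_D), a random order
    m = (sigma < t).astype(int)           # mask: positions revealed before step t
    mask_token = N                        # absorbing state (0-indexed nodes)
    i = m * x + (1 - m) * mask_token      # keep revealed positions, mask the rest
    log_probs = f(i, t)                   # network output: (D, N) log-probabilities
    log_cat = log_probs[np.arange(D), x]  # log C(x | f(i, t)) per position
    l = (1 - m) * log_cat                 # score only the masked positions
    return l.sum() / (D - t + 1)          # L_t = 1/(D - t + 1) * sum(l)
```

A quick sanity check of the normalization: with a uniform network, `D - t + 1` positions are masked and each contributes `log(1/N)`, so the term reduces to `log(1/N)` for every sampled `t`.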
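The generation loop of Algorithm 2 can likewise be sketched under stated assumptions: `sample_walk(n)` stands in for the trained OA-ARDM (returning a node sequence started at `n`) and `edge_prob(edges)` for the trained GNN (returning a keep-probability per candidate edge). Both callables and all names here are hypothetical placeholders for illustration.

```python
import numpy as np

def arrow_diff_generate(V, d_G, sample_walk, edge_prob, L, rng=None):
    """Sketch of ARROW-Diff generation (Algorithm 2). V is an iterable of
    node ids 0..N-1, d_G the degree vector of the original graph, L the
    number of refinement steps."""
    rng = np.random.default_rng(rng)
    N = len(V)
    E_hat = set()                      # start with an empty edge set
    V_start = list(V)                  # initially, walks start at every node
    for l in range(1, L + 1):
        proposals = set()
        for n in V_start:
            walk = sample_walk(n)
            # consecutive, non-self-loop pairs become edge proposals
            proposals |= {(a, b) for a, b in zip(walk, walk[1:]) if a != b}
        candidates = sorted(E_hat | proposals)
        p = edge_prob(candidates)      # GNN edge-validity probabilities
        keep = rng.random(len(candidates)) < p
        E_hat = {e for e, k in zip(candidates, keep) if k}
        if l < L:
            d_hat = np.zeros(N)
            for a, b in E_hat:         # degrees of the partial graph
                d_hat[a] += 1
                d_hat[b] += 1
            d = np.maximum(0, d_G - d_hat)   # remaining degree budget
            if d.max() == 0:
                break
            p_start = d / d.max()            # p(n) = d_n / max(d)
            V_start = [n for n in V if rng.random() < p_start[n]]
    return E_hat
```

The degree-guided resampling at the end of each step is what biases later walks toward nodes whose generated degree still falls short of their degree in the original graph.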