Hierarchical Language Model Design For Interpretable Graph Reasoning
Authors: Sambhav Khurana, Xiner Li, Shurui Gui, Shuiwang Ji
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments to investigate four specific research questions (RQs) to assess the effectiveness of our model on graph tasks: RQ1: Can our model accurately understand the underlying structures and maintain robust performance across different graph reasoning datasets? RQ2: Does our approach enhance interpretability performance and produce intrinsic interpretable results? RQ3: Can the proposed method handle complex real-world datasets with diverse node or edge features? RQ4: Does the proposed method work well across all node, link and graph level tasks? |
| Researcher Affiliation | Academia | Sambhav Khurana, Xiner Li, Shurui Gui, Shuiwang Ji — Department of Computer Science & Engineering, Texas A&M University |
| Pseudocode | No | The paper describes the hierarchical language model design with mathematical formulations for attention mechanisms and pooling layers, but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block with structured, code-like steps. |
| Open Source Code | No | Our implementation is under the architecture of PyTorch (Paszke et al., 2019) and PyG (Fey & Lenssen, 2019). The paper mentions the frameworks used but does not provide an explicit statement about releasing their specific code, nor a link to a code repository. |
| Open Datasets | Yes | To answer the RQ3 and RQ4, we curated seven graph datasets widely recognized in the graph learning community, varying in scale, domains, and task types. We adopt Arxiv (Hu et al., 2020b), Cora (Bojchevski & Günnemann, 2018), and Pubmed (Sen et al., 2008) for node-level tasks; Pubmed, WN18RR (Bordes et al., 2013), and FB15k-237 (Bordes et al., 2013) for link-level tasks; and molhiv (Hu et al., 2020a) for graph-level tasks. |
| Dataset Splits | Yes | For our study, we focus on node-level prediction, specifically predicting the category of each paper based on its features and structure. For node classification, we use a 60-20-20 random split for training, validation, and testing. For link classification, following the methodology of OFA (Liu et al., 2024a), we use an 85-5-10 random split. We follow the standard split for this dataset: training on papers published until 2017, validating on those from 2018, and testing on papers published since 2019. A random split of 80/10/10 is used for training, validation, and test sets. |
| Hardware Specification | Yes | The deployment environments are Ubuntu 18.04 with 48 Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz, 755GB Memory, and graphics cards NVIDIA RTX A6000. |
| Software Dependencies | No | Our implementation is under the architecture of PyTorch (Paszke et al., 2019) and PyG (Fey & Lenssen, 2019). The paper mentions the software frameworks used but does not specify their version numbers, which are critical for reproducibility. |
| Experiment Setup | Yes | We adopt the Adam optimizer (Kingma & Ba, 2014) throughout the training phase, with a learning rate of 5e-6, weight decay of 0.1, β1 = 0.9, and β2 = 0.95. Across all datasets, the training consists of 5 epochs, with a batch size of 16 for graph reasoning datasets and 8 for real-world datasets. In the local block, we employ a BERT-like architecture utilizing a special intra-node masking scheme... we use 4 local block layers. For the global block, we utilize 2 layers for most datasets, except for the Shortest Distance, Edge Count and Number of Connected Components datasets, where 3 global block layers are used. Our observations indicate that more complex tasks benefit from an increased number of global block layers, which enhances overall performance. The shared parameters for all tasks and datasets used in our language model ML(G) are summarized in Table 8. Parameter: Activation gelu, Attention Dropout 0.1, Dimension 768, Dropout 0.1, Hidden Dimension 3072, Max Position Embeddings 4096, Number of Heads 12, Number of Local Block Layers 6. The number of higher block layers for each dataset is set as follows: 4 for Cora, Pubmed, WN18RR, and FB15k-237, 2 for molhiv, and 6 for Arxiv. |
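For concreteness, the reported training hyperparameters and the 60-20-20 random node split can be sketched as below. This is a minimal, stdlib-only illustration of the quoted setup, not the authors' released code (none is available); the function name `random_split` and the seed are our own assumptions, and in an actual PyTorch pipeline the config would be passed to `torch.optim.Adam`.

```python
import random

# Optimizer settings as reported in the paper's experiment setup;
# this dict mirrors the keyword arguments torch.optim.Adam accepts.
optim_config = {
    "lr": 5e-6,
    "weight_decay": 0.1,
    "betas": (0.9, 0.95),
}
epochs = 5
batch_size_graph_reasoning = 16  # graph reasoning datasets
batch_size_real_world = 8        # real-world datasets

def random_split(num_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    """60-20-20 random node split used for node classification.

    The seed is illustrative; the paper does not report one.
    """
    rng = random.Random(seed)
    idx = list(range(num_nodes))
    rng.shuffle(idx)
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = random_split(1000)
# In PyTorch this would be wired up roughly as:
#   optimizer = torch.optim.Adam(model.parameters(), **optim_config)
```

The 85-5-10 (link classification) and 80-10-10 splits quoted above follow the same pattern with different fractions.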