Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models
Authors: Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Yuan-Fang Li, Chen Gong, Shirui Pan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on several KGQA benchmarks demonstrate that GCR achieves state-of-the-art performance and exhibits strong zero-shot generalizability to unseen KGs without additional training. |
| Researcher Affiliation | Academia | Monash University, Nanjing University of Science and Technology, Shanghai Jiao Tong University, Griffith University. Correspondence to: Shirui Pan <EMAIL>, Linhao Luo <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes (e.g., Knowledge Graph Trie Construction, Graph-constrained Decoding, Graph Inductive Reasoning) in textual format and with mathematical formulations, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data are available at: https://github.com/RManLuo/graph-constrained-reasoning |
| Open Datasets | Yes | Following previous research (Luo et al., 2024; Sun et al., 2024), we first evaluate the reasoning ability of GCR on two benchmark KGQA datasets: WebQuestionSP (WebQSP) (Yih et al., 2016) and Complex Web Questions (CWQ) (Talmor & Berant, 2018). Freebase (Bollacker et al., 2008) is adopted as the knowledge graph for both datasets. To further evaluate the generalizability of GCR, we conduct zero-shot transfer experiments on three new KGQA datasets: FreebaseQA (Jiang et al., 2019), CSQA (Talmor et al., 2019) and MedQA (Jin et al., 2021). FreebaseQA adopts the same Freebase KG. For CSQA, we use ConceptNet (Speer et al., 2017) as the KG, while for MedQA, we use a medical KG constructed from the Unified Medical Language System (Yasunaga et al., 2021). |
| Dataset Splits | Yes | To ensure fairness, we adopt the same train and test splits as previous works (Jiang et al., 2022; Luo et al., 2024). Details of the datasets can be found in Table 11. ... The training data (q, w_z, a) ∈ D_G consists of question-answer pairs and reasoning paths generated from KGs. We use the shortest paths connecting the entities in the question and answer as the reasoning path w_z for training, where details can be found in Appendix C. An example of graph-constrained decoding is illustrated in Figure 3, where <PATH> and </PATH> are special tokens to control the start and end of graph-constrained decoding. Experiment results in Section 5.2 show that even a lightweight KG-specialized LLM (0.5B) can achieve satisfactory performance in KG reasoning. |
| Hardware Specification | Yes | The training is conducted on 2 A100-80G GPUs for each model. ... System settings overview for efficiency experiments. System Setting Specification CPU Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz |
| Software Dependencies | No | The paper mentions several software components like "Spacy", "ColBERTv2", "Llama-3-8B Tokenizer implemented by Huggingface", "Python MARISA Trie", "Virtuoso SPARQL", and "Pickle". However, it does not provide specific version numbers for these software components, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | We fine-tune several lightweight LLMs ranging from 0.5B to 8B (Yang et al., 2024a; Touvron et al., 2023; Meta, 2024) on the fine-tuning datasets for 3 epochs. The batch size is set to 4 and the learning rate is set to 2e-5. We use the cosine learning rate scheduler policy with the warmup ratio set to 0.03. |
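The graph-constrained decoding that the rows above reference (a KG Trie over reasoning paths, with `<PATH>`/`</PATH>` delimiters) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the paper builds a MARISA trie over Llama-3 tokenizer IDs, while here the paths are whitespace-tokenized strings indexed in a plain dict-based trie, and the example triples are hypothetical.

```python
# Minimal sketch of KG-Trie-based constrained decoding.
# Assumption: each reasoning path is a whitespace-tokenized string;
# the paper instead uses Llama-3 token IDs in a MARISA trie.
END = "</PATH>"  # special token closing a reasoning path

def build_kg_trie(paths):
    """Index tokenized KG paths so valid continuations can be looked up."""
    trie = {}
    for path in paths:
        node = trie
        for tok in path.split() + [END]:
            node = node.setdefault(tok, {})
    return trie

def allowed_next_tokens(trie, prefix):
    """Tokens the LLM may emit after `prefix`; logits of all others are masked."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []  # prefix does not match any KG path
    return list(node)

# Hypothetical KG paths, not taken from the paper's datasets.
paths = [
    "Joe_Biden place_of_birth Scranton",
    "Joe_Biden spouse Jill_Biden",
]
trie = build_kg_trie(paths)
print(allowed_next_tokens(trie, []))             # ['Joe_Biden']
print(allowed_next_tokens(trie, ["Joe_Biden"]))  # ['place_of_birth', 'spouse']
```

At each decoding step between `<PATH>` and `</PATH>`, only the tokens returned by `allowed_next_tokens` are left unmasked, so the generated path is guaranteed to exist in the KG, which is the faithfulness property the paper claims for GCR.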