Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models
Authors: Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Yuan-Fang Li, Chen Gong, Shirui Pan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on several KGQA benchmarks demonstrate that GCR achieves state-of-the-art performance and exhibits strong zero-shot generalizability to unseen KGs without additional training. |
| Researcher Affiliation | Academia | Monash University, Nanjing University of Science and Technology, Shanghai Jiao Tong University, Griffith University. Correspondence to: Shirui Pan <EMAIL>, Linhao Luo <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes (e.g., Knowledge Graph Trie Construction, Graph-constrained Decoding, Graph Inductive Reasoning) in textual format and with mathematical formulations, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data are available at: https://github.com/RManLuo/graph-constrained-reasoning |
| Open Datasets | Yes | Following previous research (Luo et al., 2024; Sun et al., 2024), we first evaluate the reasoning ability of GCR on two benchmark KGQA datasets: WebQuestionSP (WebQSP) (Yih et al., 2016) and Complex Web Questions (CWQ) (Talmor & Berant, 2018). Freebase (Bollacker et al., 2008) is adopted as the knowledge graph for both datasets. To further evaluate the generalizability of GCR, we conduct zero-shot transfer experiments on three new KGQA datasets: FreebaseQA (Jiang et al., 2019), CSQA (Talmor et al., 2019) and MedQA (Jin et al., 2021). FreebaseQA adopts the same Freebase KG. For CSQA, we use ConceptNet (Speer et al., 2017) as the KG, while for MedQA, we use a medical KG constructed from the Unified Medical Language System (Yasunaga et al., 2021). |
| Dataset Splits | Yes | To ensure fairness, we adopt the same train and test splits as previous works (Jiang et al., 2022; Luo et al., 2024). Details of the datasets can be found in Table 11. ... The training data (q, w_z, a) ∈ D_G consists of question-answer pairs and reasoning paths generated from KGs. We use the shortest paths connecting the entities in the question and answer as the reasoning path w_z for training, where details can be found in Appendix C. An example of graph-constrained decoding is illustrated in Figure 3, where <PATH> and </PATH> are special tokens to control the start and end of graph-constrained decoding. Experiment results in Section 5.2 show that even a lightweight KG-specialized LLM (0.5B) can achieve satisfactory performance in KG reasoning. |
| Hardware Specification | Yes | The training is conducted on 2 A100-80G GPUs for each model. ... System settings overview for efficiency experiments. System Setting Specification CPU Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz |
| Software Dependencies | No | The paper mentions several software components like "Spacy", "ColBERTv2", "Llama-3-8B Tokenizer implemented by Huggingface", "Python MARISA Trie", "Virtuoso SPARQL", and "Pickle". However, it does not provide specific version numbers for these software components, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | We fine-tune several lightweight LLMs ranging from 0.5B to 8B (Yang et al., 2024a; Touvron et al., 2023; Meta, 2024) on the fine-tuning datasets for 3 epochs. The batch size is set to 4 and the learning rate is set to 2e-5. We use the cosine learning rate scheduler policy with the warmup ratio set to 0.03. |
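The graph-constrained decoding that the rows above reference (a KG Trie over reasoning paths, with `<PATH>`/`</PATH>` delimiters) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the paper builds a MARISA trie over Llama-3 tokenizer IDs, while here the paths are whitespace-tokenized strings indexed in a plain dict-based trie, and the example triples are hypothetical.

```python
# Minimal sketch of KG-Trie-based constrained decoding.
# Assumption: each reasoning path is a whitespace-tokenized string;
# the paper instead uses Llama-3 token IDs in a MARISA trie.
END = "</PATH>"  # special token closing a reasoning path

def build_kg_trie(paths):
    """Index tokenized KG paths so valid continuations can be looked up."""
    trie = {}
    for path in paths:
        node = trie
        for tok in path.split() + [END]:
            node = node.setdefault(tok, {})
    return trie

def allowed_next_tokens(trie, prefix):
    """Tokens the LLM may emit after `prefix`; logits of all others are masked."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []  # prefix does not match any KG path
    return list(node)

# Hypothetical KG paths, not taken from the paper's datasets.
paths = [
    "Joe_Biden place_of_birth Scranton",
    "Joe_Biden spouse Jill_Biden",
]
trie = build_kg_trie(paths)
print(allowed_next_tokens(trie, []))             # ['Joe_Biden']
print(allowed_next_tokens(trie, ["Joe_Biden"]))  # ['place_of_birth', 'spouse']
```

At each decoding step between `<PATH>` and `</PATH>`, only the tokens returned by `allowed_next_tokens` are left unmasked, so the generated path is guaranteed to exist in the KG, which is the faithfulness property the paper claims for GCR.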