Reasoning of Large Language Models over Knowledge Graphs with Super-Relations

Authors: Song Wang, Junhong Lin, Xiaojie Guo, Julian Shun, Jundong Li, Yada Zhu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on nine real-world datasets to evaluate ReKnoS, and the results demonstrate the superior performance of ReKnoS over existing state-of-the-art baselines, with an average accuracy gain of 2.92%.
Researcher Affiliation | Collaboration | Song Wang1, Junhong Lin2, Xiaojie Guo3, Julian Shun2, Jundong Li1, Yada Zhu3; 1University of Virginia, 2MIT CSAIL, 3IBM
Pseudocode | No | The paper describes the framework steps and reasoning process in sections (e.g., '4 SUPER-RELATION REASONING', '4.1 CANDIDATE SELECTION', '4.2 SCORING'), but it does not present a formal pseudocode block or algorithm.
Open Source Code | Yes | Our code is provided at https://github.com/SongW-SW/REKNOS.
Open Datasets | Yes | To evaluate the performance of our framework on multi-hop knowledge-intensive reasoning tasks, we conduct tests using four KBQA datasets: CWQ (Talmor & Berant, 2018), WebQSP (Yih et al., 2016), GrailQA (Gu et al., 2021), and SimpleQA (Bordes et al., 2015). Additionally, we include one open-domain QA dataset, WebQ (Berant et al., 2013), two slot-filling datasets, T-REx (Elsahar et al., 2018) and Zero-Shot RE (Petroni et al., 2021), one multi-hop complex QA dataset, HotpotQA (Yang et al., 2018), and one fact-checking dataset, Creak (Onoe et al., 2021). For the larger datasets, GrailQA and SimpleQA, 1,000 samples were randomly selected for testing to reduce computational overhead. For all of the datasets, we use exact match accuracy (Hits@1) as our evaluation metric, consistent with prior studies (Li et al., 2024; Jiang et al., 2023b). We use Freebase (Bollacker et al., 2008) as the KG for CWQ, WebQSP, GrailQA, SimpleQA, and WebQ. We use Wikidata (Vrandečić & Krötzsch, 2014) as the KG for T-REx, Zero-Shot RE, HotpotQA, and Creak.
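The exact-match Hits@1 metric cited above can be illustrated with a minimal sketch. This is a hypothetical helper, not code from the REKNOS repository; the normalization (case-folding and whitespace stripping) is an assumption, as evaluation scripts vary in how aggressively they normalize answers.

```python
def hits_at_1(predictions, gold_answers):
    """Exact-match accuracy (Hits@1): the fraction of questions whose
    single top-ranked prediction exactly matches any gold answer.
    Normalization here (lowercase + strip) is an illustrative choice."""
    correct = 0
    for pred, answers in zip(predictions, gold_answers):
        norm = pred.strip().lower()
        if any(norm == ans.strip().lower() for ans in answers):
            correct += 1
    return correct / len(predictions)
```

For example, `hits_at_1(["Paris", "berlin"], [["Paris"], ["Munich"]])` scores the first question correct and the second incorrect, yielding 0.5.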
Dataset Splits | No | For the larger datasets, GrailQA and SimpleQA, 1,000 samples were randomly selected for testing to reduce computational overhead. This specifies a test subset for two datasets, but does not provide complete, reproducible training/validation/test splits for all datasets used in the experiments.
Hardware Specification | Yes | We run all of our experiments on one NVIDIA A6000 GPU with 48GB of memory.
Software Dependencies | No | The paper mentions specific LLM models used (e.g., GPT-3.5, GPT-4o-mini, Llama-2-7B, Mistral-7B, Llama-3-8B, GPT-4) but does not list specific software dependencies like programming languages, libraries, or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | Across all datasets and methods, we set the width N to 3 and the maximum length L to 3. When prompting the LLM to score super-relations, we use 3 examples as in-context learning demonstrations, following the existing work on ToG (Sun et al., 2023).
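The width-N, maximum-length-L exploration described in the setup can be sketched as a simple beam-style loop. This is a hedged illustration under stated assumptions, not the paper's implementation: `score_fn` stands in for the LLM scorer of super-relations and `expand_fn` for a KG neighbor lookup, both hypothetical names; only the reported hyperparameters (N=3, L=3) come from the paper.

```python
def expand_paths(question, start_relations, score_fn, expand_fn, N=3, L=3):
    """Keep the top-N relation paths per step, up to length L.

    score_fn(question, path) -> float  (stand-in for the LLM scorer)
    expand_fn(relation) -> list of next relations (stand-in for KG lookup)
    """
    frontier = [[r] for r in start_relations]
    for _ in range(L):
        candidates = []
        for path in frontier:
            for nxt in expand_fn(path[-1]):
                candidates.append(path + [nxt])
        if not candidates:  # no path can be extended further
            break
        # Rank extended paths by the scorer and keep the N best.
        candidates.sort(key=lambda p: score_fn(question, p), reverse=True)
        frontier = candidates[:N]
    return frontier
```

With a toy graph such as `{"a": ["b", "c"], "c": ["d"]}`, the loop extends paths from "a" and prunes to the N highest-scoring ones at each of the L steps.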