Reasoning of Large Language Models over Knowledge Graphs with Super-Relations

Authors: Song Wang, Junhong Lin, Xiaojie Guo, Julian Shun, Jundong Li, Yada Zhu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on nine real-world datasets to evaluate ReKnoS, and the results demonstrate the superior performance of ReKnoS over existing state-of-the-art baselines, with an average accuracy gain of 2.92%.
Researcher Affiliation | Collaboration | Song Wang1, Junhong Lin2, Xiaojie Guo3, Julian Shun2, Jundong Li1, Yada Zhu3; 1University of Virginia, 2MIT CSAIL, 3IBM
Pseudocode | No | The paper describes the framework steps and reasoning process in sections (e.g., '4 SUPER-RELATION REASONING', '4.1 CANDIDATE SELECTION', '4.2 SCORING'), but it does not present a formal pseudocode block or algorithm.
Open Source Code | Yes | Our code is provided at https://github.com/SongW-SW/REKNOS.
Open Datasets | Yes | To evaluate the performance of our framework on multi-hop knowledge-intensive reasoning tasks, we conduct tests using four KBQA datasets: CWQ (Talmor & Berant, 2018), WebQSP (Yih et al., 2016), GrailQA (Gu et al., 2021), and SimpleQA (Bordes et al., 2015). Additionally, we include one open-domain QA dataset, WebQ (Berant et al., 2013), two slot-filling datasets, T-REx (Elsahar et al., 2018) and Zero-Shot RE (Petroni et al., 2021), one multi-hop complex QA dataset, HotpotQA (Yang et al., 2018), and one fact-checking dataset, Creak (Onoe et al., 2021). For the larger datasets, GrailQA and SimpleQA, 1,000 samples were randomly selected for testing to reduce computational overhead. For all of the datasets, we use exact match accuracy (Hits@1) as our evaluation metric, consistent with prior studies (Li et al., 2024; Jiang et al., 2023b). We use Freebase (Bollacker et al., 2008) as the KG for CWQ, WebQSP, GrailQA, SimpleQA, and WebQ. We use Wikidata (Vrandečić & Krötzsch, 2014) as the KG for T-REx, Zero-Shot RE, HotpotQA, and Creak.
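The exact-match Hits@1 metric cited above can be illustrated with a minimal sketch. This is a hypothetical helper, not code from the REKNOS repository; the normalization (case-folding and whitespace stripping) is an assumption, as evaluation scripts vary in how aggressively they normalize answers.

```python
def hits_at_1(predictions, gold_answers):
    """Exact-match accuracy (Hits@1): the fraction of questions whose
    single top-ranked prediction exactly matches any gold answer.
    Normalization here (lowercase + strip) is an illustrative choice."""
    correct = 0
    for pred, answers in zip(predictions, gold_answers):
        norm = pred.strip().lower()
        if any(norm == ans.strip().lower() for ans in answers):
            correct += 1
    return correct / len(predictions)
```

For example, `hits_at_1(["Paris", "berlin"], [["Paris"], ["Munich"]])` scores the first question correct and the second incorrect, yielding 0.5.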
Dataset Splits | No | For the larger datasets, GrailQA and SimpleQA, 1,000 samples were randomly selected for testing to reduce computational overhead. This specifies a test subset for two datasets, but does not provide complete, reproducible training/validation/test splits for all datasets used in the experiments.
Hardware Specification | Yes | We run all of our experiments on one NVIDIA A6000 GPU with 48GB of memory.
Software Dependencies | No | The paper mentions specific LLM models used (e.g., GPT-3.5, GPT-4o-mini, Llama-2-7B, Mistral-7B, Llama-3-8B, GPT-4) but does not list specific software dependencies like programming languages, libraries, or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | Across all datasets and methods, we set the width N to 3 and the maximum length L to 3. When prompting the LLM to score super-relations, we use 3 examples as in-context learning demonstrations, following the existing work on ToG (Sun et al., 2023).
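The width-N, maximum-length-L exploration described in the setup can be sketched as a simple beam-style loop. This is a hedged illustration under stated assumptions, not the paper's implementation: `score_fn` stands in for the LLM scorer of super-relations and `expand_fn` for a KG neighbor lookup, both hypothetical names; only the reported hyperparameters (N=3, L=3) come from the paper.

```python
def expand_paths(question, start_relations, score_fn, expand_fn, N=3, L=3):
    """Keep the top-N relation paths per step, up to length L.

    score_fn(question, path) -> float  (stand-in for the LLM scorer)
    expand_fn(relation) -> list of next relations (stand-in for KG lookup)
    """
    frontier = [[r] for r in start_relations]
    for _ in range(L):
        candidates = []
        for path in frontier:
            for nxt in expand_fn(path[-1]):
                candidates.append(path + [nxt])
        if not candidates:  # no path can be extended further
            break
        # Rank extended paths by the scorer and keep the N best.
        candidates.sort(key=lambda p: score_fn(question, p), reverse=True)
        frontier = candidates[:N]
    return frontier
```

With a toy graph such as `{"a": ["b", "c"], "c": ["d"]}`, the loop extends paths from "a" and prunes to the N highest-scoring ones at each of the L steps.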