LLM4VKG: Leveraging Large Language Models for Virtual Knowledge Graph Construction
Authors: Guohui Xiao, Lin Ren, Guilin Qi, Haohan Xue, Marco Di Panfilo, Davide Lanti
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluation on the RODI benchmark demonstrates that LLM4VKG surpasses state-of-the-art methods, achieving an average F1-score improvement of +17% and a peak gain of +39%. |
| Researcher Affiliation | Academia | School of Computer Science and Engineering, Southeast University, Nanjing, China; Free University of Bozen-Bolzano, Italy |
| Pseudocode | No | The paper describes the methodology in prose and includes SPARQL queries in Section 4.1, but it does not contain any clearly labeled pseudocode or algorithm blocks describing the LLM4VKG framework or its components. |
| Open Source Code | Yes | All code and datasets associated with this work are publicly available at https://github.com/Homura T/LLM4VKG |
| Open Datasets | Yes | We evaluate LLM4VKG on RODI [Pinkel et al., 2018] and RODI-T (x%), a variant of RODI in which x% of the ontology vocabulary is removed from the ontology starting from the leaf nodes. |
| Dataset Splits | No | The paper describes the structure of RODI benchmark samples and how queries are evaluated, but it does not specify how the data is partitioned into training, validation, or test sets for LLM4VKG's operation or evaluation. It states that 'A RODI sample comprises three main elements: a database schema, a golden ontology, and a set of query pairs', which describes the nature of the samples but not their partitioning for experimentation. |
| Hardware Specification | No | The paper thanks the Big Data Computing Center of Southeast University for 'facility support on the numerical calculations' but does not give specific hardware details such as GPU/CPU models or memory amounts used for the experiments. It names the LLMs used as backbone models (GPT-4o, Qwen2.5-7b) but not the hardware on which they were run. |
| Software Dependencies | Yes | In this study, we leverage the VKG system Ontop [Calvanese et al., 2017] to initialize the generated VKG based on a database connection, an ontology, and a set of mappings. For Retriever, we use bge-m3 [Chen et al., 2024] as the backbone model. For Matcher and Namer, we incorporate GPT-4o, GPT-4o-mini [Open AI, 2024a], and Qwen2.5-7b [Qwen, 2024] as backbone models, representing various levels of performance across LLMs. |
| Experiment Setup | No | The paper mentions that the Retriever module uses a pre-trained sentence similarity language model to retrieve 'top-n candidate elements' where 'n is a hyperparameter,' but the specific value for 'n' is not provided. It also states that 'The detailed prompts for the modules are in Appendix B,' implying some experimental details are not in the main text. Specific hyperparameters like learning rates, batch sizes, or optimizer settings are not described in the main body of the paper. |
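The Retriever row above notes that candidates are selected as the top-n elements by sentence-embedding similarity (bge-m3 as the backbone, with n an unreported hyperparameter). A minimal, self-contained sketch of such top-n retrieval via cosine similarity, using toy hand-written vectors in place of actual bge-m3 embeddings (the element names, vectors, and n value below are illustrative assumptions, not taken from the paper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n(query_vec, candidates, n):
    """Return the n candidate names most similar to the query embedding."""
    ranked = sorted(candidates.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:n]]

# Toy stand-ins for embeddings of ontology vocabulary elements.
ontology_elements = {
    "Person":   [0.9, 0.1, 0.0],
    "Employee": [0.8, 0.2, 0.1],
    "Project":  [0.1, 0.9, 0.2],
}
query = [0.85, 0.15, 0.05]  # toy embedding of a database column description
print(top_n(query, ontology_elements, n=2))  # → ['Person', 'Employee']
```

In the actual framework, the query and candidate vectors would come from the bge-m3 sentence-embedding model rather than being fixed by hand.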