Is Complex Query Answering Really Complex?
Authors: Cosimo Gregucci, Bo Xiong, Daniel Hernández, Lorenzo Loconte, Pasquale Minervini, Steffen Staab, Antonio Vergari
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a systematic empirical investigation, the new benchmarks show that current CQA methods leave much to be desired. We re-evaluate previous SoTA approaches (Sec. 5), revealing that neural link predictors rely on memorized information from the training set. |
| Researcher Affiliation | Collaboration | 1Institute for Artificial Intelligence, University of Stuttgart, Germany; 2Stanford University; 3School of Informatics, University of Edinburgh, UK; 4Miniml.AI; 5University of Southampton, UK. |
| Pseudocode | No | The paper describes methods and procedures in narrative form, without explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Old and new benchmarks, the generation scripts, and the implementation of CQD-hybrid are included in our official repo: https://github.com/april-tools/is-cqa-complex |
| Open Datasets | Yes | Performance measured on de-facto standard benchmarks such as FB15k237 (Toutanova & Chen, 2015) and NELL995 (Xiong et al., 2017) suggests impressive progress achieved in recent years on CQA on queries having different structures... To these, we build ICEWS18+H from the temporal KG ICEWS18 (Boschee et al., 2015)... |
| Dataset Splits | Yes | To evaluate them, standard benchmarks such as FB15k237 and NELL995 artificially divide G into Gtrain and Gtest, treating the triples in the latter as missing links. To this end, we leverage the temporal information in ICEWS18 by (1) ordering the links based on their timestamp; (2) removing the temporal information, thus obtaining regular triples; and (3) selecting the train set to be the first temporally-ordered 80% of triples, the validation set the next 10%, and the remainder to be the test split. |
| Hardware Specification | No | The paper mentions receiving compute time on "HoreKa HPC (NHR@KIT)" but does not provide specific details on the GPU/CPU models, processor types, or memory used for the experiments. |
| Software Dependencies | No | The paper mentions various models and frameworks used (e.g., GNN-QE, ULTRAQ, CQD, ComplEx, ConE, CLMPT, QTO) but does not provide specific version numbers for these software components or their underlying libraries (e.g., PyTorch, TensorFlow, Python version, CUDA version). |
| Experiment Setup | Yes | CQD-specific hyperparameters, namely the CQD beam size k, ranging in [2, 512], and the t-norm type being prod or min. In Table F.1 we provide the hyperparameter selection for the old benchmarks FB15k237 and NELL995. GNN-QE: we tuned the following hyperparameters: (1) batch size, with values 8 or 48, and concat hidden being True or False... CQD: we train a ComplEx (Trouillon et al., 2017) link predictor with hyperparameters reg. weight 0.1 or 0.01, and batch size 1000 or 2000... CLMPT: we tuned the following hyperparameters: (1) learning rate, with values in [1e-5, 5e-2, 5e-3, 5e-4, 5e-5, 5e-6], (2) temp, with values in [0.1, 0.2]... |
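The temporal split quoted in the Dataset Splits row can be sketched in a few lines. This is a minimal illustration on toy data, not the authors' actual generation script from the linked repo; the quadruple layout `(head, relation, tail, timestamp)` and the helper name `temporal_split` are assumptions for the sketch.

```python
def temporal_split(quads, train_frac=0.8, valid_frac=0.1):
    """Split timestamped quadruples (h, r, t, ts) into train/valid/test
    triples: (1) order by timestamp, (2) drop the temporal information,
    (3) take the first 80% as train, the next 10% as valid, rest as test.
    Hypothetical helper illustrating the procedure described in the paper."""
    ordered = sorted(quads, key=lambda q: q[3])      # (1) order by timestamp
    triples = [(h, r, t) for h, r, t, _ in ordered]  # (2) strip timestamps
    n_train = int(len(triples) * train_frac)
    n_valid = int(len(triples) * valid_frac)
    return (triples[:n_train],                       # (3) 80/10/10 split
            triples[n_train:n_train + n_valid],
            triples[n_train + n_valid:])

# Toy example: 10 quadruples with shuffled timestamps.
quads = [(f"h{i}", "r", f"t{i}", ts)
         for i, ts in enumerate([7, 2, 9, 4, 1, 8, 3, 10, 5, 6])]
train, valid, test = temporal_split(quads)
# train holds the 8 temporally-earliest triples; valid and test hold 1 each.
```

Splitting on time rather than uniformly at random avoids test links that predate training links, which matters for the paper's argument about memorized information.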