Improving Soft Unification with Knowledge Graph Embedding Methods

Authors: Xuanming Cui, Chionh Wei Peng, Adriel Kuek, Ser-Nam Lim

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on popular link prediction datasets including Countries, Nations, UMLS and Kinship (Kemp et al., 2006). Following GNTP (Minervini et al., 2019) we experiment on FB122 (Guo et al., 2016a), WN18RR (Dettmers et al., 2018), and additionally CoDEx-S (Safavi & Koutra, 2020). In Tables 2, 4 and 5 we show link prediction results on the evaluated datasets. We provide detailed ablations to examine the key factors in the integration.
Researcher Affiliation | Academia | (1) Department of Computer Science, University of Central Florida; (2) DSO National Laboratories. Correspondence to: Xuanming Cui <EMAIL>.
Pseudocode | Yes | Appendix J, "Pseudo-code implementation": Algorithm 1, Python pseudo-code for NTP with top-k retrieval following the implementation from Minervini et al. (2019); Algorithm 2, simplified Python pseudo-code for CTP following Minervini et al. (2020).
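To make the top-k retrieval step concrete, here is an illustrative sketch (not the paper's Algorithm 1): NTP-style soft unification keeps the k facts whose embeddings are closest to the goal's embedding and scores each pair with a similarity kernel. The RBF kernel below is one common choice in this line of work; the exact kernel, batching, and retrieval machinery in the paper may differ.

```python
# Hedged sketch of NTP-style soft unification with top-k retrieval.
# The RBF similarity kernel and the toy data are illustrative assumptions.
import numpy as np

def soft_unify_topk(goal, facts, k, mu=1.0):
    """Return (indices, scores) of the k facts most similar to `goal`."""
    d = np.linalg.norm(facts - goal, axis=1)      # L2 distance to every fact
    topk = np.argsort(d)[:k]                      # indices of the k nearest facts
    scores = np.exp(-d[topk]**2 / (2 * mu**2))    # RBF similarity in (0, 1]
    return topk, scores

rng = np.random.default_rng(1)
facts = rng.normal(size=(50, 8))   # embedded KB facts (toy data)
goal = facts[7] * 0.99             # a goal lying close to fact 7
idx, sc = soft_unify_topk(goal, facts, k=4)
print(idx[0])  # fact 7 unifies best
```

Restricting unification to the retrieved top-k is what keeps the proof search tractable; the FAISS index discussed under "Software Dependencies" performs exactly this nearest-neighbor step at scale.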
Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We conduct experiments on popular link prediction datasets including Countries, Nations, UMLS and Kinship (Kemp et al., 2006). Following GNTP (Minervini et al., 2019) we experiment on FB122 (Guo et al., 2016a), WN18RR (Dettmers et al., 2018), and additionally CoDEx-S (Safavi & Koutra, 2020).
Dataset Splits | Yes | FB122 consists of two test splits: Test-I and Test-II, where Test-II contains the set of triplets that can be inferred via logic rules, and Test-I denotes all other triplets. We follow the same evaluation protocol as in GNTP and CTP, and report Mean Reciprocal Rank (MRR) and HITS@m under the filtered setting. Table 10 (dataset statistics) reports, per dataset, the number of entities (|E|), predicates (|R|), and training/validation/test samples: Kinship (Kemp et al., 2006): ... / 8,544 / 1,068 / 1,074; Nations (Kemp et al., 2006): ... / 1,592 / 199 / 201; UMLS (Kemp et al., 2006): ... / 5,216 / 652 / 661; FB122 (Guo et al., 2016a): ... / 91,638 / 9,595 / 11,243; WN18RR (Dettmers et al., 2018): ... / 86,835 / 3,034 / 3,134.
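The quoted protocol (MRR and HITS@m under the filtered setting) can be sketched as follows. This is a minimal illustration of the standard link-prediction metrics, not the authors' evaluation code: in the filtered setting, all known-true candidates other than the test target are masked out before ranking.

```python
# Minimal sketch of filtered-ranking evaluation for link prediction.
import numpy as np

def filtered_rank(scores, target_idx, known_idx):
    """Rank of the target candidate after masking ('filtering') every other
    known-true candidate so it cannot outrank the target."""
    scores = scores.copy()
    mask = [i for i in known_idx if i != target_idx]
    scores[mask] = -np.inf  # filtered setting: drop other true triples
    # rank = 1 + number of candidates scored strictly higher than the target
    return 1 + int((scores > scores[target_idx]).sum())

def mrr_and_hits(ranks, ms=(1, 3, 10)):
    """Mean Reciprocal Rank and HITS@m over a list of per-query ranks."""
    ranks = np.asarray(ranks, dtype=float)
    mrr = float((1.0 / ranks).mean())
    hits = {m: float((ranks <= m).mean()) for m in ms}
    return mrr, hits

# Toy usage: model scores for 5 candidate tails of one query (s, p, ?).
scores = np.array([0.9, 0.2, 0.8, 0.1, 0.7])
# Candidate 2 is the test answer; candidate 0 is another known-true tail.
r = filtered_rank(scores, target_idx=2, known_idx=[0, 2])
print(r)  # candidate 0 is filtered out, so the target ranks 1st
```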
Hardware Specification | Yes | In Figure 3, we show per-sample inference and evaluation time under CTP, CTP3 and CTP4. For inference, CTP3 requires 2× and 7× less time compared to CTP on the FB122 and WN18RR datasets, while CTP4 reduces it even further, by 28× and 92×. For evaluation, CTP3 requires 2× less time than CTP on both datasets, while CTP4 reduces it by 942× and 1452× on FB122 and WN18RR. [...] on an NVIDIA V100 GPU with batch size = 512.
Software Dependencies | No | Indexing Library. In this work we use the FAISS search index (Johnson et al., 2019). We use the GPU version of the library and the IndexFlatL2 index, which performs exact search using L2 distance. Although the FAISS library is mentioned, no specific version number is provided for it or any other software dependency.
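For reference, FAISS's IndexFlatL2 performs brute-force exact search. The NumPy sketch below shows what that computation amounts to, without requiring FAISS itself; the actual pipeline would build a (GPU) `faiss.IndexFlatL2` instead, and the toy data here is an assumption for illustration.

```python
# Sketch of the exact L2 top-k search that FAISS's IndexFlatL2 performs:
# brute-force squared-distance computation followed by selection of the
# k smallest distances per query.
import numpy as np

def exact_l2_topk(xb, xq, k):
    """For each query row in xq, return the indices and squared L2 distances
    of its k nearest rows in the base matrix xb."""
    # squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = (xq**2).sum(1, keepdims=True) - 2 * xq @ xb.T + (xb**2).sum(1)
    idx = np.argsort(d2, axis=1)[:, :k]           # k smallest distances per row
    return idx, np.take_along_axis(d2, idx, axis=1)

rng = np.random.default_rng(0)
xb = rng.normal(size=(100, 16)).astype(np.float32)  # base vectors (toy KB embeddings)
xq = xb[:3] + 1e-3                                   # queries near base rows 0..2
idx, _ = exact_l2_topk(xb, xq, k=4)
print(idx[:, 0])  # nearest neighbor of each query is its own base row: [0 1 2]
```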
Experiment Setup | Yes | For hyper-parameters we follow CTP (Minervini et al., 2020) on the Kinship and UMLS datasets for all experiments. Specifically, we use embedding size=50, top-k=4, batch size=8, learning rate=0.1, trained for 100 epochs with the Adagrad optimizer. For each triplet we sample 3 negative samples per entity (a total of 9 negative samples per triplet). For Nations we use batch size=256 with the AdamW optimizer for the CTP2 variant, and the same settings as CTP for the remaining models. For FB122 we mostly follow the setting from GNTP (Minervini et al., 2019), with embedding size=100, top-k=10, and 1 negative sample per entity. We use the Adagrad optimizer and train for 100 epochs.
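The quoted count (3 negatives "per entity" for 9 total per triplet) is not fully spelled out; one reading consistent with the arithmetic, and with common CTP-style implementations, is to draw 3 corruptions each for the subject slot, the object slot, and both slots at once. The sketch below illustrates that assumed scheme and is not the authors' code; the entity names are made up.

```python
# Hedged sketch of per-triplet negative sampling: 3 samples each for
# corrupted-subject, corrupted-object, and corrupted-both (9 total).
# The "corrupt both" slot is an assumption inferred from the 9-sample count.
import random

def sample_negatives(triplet, entities, n_per_slot=3, rng=random):
    s, p, o = triplet
    corrupt = lambda avoid: rng.choice([e for e in entities if e != avoid])
    negatives = []
    for _ in range(n_per_slot):
        negatives.append((corrupt(s), p, o))           # corrupt subject
        negatives.append((s, p, corrupt(o)))           # corrupt object
        negatives.append((corrupt(s), p, corrupt(o)))  # corrupt both
    return negatives

entities = ["alice", "bob", "carol", "dave"]
negs = sample_negatives(("alice", "parentOf", "bob"), entities)
print(len(negs))  # 9 negatives per triplet, matching the quoted setup
```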