SetKE: Knowledge Editing for Knowledge Elements Overlap

Authors: Yifan Wei, Xiaoyan Yu, Ran Song, Hao Peng, Angsheng Li

IJCAI 2025

Reproducibility assessment. Each entry below gives the reproducibility variable, the assessed result, and the LLM's supporting response (excerpted or paraphrased from the paper).
Research Type: Experimental
"Experimental results demonstrate that SetKE outperforms existing methods in KEO scenarios on mainstream LLMs. Additionally, we introduce EDITSET, a dataset containing KEO triplets, providing a comprehensive benchmark."
Researcher Affiliation: Academia
¹State Key Laboratory of CCSE, School of Computer Science and Engineering, Beihang University; ²School of Computer Science and Technology, Beijing Institute of Technology; ³Kunming University of Science and Technology.
Pseudocode: Yes
"The Hungarian algorithm guarantees finding the optimal matching in O(N³) time complexity, as shown in Appendix Algorithm 1."
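The optimal-matching step that the paper's Appendix Algorithm 1 solves with the Hungarian algorithm can be illustrated with a small sketch. For brevity, this sketch finds the same minimum-cost assignment by brute-force search over permutations (O(N!)); the Hungarian algorithm reaches the identical optimum in O(N³). The cost matrix here is a hypothetical example, not data from the paper.

```python
from itertools import permutations


def optimal_matching(cost):
    """Return (assignment, total_cost) minimizing sum of cost[i][perm[i]].

    Brute-force O(N!) check for small N; the Hungarian algorithm used in
    the paper's Appendix Algorithm 1 computes the same optimum in O(N^3).
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost


# Hypothetical 3x3 cost matrix: cost[i][j] = cost of pairing element i with slot j.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
assignment, total = optimal_matching(cost)
```

For realistic N, `scipy.optimize.linear_sum_assignment` implements the same optimal assignment with the polynomial-time algorithm.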
Open Source Code: No
The paper does not contain an explicit statement about releasing open-source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets: Yes
"We propose a novel formulation of Knowledge Set Editing (KSE) and construct a new dataset, EDITSET, to facilitate in-depth exploration of Knowledge Element Overlap (KEO)... Building on this observation, we collect KEO instances from Wikidata to construct a new dataset, EDITSET, enabling a more comprehensive exploration of KEO in KE. The dataset comprises over 700 relation types, with our study focusing on the 31 most common ones, consistent with prior research [Levy et al., 2017; Elazar et al., 2021; Meng et al., 2022a; Zhong et al., 2023; Wei et al., 2024; Yin et al., 2024; Ma et al., 2024]."
Dataset Splits: Yes
"The counterfactual prompt is employed to assess Efficacy, the paraphrase prompt for Generalization, and the neighborhood prompt for Locality. ... The EDITSET dataset consists of three types of prompts; Counter.P., Para.P., and Neigh.P. denote Counterfactual Prompt, Paraphrase Prompt, and Neighborhood Prompt, respectively."
Hardware Specification: No
The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used to run the experiments.
Software Dependencies: No
The paper mentions using large language models such as GPT2-Large, GPT2-XL, and GPT-J, but does not provide specific software dependencies (e.g., library names with version numbers such as PyTorch, TensorFlow, or CUDA) used for implementation.
Experiment Setup: Yes
"Evaluation Metrics: The evaluation metrics for the new formulation of KSE remain consistent with previous works (where the object is singular) [Meng et al., 2022a; Meng et al., 2022b]... Language Models: We employ three widely adopted autoregressive language models, namely GPT2-Large (760M), GPT2-XL (1.5B), and GPT-J (6B) [Radford et al., 2019], as the base language models to perform editing and assess the effectiveness of the KE approaches. Baselines: We select the following approaches: FT-W is a basic fine-tuning method; KN [Dai et al., 2022]...; MEND [Mitchell et al., 2021]...; ROME [Meng et al., 2022a]...; MEMIT [Meng et al., 2022b]...; PMET [Li et al., 2024]..."
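The three metrics named above map one-to-one onto EDITSET's prompt types: Efficacy is the edit success rate on counterfactual prompts, Generalization the success rate on paraphrase prompts, and Locality the rate at which neighborhood prompts remain correct. A minimal sketch of that bookkeeping, with hypothetical record fields (`prompt_type`, `success`) standing in for whatever the evaluation harness actually produces:

```python
def score_edits(records):
    """Aggregate per-prompt success flags into the three KE metrics.

    `records` is a list of dicts with hypothetical fields:
      prompt_type: 'counterfactual' | 'paraphrase' | 'neighborhood'
      success: True if the model's answer matched the expected one.
    """
    metric_for = {
        "counterfactual": "efficacy",    # did the edit take effect?
        "paraphrase": "generalization",  # does the edit survive rephrasing?
        "neighborhood": "locality",      # are unrelated facts untouched?
    }
    totals = {m: [0, 0] for m in metric_for.values()}  # metric -> [hits, count]
    for r in records:
        metric = metric_for[r["prompt_type"]]
        totals[metric][0] += int(r["success"])
        totals[metric][1] += 1
    return {m: hits / count for m, (hits, count) in totals.items() if count}


# Tiny hypothetical run: 2 counterfactual, 1 paraphrase, 1 neighborhood prompt.
records = [
    {"prompt_type": "counterfactual", "success": True},
    {"prompt_type": "counterfactual", "success": False},
    {"prompt_type": "paraphrase", "success": True},
    {"prompt_type": "neighborhood", "success": True},
]
scores = score_edits(records)
```

This separation is what lets the benchmark distinguish an edit that merely memorized one prompt (high Efficacy, low Generalization) from one that overwrote unrelated knowledge (low Locality).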