DocKS-RAG: Optimizing Document-Level Relation Extraction through LLM-Enhanced Hybrid Prompt Tuning

Authors: Xiaolong Xu, Yibo Zhou, Haolong Xiang, Xiaoyong Li, Xuyun Zhang, Lianyong Qi, Wanchun Dou

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Finally, extensive experiments conducted on two benchmark datasets demonstrate that our proposed framework enhances all the metrics compared with state-of-the-art methods.
Researcher Affiliation Academia 1School of Software, Nanjing University of Information Science and Technology, China 2College of Meteorology and Oceanography, National University of Defense Technology, China 3School of Computing, Macquarie University, Australia 4College of Computer Science and Technology, China University of Petroleum (East China), China 5State Key Laboratory for Novel Software Technology, Nanjing University, China. Correspondence to: Haolong Xiang <EMAIL>.
Pseudocode Yes Algorithm 1 Hybrid-Prompt Tuning on Large Language Models
1: Input: Observable documents Du, given triplets Tu, graph model Gm, embedding model Em, informative generation function Gen, recycling times T, parameters of adapter A, user query Q.
2: Output: Predicted triplets Tp.
3: KG ← Train(Gm, Tu)
4: Segment Du into individual sentences and obtain the sentence sets Su.
5: KB ← Em(Su)
6: Retrieve the relevant subgraph Gq and top-K sentences Sk from KG and KB according to Q.
7: Pg ← Gen(Gq), Ps ← Gen(Sk)
8: Ph ← Concat(Pg, Ps)
9: for t = 1 to T do
10:   LLMnew ← Train(LLMold, Ph, A)
11:   Update parameters of adapter A
12: end for
13: Tp ← LLMnew(P̂h)
14: Return: Tp
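The control flow of Algorithm 1 can be sketched as a plain-Python skeleton. All functions below are hypothetical stand-ins for the paper's components (graph training, BGE embedding, retrieval, LoRA tuning), kept deliberately trivial so the pipeline is runnable; this is not the authors' implementation.

```python
# Hypothetical skeleton of Algorithm 1 (Hybrid-Prompt Tuning).
# Each stand-in function is annotated with the algorithm line it mirrors.

def train_graph_model(gm, triplets):          # line 3: KG <- Train(Gm, Tu)
    return {"graph": gm, "triplets": list(triplets)}

def embed_sentences(em, sentences):           # line 5: KB <- Em(Su)
    return {s: em(s) for s in sentences}

def retrieve(kg, kb, query, top_k=5):         # line 6: subgraph Gq and top-K sentences Sk
    subgraph = [t for t in kg["triplets"] if query in t]
    top_sents = sorted(kb, key=len)[:top_k]
    return subgraph, top_sents

def gen_prompt(items):                        # line 7: Gen(.) -> informative prompt
    return " ; ".join(map(str, items))

def hybrid_prompt_tuning(docs, triplets, gm, em, query, T=3):
    kg = train_graph_model(gm, triplets)
    sentences = [s for d in docs for s in d.split(". ")]   # line 4: segmentation
    kb = embed_sentences(em, sentences)
    gq, sk = retrieve(kg, kb, query)
    ph = gen_prompt(gq) + " | " + gen_prompt(sk)           # line 8: Ph <- Concat(Pg, Ps)
    adapter = {"updates": 0}
    for _ in range(T):                                     # lines 9-12: recycling loop
        adapter["updates"] += 1                            # stand-in for one LoRA step
    return {"prompt": ph, "adapter": adapter}              # line 13: tuned LLM predicts

result = hybrid_prompt_tuning(
    docs=["Alice works at Acme. Acme is in Paris"],
    triplets=[("Alice", "employer", "Acme")],
    gm="conv-graph-model", em=len, query="Alice", T=3)
print(result["adapter"]["updates"])  # 3
```

The point of the sketch is the data flow: the knowledge graph and sentence-level knowledge base are built once, retrieval conditions on the query, and the concatenated hybrid prompt drives T rounds of adapter updates.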
Open Source Code No The paper does not contain an explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets Yes Finally, extensive experiments are conducted on two open benchmark datasets, DocRED (Yao et al., 2019) and Re-DocRED (Tan et al., 2022a), to further evaluate the performance of our proposed framework.
Dataset Splits No We evaluate the performance of our proposed DocKS-RAG on two widely used benchmark datasets: DocRED (Yao et al., 2019) and Re-DocRED (Tan et al., 2022a). We refer the reader to Appendix A.1 for a more detailed description of both datasets. The paper mentions "development (Dev) and test (Test) results" but provides neither specific percentages or sample counts for these splits, nor an explicit statement of how the splits were derived from the benchmark datasets.
Hardware Specification Yes All experiments conducted in this paper are implemented by PyTorch, and are trained on four 24GB RTX 3090 GPUs.
Software Dependencies No All experiments conducted in this paper are implemented by PyTorch, and are trained on four 24GB RTX 3090 GPUs. We harness BGE as the embedding model for sentence representation. In the semantic retrieval component, the top-5 relevant sentences are retrieved for the user query, which are further transferred into informative prompts with a maximum length of 512 tokens. We adopt LLaMA3-8B as the backbone LLM. LoRA is chosen as the adapter training method to fine-tune LLaMA3-8B. The paper mentions several software components (PyTorch, BGE, LLaMA3-8B, LoRA) but does not provide specific version numbers for any of them.
Experiment Setup Yes Document-Level Knowledge Graph Construction and Retrieval. ... both the entity and relation embedding dimensions are 128, and the network is initialized with 8 convolutional layers... Dropout is set at a rate of 0.1 in the convolutional layers... Adam is employed as the optimizer. The learning rate is set to 1e-3, and weight decay is 5e-4. ... We establish a minimum threshold of 0.7 for measuring the similarity... We transfer the relevant subgraphs to informative prompts with a maximum length of 128 tokens. Sentence-Level Knowledge Base Construction and Semantic Retrieval. ... The top-5 relevant sentences are retrieved for the user query, which are further transferred into informative prompts with a maximum length of 512 tokens. Hybrid-Prompt Tuning. We adopt LLaMA3-8B as the backbone LLM. LoRA is chosen as the adapter training method to fine-tune LLaMA3-8B. The low-rank setting of the adapter is 16, along with a warmup ratio of 0.1. We set the learning rate to 5e-5, and the batch size is 2 for both the training and evaluation stages. We utilize a learning rate scheduler with cosine variations for training, and we conduct fine-tuning for 3 epochs.
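The hyperparameters scattered through the experiment-setup quote can be consolidated into one structure. The dictionary below is a hypothetical reconstruction for reference (the grouping and key names are ours, not the authors' config file); every value is taken verbatim from the quoted text.

```python
# Hypothetical consolidation of the reported DocKS-RAG hyperparameters.
# Key names are illustrative; all values come from the paper's setup description.
config = {
    "graph_module": {                     # document-level KG construction/retrieval
        "entity_dim": 128,
        "relation_dim": 128,
        "conv_layers": 8,
        "dropout": 0.1,
        "optimizer": "Adam",
        "lr": 1e-3,
        "weight_decay": 5e-4,
        "similarity_threshold": 0.7,
        "graph_prompt_max_tokens": 128,
    },
    "semantic_retrieval": {               # sentence-level KB + retrieval
        "embedding_model": "BGE",
        "top_k": 5,
        "sentence_prompt_max_tokens": 512,
    },
    "llm_tuning": {                       # hybrid-prompt tuning
        "backbone": "LLaMA3-8B",
        "adapter": "LoRA",
        "lora_rank": 16,
        "warmup_ratio": 0.1,
        "lr": 5e-5,
        "batch_size": 2,
        "lr_scheduler": "cosine",
        "epochs": 3,
    },
}
```

Collecting the values this way makes it easy to spot what a reproduction would still have to guess: library versions, LoRA alpha/dropout, and the random seed are all unreported.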