Enhancing Relation Extraction via Supervised Rationale Verification and Feedback

Authors: Yongqi Li, Xin Miao, Shen Zhou, Mayi Xu, Yuyang Ren, Tieyun Qian

AAAI 2025

Reproducibility assessment (Variable / Result / LLM Response):
Research Type: Experimental. LLM Response: "Extensive experiments prove that our proposed framework significantly outperforms existing methods. Extensive experiments demonstrate the superiority of our framework over existing methods. Table 1 reports the experimental results with various initial demonstration selection strategies on Llama-2-7b-chat on the SemEval, TACRED, and Re-TACRED datasets."
Researcher Affiliation: Collaboration. LLM Response: "Yongqi Li1, Xin Miao1, Shen Zhou1, Mayi Xu1, Yuyang Ren1,3, Tieyun Qian1,2*. 1School of Computer Science, Wuhan University, China; 2Intellectual Computing Laboratory for Cultural Heritage, Wuhan University, China; 3Research Institute of Nuclear Power Operation, China. EMAIL, EMAIL, EMAIL"
Pseudocode: No. LLM Response: "The paper describes the proposed method in narrative text and uses figures to illustrate the framework and causal models, but does not include any clearly labeled pseudocode or algorithm blocks."
Open Source Code: Yes. LLM Response: "Code: https://github.com/NLPGM/SRVF"
Open Datasets: Yes. LLM Response: "We adopt three commonly used datasets for RE, including SemEval (Hendrickx et al. 2010), TACRED (Zhang et al. 2017), and Re-TACRED (Stoica, Platanios, and Póczos 2021)." The paper also uses DocRED (Yao et al. 2019) and Re-DocRED (Tan et al. 2022).
Dataset Splits: Yes. LLM Response: "Hence we adopt the k-shot (k ∈ {5, 10, 20, 50}) settings to validate the effectiveness of the proposed method."
Hardware Specification: No. LLM Response: "The paper evaluates its method using various LLMs (Llama-2-7b-chat, Llama-2-70b-chat, Meta-Llama-3-8B-Instruct, Meta-Llama-3-70B-Instruct, GPT-3.5-turbo), but does not specify the underlying hardware (e.g., GPU models, CPU types) used for running the experiments."
Software Dependencies: No. LLM Response: "The paper does not explicitly state specific software dependencies or their version numbers, such as the programming languages, libraries, or frameworks used for implementation."
Experiment Setup: Yes. LLM Response: "For Self-Consistency, GRACE, and ours, the number of iterations or candidate responses is set to 5 for fairness. For Self-Refine, the iteration number is set to 1 since we find that more iteration rounds result in performance degradation. Here we adopt the dot product as the similarity function sim(·) and add a temperature hyper-parameter τ to focus more on difficult pairs (Chen et al. 2020)."
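The temperature-scaled dot-product similarity quoted above can be sketched as follows. This is a generic illustration in the style of Chen et al. (2020), not the authors' released implementation; the function name, the default τ, and the softmax weighting are assumptions added for clarity.

```python
import numpy as np

def scaled_similarities(anchor, candidates, tau=0.1):
    """Dot-product similarity between an anchor vector and each
    candidate, divided by a temperature tau.

    A small tau sharpens the softmax over pairs, so difficult
    (high-similarity) pairs dominate the resulting weights.
    Illustrative sketch only; names and tau are not from the paper.
    """
    sims = candidates @ anchor / tau          # shape: (n_candidates,)
    # Softmax (with max-subtraction for numerical stability) turns
    # the scaled similarities into per-pair weights.
    weights = np.exp(sims - sims.max())
    return sims, weights / weights.sum()
```

With a small τ, the weight mass concentrates on the candidate most similar to the anchor, which is the "focus more on difficult pairs" effect the quoted setup describes.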