What Is a Good Question? Assessing Question Quality via Meta-Fact Checking

Authors: Bo Zhang, Jianghua Zhu, Chaozhuo Li, Hao Yu, Li Kong, Zhan Wang, Dezhuang Miao, Xiaoming Zhang, Junsheng Zhou

AAAI 2025

Each entry below gives a reproducibility variable, the assessed result, and the supporting LLM response.
Research Type: Experimental. Experiments across multiple datasets and LLMs demonstrate that MFC significantly improves the accuracy and efficiency of both question answering and question quality assessment. This research marks a pioneering effort to automate the evaluation of question quality based on cognitive capabilities. The paper includes sections such as 'Experiments', 'Multi-hop Reasoning Results', 'Question Quality Assessment Results', and 'Ablation Study', which evaluate performance on datasets and compare against baselines.
Researcher Affiliation: Academia. (1) School of Computer and Electronic Information / School of Artificial Intelligence, Nanjing Normal University, China; (2) Key Laboratory of Trustworthy Distributed Computing and Service (MoE), Beijing University of Posts and Telecommunications, China; (3) School of Cyber Science and Technology, Beihang University, China.
Pseudocode: No. The paper describes the Meta-Fact Checking (MFC) methodology through textual descriptions and diagrams (Figure 2), but provides no explicit pseudocode or algorithm blocks.
Open Source Code: Yes. The code and data are available at https://github.com/gregbuaa/qqa-mfc.
Open Datasets: Yes. To rigorously evaluate the multi-hop reasoning capabilities of MFC, its performance is evaluated on four extensively utilized Knowledge Base Question Answering (KBQA) datasets: WebQSP (Yih et al. 2016), CWQ (Talmor and Berant 2018), SimpleQuestions (SimQu) (Bordes et al. 2015), and GrailQA (Gu et al. 2021b). The authors also developed a question quality dataset named CWQ-QQA, derived from CWQ... The code and data are available at https://github.com/gregbuaa/qqa-mfc. Freebase (Bollacker et al. 2008) is used as the KG.
Dataset Splits: Yes. The CWQ-QQA dataset was divided, with 40% allocated for fine-tuning the Llama2-7B model and the remaining 60% used for evaluating both the Llama2-7B model and MFC.
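The reported 40%/60% division can be sketched as a simple shuffled split. This is an illustrative reconstruction, not the authors' code: the function name `split_cwq_qqa`, the seed, and the use of `random.shuffle` are assumptions; only the 40/60 proportions come from the paper.

```python
import random

def split_cwq_qqa(examples, finetune_frac=0.4, seed=0):
    """Split a dataset into a fine-tuning portion and an evaluation
    portion, mirroring the 40%/60% CWQ-QQA split the paper describes."""
    rng = random.Random(seed)          # fixed seed for a repeatable split
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * finetune_frac)
    return shuffled[:cut], shuffled[cut:]

# Example: a toy list of 10 items yields 4 fine-tuning and 6 evaluation examples.
finetune, evaluation = split_cwq_qqa(list(range(10)))
```

Fixing the seed makes the split reproducible, which matters here because the same 60% evaluation slice must be reused for both the fine-tuned Llama2-7B model and MFC.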
Hardware Specification: No. The paper mentions using GPT-3.5-turbo and GPT-4-turbo models from OpenAI and fine-tuning a Llama2-7B model, but it does not specify the hardware used to run the experiments (e.g., GPU/CPU models, memory, or processing power).
Software Dependencies: No. The paper refers to using GPT-3.5-turbo and GPT-4-turbo models from OpenAI and fine-tuning a Llama2-7B model, and it reports setting the temperature to 0.2 and the maximum generated text length to 512, but it does not specify programming languages, libraries, or other software dependencies with version numbers.
Experiment Setup: Yes. To guarantee the reproducibility of the experiments, the sampling temperature is set to 0.2 and the maximum length of the generated text is set to 512. The parameter Dmax is defined as |E0| × Depth, where |E0| denotes the number of topic entities. In all experiments, the default value for Depth is set to 3, taking into account that the interaction with KGs is proportional to |E0|. The number of relevant relations Top-K is set to 3. The CWQ-QQA dataset was divided, with 40% allocated for fine-tuning the Llama2-7B model and the remaining 60% used for evaluating both the Llama2-7B model and MFC.
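The reported hyperparameters can be collected into a single configuration object, with Dmax derived from the number of topic entities. This is a minimal sketch: the class name `MFCConfig` and field names are hypothetical, not from the released code; only the values (0.2, 512, Depth = 3, Top-K = 3, Dmax = |E0| × Depth) come from the paper.

```python
from dataclasses import dataclass

@dataclass
class MFCConfig:
    """Hypothetical container for the hyperparameters the paper reports."""
    temperature: float = 0.2   # LLM sampling temperature
    max_tokens: int = 512      # maximum length of generated text
    depth: int = 3             # default reasoning Depth
    top_k: int = 3             # number of relevant relations kept per hop

    def d_max(self, num_topic_entities: int) -> int:
        # Dmax = |E0| * Depth: the KG-interaction budget scales with
        # the number of topic entities |E0|.
        return num_topic_entities * self.depth

cfg = MFCConfig()
# With two topic entities, Dmax = 2 * 3 = 6.
budget = cfg.d_max(2)
```

Grouping the settings this way makes the reproducibility claim concrete: every value that affects generation or KG exploration is pinned in one place.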