What Is a Good Question? Assessing Question Quality via Meta-Fact Checking

Authors: Bo Zhang, Jianghua Zhu, Chaozhuo Li, Hao Yu, Li Kong, Zhan Wang, Dezhuang Miao, Xiaoming Zhang, Junsheng Zhou

AAAI 2025

Each entry below gives a reproducibility variable, the assessed result, and the supporting LLM response.
Research Type: Experimental. Experiments across multiple datasets and LLMs demonstrate that MFC significantly improves the accuracy and efficiency of both question answering and question quality assessment. This research marks a pioneering effort to automate the evaluation of question quality based on cognitive capabilities. The paper includes sections such as 'Experiments', 'Multi-hop Reasoning Results', 'Question Quality Assessment Results', and 'Ablation Study', which evaluate performance on datasets and compare against baselines.
Researcher Affiliation: Academia. (1) School of Computer and Electronic Information / School of Artificial Intelligence, Nanjing Normal University, China; (2) Key Laboratory of Trustworthy Distributed Computing and Service (MoE), Beijing University of Posts and Telecommunications, China; (3) School of Cyber Science and Technology, Beihang University, China.
Pseudocode: No. The paper describes the Meta-Fact Checking (MFC) methodology through textual descriptions and diagrams (Figure 2), but provides no explicit pseudocode or algorithm blocks.
Open Source Code: Yes. The code and data are available at https://github.com/gregbuaa/qqa-mfc.
Open Datasets: Yes. To rigorously evaluate the multi-hop reasoning capabilities of MFC, its performance is evaluated on four extensively utilized Knowledge Base Question Answering (KBQA) datasets: WebQSP (Yih et al. 2016), CWQ (Talmor and Berant 2018), SimpleQuestions (SimQu) (Bordes et al. 2015), and GrailQA (Gu et al. 2021b). The authors also developed a question quality dataset named CWQ-QQA, derived from CWQ... The code and data are available at https://github.com/gregbuaa/qqa-mfc. Freebase (Bollacker et al. 2008) is used as the KG.
Dataset Splits: Yes. The CWQ-QQA dataset was divided, with 40% allocated for fine-tuning the Llama2-7B model and the remaining 60% used for evaluating both the Llama2-7B model and MFC.
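The reported 40%/60% division can be sketched as a simple shuffled split. This is an illustrative reconstruction, not the authors' code: the function name `split_cwq_qqa`, the seed, and the use of `random.shuffle` are assumptions; only the 40/60 proportions come from the paper.

```python
import random

def split_cwq_qqa(examples, finetune_frac=0.4, seed=0):
    """Split a dataset into a fine-tuning portion and an evaluation
    portion, mirroring the 40%/60% CWQ-QQA split the paper describes."""
    rng = random.Random(seed)          # fixed seed for a repeatable split
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * finetune_frac)
    return shuffled[:cut], shuffled[cut:]

# Example: a toy list of 10 items yields 4 fine-tuning and 6 evaluation examples.
finetune, evaluation = split_cwq_qqa(list(range(10)))
```

Fixing the seed makes the split reproducible, which matters here because the same 60% evaluation slice must be reused for both the fine-tuned Llama2-7B model and MFC.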
Hardware Specification: No. The paper mentions using GPT-3.5-turbo and GPT-4-turbo models from OpenAI and fine-tuning a Llama2-7B model, but it does not specify the hardware used to run the experiments (e.g., GPU/CPU models, memory, or processing power).
Software Dependencies: No. The paper refers to using GPT-3.5-turbo and GPT-4-turbo models from OpenAI and fine-tuning a Llama2-7B model, and it reports setting the temperature to 0.2 and the maximum generated text length to 512, but it does not specify programming languages, libraries, or other software dependencies with version numbers.
Experiment Setup: Yes. To guarantee the reproducibility of the experiments, the sampling temperature is set to 0.2 and the maximum length of the generated text is set to 512. The parameter Dmax is defined as |E0| × Depth, where |E0| denotes the number of topic entities. In all experiments, the default value for Depth is set to 3, taking into account that the interaction with KGs is proportional to |E0|. The number of relevant relations Top-K is set to 3. The CWQ-QQA dataset was divided, with 40% allocated for fine-tuning the Llama2-7B model and the remaining 60% used for evaluating both the Llama2-7B model and MFC.
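The reported hyperparameters can be collected into a single configuration object, with Dmax derived from the number of topic entities. This is a minimal sketch: the class name `MFCConfig` and field names are hypothetical, not from the released code; only the values (0.2, 512, Depth = 3, Top-K = 3, Dmax = |E0| × Depth) come from the paper.

```python
from dataclasses import dataclass

@dataclass
class MFCConfig:
    """Hypothetical container for the hyperparameters the paper reports."""
    temperature: float = 0.2   # LLM sampling temperature
    max_tokens: int = 512      # maximum length of generated text
    depth: int = 3             # default reasoning Depth
    top_k: int = 3             # number of relevant relations kept per hop

    def d_max(self, num_topic_entities: int) -> int:
        # Dmax = |E0| * Depth: the KG-interaction budget scales with
        # the number of topic entities |E0|.
        return num_topic_entities * self.depth

cfg = MFCConfig()
# With two topic entities, Dmax = 2 * 3 = 6.
budget = cfg.d_max(2)
```

Grouping the settings this way makes the reproducibility claim concrete: every value that affects generation or KG exploration is pinned in one place.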