What Is a Good Question? Assessing Question Quality via Meta-Fact Checking
Authors: Bo Zhang, Jianghua Zhu, Chaozhuo Li, Hao Yu, Li Kong, Zhan Wang, Dezhuang Miao, Xiaoming Zhang, Junsheng Zhou
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across multiple datasets and LLMs demonstrate that MFC significantly improves the accuracy and efficiency of both question answering and assessing. This research marks a pioneering effort to automate the evaluation of question quality based on cognitive capabilities. The paper includes sections such as 'Experiments', 'Multi-hop Reasoning Results', 'Question Quality Assessment Results', and 'Ablation Study' which involve evaluating performance on datasets and comparing with baselines. |
| Researcher Affiliation | Academia | 1School of Computer and Electronic Information/School of Artificial Intelligence, Nanjing Normal University, China 2Key Laboratory of Trustworthy Distributed Computing and Service (MoE), Beijing University of Posts and Telecommunications, China 3School of Cyber Science and Technology, Beihang University, China |
| Pseudocode | No | The paper describes the Meta-Fact Checking (MFC) methodology through textual descriptions and diagrams (Figure 2), but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | The code and data are available at https://github.com/gregbuaa/qqa-mfc. |
| Open Datasets | Yes | To rigorously evaluate the multi-hop reasoning capabilities of MFC, we evaluate its performance on four extensively utilized Knowledge Base Question Answering (KBQA) datasets: WebQSP (Yih et al. 2016), CWQ (Talmor and Berant 2018), SimpleQuestions (SimQu) (Bordes et al. 2015) and GrailQA (Gu et al. 2021b). We have developed a question quality dataset named CWQ-QQA, derived from CWQ... The code and data are available at https://github.com/gregbuaa/qqa-mfc. Freebase (Bollacker et al. 2008) is used as the KG. |
| Dataset Splits | Yes | The CWQ-QQA dataset was divided, with 40% allocated for fine-tuning the Llama2-7B model and the remaining 60% used for evaluation of both the Llama2-7B model and MFC. |
| Hardware Specification | No | The paper mentions using GPT-3.5-turbo and GPT-4-turbo models from OpenAI, and fine-tuning a Llama2-7B model. However, it does not specify the hardware used by the authors to run their experiments (e.g., GPU/CPU models, memory, or processing power). |
| Software Dependencies | No | The paper refers to using GPT-3.5-turbo and GPT-4-turbo models from OpenAI, and fine-tuning a Llama2-7B model. It also mentions setting the temperature to 0.2 and the maximum generated text length to 512. However, it does not specify programming languages, libraries, or other software dependencies with version numbers. |
| Experiment Setup | Yes | To guarantee the reproducibility of the experiments, the temperature of the sampling is set to 0.2, and the maximum length of the generated text is set to 512. The parameter Dmax is defined as |E0| × Depth, where |E0| represents the number of topic entities. In all experiments, the default value for Depth is set to 3, taking into account that the interaction with KGs is proportional to |E0|. The number of relevant relations Top-K is set to 3. The CWQ-QQA dataset was divided, with 40% allocated for fine-tuning the Llama2-7B model and the remaining 60% used for evaluation of both the Llama2-7B model and MFC. |
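The experiment-setup parameters above (temperature 0.2, max length 512, Dmax = |E0| × Depth with Depth = 3, Top-K = 3, and a 40/60 fine-tune/evaluation split of CWQ-QQA) can be collected into a minimal configuration sketch. All names here (`ExperimentConfig`, `split_cwq_qqa`, the seed) are hypothetical illustrations, not the authors' code; their actual implementation is in the linked repository (https://github.com/gregbuaa/qqa-mfc).

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentConfig:
    # Hyperparameters reported in the paper's experiment setup.
    temperature: float = 0.2   # sampling temperature for GPT-3.5/4-turbo
    max_tokens: int = 512      # maximum length of generated text
    depth: int = 3             # default exploration depth
    top_k: int = 3             # number of relevant relations retained

    def d_max(self, num_topic_entities: int) -> int:
        # Dmax = |E0| * Depth, since KG interaction scales with |E0|.
        return num_topic_entities * self.depth


def split_cwq_qqa(examples, seed=42):
    """Split CWQ-QQA: 40% for fine-tuning Llama2-7B, 60% for evaluation.

    The seed and shuffling strategy are assumptions for illustration.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(0.4 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


cfg = ExperimentConfig()
finetune_set, eval_set = split_cwq_qqa(range(100))
print(cfg.d_max(2), len(finetune_set), len(eval_set))  # 6 40 60
```

With two topic entities and the default depth of 3, Dmax is 6, and a 100-example dataset yields 40 fine-tuning and 60 evaluation examples, matching the reported split.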