Explore What LLM Does Not Know in Complex Question Answering

Authors: Xin Lin, Zhenya Huang, Zhiqiang Zhang, Jun Zhou, Enhong Chen

AAAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on four widely-used QA datasets, and the results demonstrate the effectiveness of the proposed method.
Researcher Affiliation | Academia | (1) School of Computer Science and Technology, University of Science and Technology of China, Hefei, China; (2) State Key Laboratory of Cognitive Intelligence, Hefei, China; (3) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China; (4) Zhejiang University, Hangzhou, China; (5) Independent Researcher. EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Consistency-based assessment; Algorithm 2: KEQA inference
Open Source Code | Yes | Our codes are available at https://github.com/l-xin/KEQA.
Open Datasets | Yes | We use four benchmarks for QA covering both one-hop and multi-hop QA tasks: Natural Questions (NQ) (Kwiatkowski et al. 2019) for one-hop QA, and StrategyQA (Geva et al. 2021), HotpotQA (Yang et al. 2018) and 2WikiMultihopQA (2WMQA) (Ho et al. 2020) for multi-hop QA.
Dataset Splits | Yes | We use the train split of StrategyQA and the dev splits of the other datasets, and sample 500 instances from each dataset to reduce the cost of running experiments, following previous work (Trivedi et al. 2023; Jiang et al. 2023).
Hardware Specification | Yes | We run all experiments on a Linux server with two 2.20 GHz Intel Xeon E5-2650 CPUs and an NVIDIA A100 GPU.
Software Dependencies | No | We use gpt-3.5-turbo as the LLM L, and the BM25 algorithm implemented in Elasticsearch as the retriever R following (Jiang et al. 2023; Trivedi et al. 2023). We use the Wikipedia dump from Dec 20, 2018 in (Karpukhin et al. 2020) as the knowledge source K following (Jiang et al. 2023; Asai et al. 2024). For both the semantic discriminator Ds and the utility discriminator Du, we adopt Llama-2-7b-chat-hf. The reference retriever Ru is implemented with BERT and FAISS. Explanation: While specific models like 'Llama-2-7b-chat-hf' are named, other key components such as 'Elasticsearch', 'gpt-3.5-turbo', 'FAISS', and 'BERT' are mentioned without explicit version numbers, which are required for reproducibility.
Experiment Setup | Yes | In QKE, we set Nc and α for consistency to 5 and 0.8. In UKP, we retrieve the top-10 candidate knowledge items from K before knowledge picking, and the top-8 demonstrations from the reference set R.
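The consistency-based assessment (Algorithm 1) is not reproduced on this page, but its stated hyperparameters (Nc = 5 samples, agreement threshold α = 0.8) suggest the following minimal sketch. The function name, the majority-vote agreement rule, and the `sample_fn` interface standing in for an LLM sampling call are all assumptions, not the paper's exact procedure:

```python
from collections import Counter

def consistency_assessment(sample_fn, n_c=5, alpha=0.8):
    """Sample n_c answers and treat the question as 'known' to the LLM
    when the most common answer reaches agreement ratio alpha.

    sample_fn: zero-argument callable returning one answer string per
    call (hypothetical stand-in for sampling the LLM with temperature > 0).
    """
    answers = [sample_fn() for _ in range(n_c)]
    top_answer, count = Counter(answers).most_common(1)[0]
    consistency = count / n_c
    return top_answer, consistency, consistency >= alpha

# Toy usage: 4 of the 5 sampled answers agree, so consistency = 0.8 >= alpha
samples = iter(["Paris", "Paris", "Paris", "Lyon", "Paris"])
answer, score, known = consistency_assessment(lambda: next(samples))
```

Under this sketch, a question whose sampled answers disagree (consistency below α) would be flagged as unknown knowledge and routed to retrieval.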
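The paper uses Elasticsearch's built-in BM25 as the retriever R. For readers without an Elasticsearch instance, a pure-Python Okapi BM25 scorer can illustrate the top-k retrieval step; the parameter values k1 = 1.2 and b = 0.75 are Elasticsearch's defaults, not values confirmed by the paper, and the whitespace tokenization is a simplification:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each whitespace-tokenized document against the query
    with Okapi BM25 (higher score = better match)."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(tokenized)
    df = Counter()  # document frequency of each term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = ["the capital of france is paris",
        "paris is in france",
        "berlin is the capital of germany"]
scores = bm25_scores("capital of france", docs)
best = max(range(len(docs)), key=scores.__getitem__)  # index of top document
```

In the paper's setup, the top-10 such candidates from the knowledge source K would then be passed to the knowledge-picking stage rather than used directly.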