Explore What LLM Does Not Know in Complex Question Answering
Authors: Xin Lin, Zhenya Huang, Zhiqiang Zhang, Jun Zhou, Enhong Chen
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on four widely-used QA datasets, and the results demonstrate the effectiveness of the proposed method. |
| Researcher Affiliation | Academia | 1School of Computer Science and Technology, University of Science and Technology of China, Hefei, China 2State Key Laboratory of Cognitive Intelligence, Hefei, China 3Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China 4Zhejiang University, Hangzhou, China 5Independent Researcher EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Consistency-based assessment Algorithm 2: KEQA inference |
| Open Source Code | Yes | Our codes are available at https://github.com/l-xin/KEQA. |
| Open Datasets | Yes | We use four benchmarks for QA including both one-hop and multi-hop QA tasks. We use the Natural Questions (NQ) (Kwiatkowski et al. 2019) for one-hop QA, and StrategyQA (Geva et al. 2021), HotpotQA (Yang et al. 2018) and 2WikiMultihopQA (2WMQA) (Ho et al. 2020) for multi-hop QA. |
| Dataset Splits | Yes | We use the train data of Strategy QA and dev data of other datasets, and sample 500 instances for each dataset to reduce the costs of running experiments following previous work (Trivedi et al. 2023; Jiang et al. 2023). |
| Hardware Specification | Yes | We run all experiments on a Linux server with two 2.20 GHz Intel Xeon E5-2650 CPUs and an NVIDIA A100 GPU. |
| Software Dependencies | No | We use gpt-3.5-turbo as the LLM L, and the BM25 algorithm implemented in Elasticsearch as the retriever R following (Jiang et al. 2023; Trivedi et al. 2023). We use the Wikipedia dump from Dec 20, 2018 in (Karpukhin et al. 2020) as the knowledge source K following (Jiang et al. 2023; Asai et al. 2024). For the semantic and utility discriminators Ds and Du, we both adopt Llama-2-7b-chat-hf. The reference retriever Ru is implemented with BERT and FAISS. Explanation: While specific models like 'Llama-2-7b-chat-hf' are mentioned, other key components like 'Elasticsearch', 'gpt-3.5-turbo', 'FAISS', and 'BERT' are mentioned without explicit version numbers, which is required for reproducibility. |
| Experiment Setup | Yes | In QKE, we set Nc and α for consistency to 5 and 0.8. In UKP, we retrieve top-10 candidate knowledge from K before knowledge picking, and top-8 demonstrations from the reference retriever Ru. |
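The consistency-based assessment referenced above (Algorithm 1, with the paper's setting Nc = 5 samples and threshold α = 0.8) can be sketched as a simple majority-vote check over sampled answers. This is a minimal illustration of the general technique, not the authors' exact implementation; the function name and the example samples are hypothetical.

```python
from collections import Counter

def assess_consistency(answers, alpha=0.8):
    """Majority-vote consistency over Nc sampled LLM answers.

    answers: list of answers sampled from the LLM for one question.
    Returns (majority_answer, score, is_known), where score is the
    fraction of samples that agree with the majority answer and
    is_known is True when score >= alpha.
    """
    counts = Counter(answers)
    majority, freq = counts.most_common(1)[0]
    score = freq / len(answers)
    return majority, score, score >= alpha

# Nc = 5 samples for one question (hypothetical example answers)
samples = ["Paris", "Paris", "Paris", "Paris", "Lyon"]
ans, score, known = assess_consistency(samples, alpha=0.8)
# score = 4/5 = 0.8 >= alpha, so the question counts as "known"
```

Under this reading, a question whose sampled answers agree at least 80% of the time is treated as knowledge the LLM already has, and only the remaining questions trigger retrieval.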