Explore What LLM Does Not Know in Complex Question Answering

Authors: Xin Lin, Zhenya Huang, Zhiqiang Zhang, Jun Zhou, Enhong Chen

AAAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on four widely-used QA datasets, and the results demonstrate the effectiveness of the proposed method.
Researcher Affiliation | Academia | (1) School of Computer Science and Technology, University of Science and Technology of China, Hefei, China; (2) State Key Laboratory of Cognitive Intelligence, Hefei, China; (3) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China; (4) Zhejiang University, Hangzhou, China; (5) Independent Researcher. EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Consistency-based assessment; Algorithm 2: KEQA inference
Open Source Code | Yes | Our codes are available at https://github.com/l-xin/KEQA.
Open Datasets | Yes | We use four benchmarks for QA covering both one-hop and multi-hop QA tasks: Natural Questions (NQ) (Kwiatkowski et al. 2019) for one-hop QA, and StrategyQA (Geva et al. 2021), HotpotQA (Yang et al. 2018) and 2WikiMultihopQA (2WMQA) (Ho et al. 2020) for multi-hop QA.
Dataset Splits | Yes | We use the train split of StrategyQA and the dev splits of the other datasets, and sample 500 instances from each dataset to reduce the cost of running experiments, following previous work (Trivedi et al. 2023; Jiang et al. 2023).
Hardware Specification | Yes | We run all experiments on a Linux server with two 2.20 GHz Intel Xeon E5-2650 CPUs and an NVIDIA A100 GPU.
Software Dependencies | No | We use gpt-3.5-turbo as the LLM L, and the BM25 algorithm implemented in Elasticsearch as the retriever R following (Jiang et al. 2023; Trivedi et al. 2023). We use the Wikipedia dump from Dec 20, 2018 in (Karpukhin et al. 2020) as the knowledge source K following (Jiang et al. 2023; Asai et al. 2024). For both the semantic discriminator Ds and the utility discriminator Du, we adopt Llama-2-7b-chat-hf. The reference retriever Ru is implemented with BERT and FAISS. Explanation: While specific models like 'Llama-2-7b-chat-hf' are named, other key components such as 'Elasticsearch', 'gpt-3.5-turbo', 'FAISS', and 'BERT' are mentioned without explicit version numbers, which are required for reproducibility.
Experiment Setup | Yes | In QKE, we set Nc and α for consistency to 5 and 0.8. In UKP, we retrieve the top-10 candidate knowledge items from K before knowledge picking, and the top-8 demonstrations from the reference set R.
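The consistency-based assessment (Algorithm 1) is not reproduced on this page, but its stated hyperparameters (Nc = 5 samples, agreement threshold α = 0.8) suggest the following minimal sketch. The function name, the majority-vote agreement rule, and the `sample_fn` interface standing in for an LLM sampling call are all assumptions, not the paper's exact procedure:

```python
from collections import Counter

def consistency_assessment(sample_fn, n_c=5, alpha=0.8):
    """Sample n_c answers and treat the question as 'known' to the LLM
    when the most common answer reaches agreement ratio alpha.

    sample_fn: zero-argument callable returning one answer string per
    call (hypothetical stand-in for sampling the LLM with temperature > 0).
    """
    answers = [sample_fn() for _ in range(n_c)]
    top_answer, count = Counter(answers).most_common(1)[0]
    consistency = count / n_c
    return top_answer, consistency, consistency >= alpha

# Toy usage: 4 of the 5 sampled answers agree, so consistency = 0.8 >= alpha
samples = iter(["Paris", "Paris", "Paris", "Lyon", "Paris"])
answer, score, known = consistency_assessment(lambda: next(samples))
```

Under this sketch, a question whose sampled answers disagree (consistency below α) would be flagged as unknown knowledge and routed to retrieval.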
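The paper uses Elasticsearch's built-in BM25 as the retriever R. For readers without an Elasticsearch instance, a pure-Python Okapi BM25 scorer can illustrate the top-k retrieval step; the parameter values k1 = 1.2 and b = 0.75 are Elasticsearch's defaults, not values confirmed by the paper, and the whitespace tokenization is a simplification:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each whitespace-tokenized document against the query
    with Okapi BM25 (higher score = better match)."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(tokenized)
    df = Counter()  # document frequency of each term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = ["the capital of france is paris",
        "paris is in france",
        "berlin is the capital of germany"]
scores = bm25_scores("capital of france", docs)
best = max(range(len(docs)), key=scores.__getitem__)  # index of top document
```

In the paper's setup, the top-10 such candidates from the knowledge source K would then be passed to the knowledge-picking stage rather than used directly.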