Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models
Authors: Zhenyu Pan, Haozheng Luo, Manling Li, Han Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we exploit both public benchmarks and a Web3 case study to demonstrate the capability of CoA over other methods. ... In this section, we compare the performance of our Chain-of-Action framework with state-of-the-art baselines across public benchmarks. Subsequently, we provide a detailed analysis of our launched case study: a Question Answering (QA) application in the Web3 domain. |
| Researcher Affiliation | Academia | Zhenyu Pan, Haozheng Luo, Manling Li, Han Liu. Department of Computer Science, Northwestern University, Evanston, IL 60208, USA; Department of Statistics and Data Science, Northwestern University, Evanston, IL 60208, USA |
| Pseudocode | Yes | B ALGORITHMS Algorithm 1 Description of Actions Workflow |
| Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code or a link to a code repository. |
| Open Datasets | Yes | We select 4 classic, 1 long-form, and 1 open-domain QA task. Four classic QA tasks that include web-based QA (WQA) [2], general QA (DATE, General Knowledge, Social QA (SoQA)), TruthfulQA [24], StrategyQA (SQA) [6], and Fact Checking (FEVER [26]). Long-form QA task is the first long-form QA dataset focusing on ambiguous factoid questions, ASQA [25]. Open-domain QA task is QReCC [1], testing the ability to handle context-dependent queries across different domains. |
| Dataset Splits | No | The paper references public benchmarks (WQA [2], TruthfulQA [24], StrategyQA (SQA) [6], FEVER [26], ASQA [25], QReCC [1]), but does not explicitly provide the specific training/test/validation split percentages or methodology within its text. |
| Hardware Specification | Yes | All experiments are carried out on a cluster, with the exception of the distributed compute node experiment. Each node within the cluster is equipped with 1 NVIDIA GeForce RTX 2080 Ti GPU and 6 8-core Intel Xeon Silver 4214 processors running at 2.20GHz. The combined RAM capacity across the cluster nodes amounts to 755GB, and the operating system employed is Ubuntu 18.04. |
| Software Dependencies | No | The paper mentions using 'gpt-3.5-turbo' and 'GPT-4' as models and 'LangChain' for the ReAct implementation, but does not provide specific version numbers for these or other software dependencies such as Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | Below, we provide a list of all the hyperparameters used in our experiments. Table 8 (Hyperparameters used in the task): temperature = 0.0, max_length = 1000, top_p = 1.0, n_clusters = 5, retrieval_number = 3, seed = 1. |
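The hyperparameters reported under Experiment Setup (Table 8 of the paper) can be captured as a single configuration dictionary. The sketch below is a hypothetical illustration, not the authors' code: the key names and values come from the report, while the dictionary structure, the `COA_HYPERPARAMS` name, and the `as_generation_kwargs` helper are assumptions introduced here for clarity.

```python
# Hypothetical configuration mirroring the paper's Table 8.
# Key names and values are from the report; everything else
# (variable names, helper function) is illustrative.
COA_HYPERPARAMS = {
    "temperature": 0.0,     # deterministic decoding
    "max_length": 1000,     # cap on generated tokens
    "top_p": 1.0,           # nucleus sampling effectively disabled
    "n_clusters": 5,        # clustering of retrieved evidence
    "retrieval_number": 3,  # passages retrieved per query
    "seed": 1,              # fixed seed for reproducibility
}

def as_generation_kwargs(params: dict) -> dict:
    """Separate decoding settings from retrieval settings (sketch).

    Only temperature, max_length, top_p, and seed would be passed
    to a text-generation call; n_clusters and retrieval_number
    belong to the retrieval side of the pipeline.
    """
    decoding_keys = {"temperature", "max_length", "top_p", "seed"}
    return {k: v for k, v in params.items() if k in decoding_keys}
```

Splitting the config this way makes explicit that temperature = 0.0 and seed = 1 are what pin down deterministic decoding, while n_clusters and retrieval_number only shape which evidence reaches the model.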