MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions

Authors: Jian Wu, Linyi Yang, Dongyuan Li, Yuliang Ji, Manabu Okumura, Yue Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to assess the multi-table and multi-hop understanding and reasoning abilities of the LLMs on our MMQA dataset. The results demonstrate the superiority of human performance over current SOTA LLMs, shedding light on the challenges encountered by existing models in performing multi-table tasks.
Researcher Affiliation | Academia | Jian Wu (1), Linyi Yang (2), Dongyuan Li (4), Yuliang Ji (5), Manabu Okumura (1), Yue Zhang (3). (1) Institute of Science Tokyo, (2) University College London, (3) School of Engineering, Westlake University, (4) The University of Tokyo, (5) Nanjing University of Science and Technology.
Pseudocode | Yes |
Algorithm 1: Multi-Table Retrieval
  Input: multi-hop question Q (decomposed into sub-questions q_0, ..., q_n); LLM: GPT-4-turbo
  Output: retrieved tables
  First round: γ ← γ + α(q_0, table_j^0)    ▷ only compute question-relevance scores in the 1st round
  for i = 1 to n do
      for j = 0 to M do
          for k = 0 to M do
              γ ← γ + α(q_i, table_j^i) · β(table_k^(i-1), table_j^i)    ▷ compute relevance scores
          end for
      end for
  end for
  for i in ArgSort(γ, descending=True) do
      tables ← max(γ, table_i)    ▷ select top-K relevant tables
  end for
  return tables
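The retrieval loop above can be sketched in executable form. This is a minimal illustration, not the authors' implementation: the paper scores relevance with GPT-4-turbo, whereas here `alpha` (question-table relevance) and `beta` (table-table connectivity) are placeholder callables supplied by the caller, and the final selection step is interpreted as returning the top-K tables by accumulated score.

```python
from typing import Callable, List

def retrieve_tables(
    sub_questions: List[str],
    tables: List[str],
    alpha: Callable[[str, str], float],  # question-table relevance (placeholder for an LLM scorer)
    beta: Callable[[str, str], float],   # table-table connectivity (placeholder for an LLM scorer)
    top_k: int = 2,
) -> List[int]:
    """Accumulate per-table relevance scores across reasoning hops,
    then return the indices of the top-K tables."""
    m = len(tables)
    gamma = [0.0] * m
    # First round: question-relevance scores only.
    for j in range(m):
        gamma[j] += alpha(sub_questions[0], tables[j])
    # Later rounds: question relevance weighted by connectivity to prior-round tables.
    for q in sub_questions[1:]:
        for j in range(m):
            for k in range(m):
                gamma[j] += alpha(q, tables[j]) * beta(tables[k], tables[j])
    # Rank tables by accumulated score and keep the top K.
    order = sorted(range(m), key=lambda j: gamma[j], reverse=True)
    return order[:top_k]
```

A toy word-overlap function can stand in for both scorers to exercise the loop; any real use would replace them with model-based relevance scoring.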
Open Source Code | No | The paper provides a link for the MMQA data (https://anonymous.4open.science/r/MMQA-34B1) and mentions implementation details in Appendix C and prompts in Appendix B, but does not explicitly state that the source code for their proposed methodology (MTR) is open-sourced or provide a direct link to a code repository.
Open Datasets | Yes | The whole MMQA dataset is available at https://anonymous.4open.science/r/MMQA-34B1
Dataset Splits | Yes | Specifically, we divide our MMQA benchmark into two parts: a 2-table subset (2,591 samples, average of 1,833.31 rows and 6.04 columns) and a 3-table subset (721 samples, average of 1,369.01 rows and 4.78 columns). ... We randomly select data with average table lengths of 500, 600, 700, 800, 900, and 1,000, sampling 50 samples for each type of data to evaluate the performance of LLMs under different table lengths.
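The length-controlled sampling described above can be sketched as follows. This is an assumption-laden illustration: the bucket `tolerance`, the `avg_rows` field name, and the fixed seed are all hypothetical choices, not details from the paper.

```python
import random

def sample_by_length(dataset, target_lengths=(500, 600, 700, 800, 900, 1000),
                     per_bucket=50, tolerance=50, seed=0):
    """For each target average table length, draw up to `per_bucket` samples
    whose average row count falls within `tolerance` of the target."""
    rng = random.Random(seed)
    subsets = {}
    for target in target_lengths:
        # Bucket boundaries are an assumption; the paper does not specify them.
        bucket = [ex for ex in dataset
                  if abs(ex["avg_rows"] - target) <= tolerance]
        subsets[target] = rng.sample(bucket, min(per_bucket, len(bucket)))
    return subsets
```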
Hardware Specification | Yes | For open-source models, all experiments are conducted on 8 A100 GPUs.
Software Dependencies | No | The paper mentions using specific LLMs (GPT-4-turbo, TableLlama-7B, SGPT-5.8B) and an optimizer (Adam), but does not provide specific version numbers for any software libraries or programming languages used for implementation.
Experiment Setup | Yes | We set the initial learning rate at 2e-5 and conducted training over three epochs. Optimization is performed using the Adam optimizer, with a batch size of 4 and a maximum input sequence length of 4,096.
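The reported hyperparameters can be collected into a single configuration fragment for reference; the key names are hypothetical, and the surrounding framework wiring (model, tokenizer, dataloader) would have to be supplied separately.

```python
# Hyperparameters as quoted in the paper's experiment setup.
TRAINING_CONFIG = {
    "learning_rate": 2e-5,   # initial learning rate
    "epochs": 3,             # training epochs
    "optimizer": "Adam",     # optimizer
    "batch_size": 4,         # per-step batch size
    "max_seq_length": 4096,  # maximum input sequence length
}
```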