MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions

Authors: Jian Wu, Linyi Yang, Dongyuan Li, Yuliang Ji, Manabu Okumura, Yue Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to assess the multi-table and multi-hop understanding and reasoning abilities of the LLMs on our MMQA dataset. The results demonstrate the superiority of human performance over current SOTA LLMs, shedding light on the challenges encountered by existing models in performing multi-table tasks.
Researcher Affiliation | Academia | Jian Wu (1), Linyi Yang (2), Dongyuan Li (4), Yuliang Ji (5), Manabu Okumura (1), Yue Zhang (3). (1) Institute of Science Tokyo, (2) University College London, (3) School of Engineering, Westlake University, (4) The University of Tokyo, (5) Nanjing University of Science and Technology.
Pseudocode | Yes |
Algorithm 1: Multi-Table Retrieval
  Input: multi-hop question Q (decomposed into sub-questions q_0, ..., q_n); LLM: GPT-4-turbo
  Output: retrieved tables
  First round: γ ← γ + α(q_0, table_j^0)    ▷ only compute question-relevance scores in the 1st round
  for i = 1 to n do
      for j = 0 to M do
          for k = 0 to M do
              γ ← γ + α(q_i, table_j^i) · β(table_k^(i-1), table_j^i)    ▷ compute relevance scores
          end for
      end for
  end for
  for i in ArgSort(γ, descending=True) do
      tables ← max(γ, table_i)    ▷ select top-K relevant tables
  end for
  return tables
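The retrieval loop above can be sketched in executable form. This is a minimal illustration, not the authors' implementation: the paper scores relevance with GPT-4-turbo, whereas here `alpha` (question-table relevance) and `beta` (table-table connectivity) are placeholder callables supplied by the caller, and the final selection step is interpreted as returning the top-K tables by accumulated score.

```python
from typing import Callable, List

def retrieve_tables(
    sub_questions: List[str],
    tables: List[str],
    alpha: Callable[[str, str], float],  # question-table relevance (placeholder for an LLM scorer)
    beta: Callable[[str, str], float],   # table-table connectivity (placeholder for an LLM scorer)
    top_k: int = 2,
) -> List[int]:
    """Accumulate per-table relevance scores across reasoning hops,
    then return the indices of the top-K tables."""
    m = len(tables)
    gamma = [0.0] * m
    # First round: question-relevance scores only.
    for j in range(m):
        gamma[j] += alpha(sub_questions[0], tables[j])
    # Later rounds: question relevance weighted by connectivity to prior-round tables.
    for q in sub_questions[1:]:
        for j in range(m):
            for k in range(m):
                gamma[j] += alpha(q, tables[j]) * beta(tables[k], tables[j])
    # Rank tables by accumulated score and keep the top K.
    order = sorted(range(m), key=lambda j: gamma[j], reverse=True)
    return order[:top_k]
```

A toy word-overlap function can stand in for both scorers to exercise the loop; any real use would replace them with model-based relevance scoring.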
Open Source Code | No | The paper provides a link for the MMQA data (https://anonymous.4open.science/r/MMQA-34B1) and mentions implementation details in Appendix C and prompts in Appendix B, but does not explicitly state that the source code for their proposed methodology (MTR) is open-sourced or provide a direct link to a code repository.
Open Datasets | Yes | The whole MMQA dataset is available at https://anonymous.4open.science/r/MMQA-34B1
Dataset Splits | Yes | Specifically, we divide our MMQA benchmark into two parts: a 2-table subset (2,591 samples, average of 1,833.31 rows and 6.04 columns) and a 3-table subset (721 samples, average of 1,369.01 rows and 4.78 columns). ... We randomly select data with average table lengths of 500, 600, 700, 800, 900, and 1,000, sampling 50 samples for each type of data to evaluate the performance of LLMs under different table lengths.
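The length-controlled sampling described above can be sketched as follows. This is an assumption-laden illustration: the bucket `tolerance`, the `avg_rows` field name, and the fixed seed are all hypothetical choices, not details from the paper.

```python
import random

def sample_by_length(dataset, target_lengths=(500, 600, 700, 800, 900, 1000),
                     per_bucket=50, tolerance=50, seed=0):
    """For each target average table length, draw up to `per_bucket` samples
    whose average row count falls within `tolerance` of the target."""
    rng = random.Random(seed)
    subsets = {}
    for target in target_lengths:
        # Bucket boundaries are an assumption; the paper does not specify them.
        bucket = [ex for ex in dataset
                  if abs(ex["avg_rows"] - target) <= tolerance]
        subsets[target] = rng.sample(bucket, min(per_bucket, len(bucket)))
    return subsets
```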
Hardware Specification | Yes | For open-source models, all experiments are conducted on 8 A100 GPUs.
Software Dependencies | No | The paper mentions using specific LLMs (GPT-4-turbo, TableLlama-7B, SGPT-5.8B) and an optimizer (Adam), but does not provide specific version numbers for any software libraries or programming languages used for implementation.
Experiment Setup | Yes | We set the initial learning rate at 2e-5 and conducted training over three epochs. Optimization is performed using the Adam optimizer, with a batch size of 4 and a maximum input sequence length of 4,096.
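The reported hyperparameters can be collected into a single configuration fragment for reference; the key names are hypothetical, and the surrounding framework wiring (model, tokenizer, dataloader) would have to be supplied separately.

```python
# Hyperparameters as quoted in the paper's experiment setup.
TRAINING_CONFIG = {
    "learning_rate": 2e-5,   # initial learning rate
    "epochs": 3,             # training epochs
    "optimizer": "Adam",     # optimizer
    "batch_size": 4,         # per-step batch size
    "max_seq_length": 4096,  # maximum input sequence length
}
```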