MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions
Authors: Jian Wu, Linyi Yang, Dongyuan Li, Yuliang Ji, Manabu Okumura, Yue Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to assess the multi-table and multi-hop understanding and reasoning abilities of the LLMs on our MMQA dataset. The results demonstrate the superiority of human performance over current SOTA LLMs, shedding light on the challenges encountered by existing models in performing multi-table tasks. |
| Researcher Affiliation | Academia | Jian Wu1 Linyi Yang2 Dongyuan Li4 Yuliang Ji5 Manabu Okumura1 Yue Zhang3 1Institute of Science Tokyo 2University College London 3School of Engineering, Westlake University 4The University of Tokyo 5Nanjing University of Science and Technology. |
| Pseudocode | Yes | Algorithm 1 (Multi-Table Retrieval). Input: multi-hop question Q; LLM: GPT-4-turbo. Output: retrieved tables. First round: γ ← γ + α(q_0, table_j^0), computing only question-relevance scores. Then, for i = 1 to n, for j = 0 to M, for k = 0 to M: compute relevance scores γ ← γ + α(q_i, table_j^i) · β(table_k^{i−1}, table_j^i). Finally, for i in ArgSort(γ, descending=True): select the top-K relevant tables. Return tables. |
| Open Source Code | No | The paper provides a link for the MMQA data (https://anonymous.4open.science/r/MMQA-34B1) and mentions implementation details in Appendix C and prompts in Appendix B, but does not explicitly state that the source code for their proposed methodology (MTR) is open-sourced or provide a direct link to a code repository. |
| Open Datasets | Yes | The Whole MMQA data are available at https://anonymous.4open.science/r/MMQA-34B1 |
| Dataset Splits | Yes | Specifically, we divide our MMQA benchmark into two parts: 2-table (2591 samples, average of 1833.31 rows and 6.04 columns) and 3-table (721 samples, average of 1369.01 rows and 4.78 columns) subsets. ... We randomly select data with average table lengths of 500, 600, 700, 800, 900, and 1,000, sampling 50 samples for each type of data to evaluate the performance of LLMs under different length tables. |
| Hardware Specification | Yes | For open-source models, all experiments are conducted on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions using specific LLMs (GPT-4-turbo, Table Llama-7b, SGPT-5.8B) and an optimizer (Adam), but does not provide specific version numbers for any software libraries or programming languages used for implementation. |
| Experiment Setup | Yes | We set the initial learning rate at 2e-5 and conducted training over three epochs. Optimization is performed using the Adam optimizer, with a batch size of 4 and a maximum input sequence length of 4,096. |
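The Algorithm 1 cell above describes a two-phase scoring scheme: a first round that scores tables against the first sub-question alone, and later rounds that combine question relevance with table-to-table connectivity before taking the top-K tables. A minimal Python sketch of that control flow is below; the function and parameter names (`multi_table_retrieval`, `alpha`, `beta`, `top_k`) are hypothetical stand-ins, since the paper implements the scorers α and β with GPT-4-turbo rather than plain functions.

```python
# Hypothetical sketch of Algorithm 1 (Multi-Table Retrieval).
# alpha(q, t)      -> question-table relevance score (assumed callable)
# beta(t_prev, t)  -> table-table connectivity score (assumed callable)
# In the paper both scorers are realized with an LLM (GPT-4-turbo).

def multi_table_retrieval(question_hops, tables, alpha, beta, top_k=2):
    """Accumulate relevance scores over question hops, then keep top-K tables.

    question_hops: list of sub-questions q_0 .. q_n
    tables: list of candidate tables
    """
    scores = [0.0] * len(tables)

    # First round: question-relevance scores only.
    for j, t in enumerate(tables):
        scores[j] += alpha(question_hops[0], t)

    # Later rounds: weight question relevance by connectivity to
    # every table considered in the previous round.
    for q in question_hops[1:]:
        round_scores = [0.0] * len(tables)
        for j, t in enumerate(tables):
            for k, t_prev in enumerate(tables):
                round_scores[j] += alpha(q, t) * beta(t_prev, t)
        scores = [s + r for s, r in zip(scores, round_scores)]

    # Sort indices by accumulated score (descending) and select top-K.
    ranked = sorted(range(len(tables)), key=lambda j: scores[j], reverse=True)
    return [tables[j] for j in ranked[:top_k]]
```

With toy scorers that favor one table, the top-ranked table is the one with the highest accumulated α·β mass, matching the greedy selection the pseudocode describes.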