TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Authors: Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xeron Du, Di Liang, Daixin Shu, Xianfu Cheng, Tianzhen Sun, Tongliang Li, Zhoujun Li, Guanglin Niu

AAAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Massive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, where the most advanced model, GPT-4, achieves only a modest score compared to humans.
Researcher Affiliation | Collaboration | 1 Beihang University; 2 M-A-P; 3 Fudan University; 4 Beijing Information Science and Technology University
Pseudocode | No | The paper describes reasoning methods (TCoT, SCoT, PoT) using formal definitions and descriptive steps (e.g., 'STEP-1: Analyzing the available information...', 'STEP-2: Generating instructions...', 'STEP-3: Simulating the outcomes...'), but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/TableBench/TableBench
Open Datasets | Yes | We collect raw tabular data from existing datasets, including typical datasets such as WTQ (Pasupat and Liang 2015), SQA (Iyyer, Yih, and Chang 2017), TabFact (Nan et al. 2022), FeTaQA (Nan et al. 2022), FinQA (Chen et al. 2021c), AIT-QA (Katsis et al. 2022), etc. ... TableBench, a comprehensive and complex benchmark consisting of 886 samples, and TableInstruct (20K samples in total), massive instruction corpora designed to instruct LLMs with various reasoning methods.
Dataset Splits | Yes | We create a massively Table QA instruction corpora TableInstruct, covering three distinct reasoning methods. ... Finally, we propose two high-quality corpora: TableBench, a comprehensive and complex benchmark consisting of 886 samples, and TableInstruct (20K samples in total), massive instruction corpora designed to instruct LLMs with various reasoning methods. ... We conduct supervised finetuning of various open-source LLMs on the designated training set (TableInstruct).
Hardware Specification | Yes | For open-source models, we operate within the transformer environment on multiple A100 GPUs.
Software Dependencies | No | The paper mentions operating "within the transformer environment" and using "Python-based instruction" or a "language interpreter, like Python," but it does not specify any version numbers for these software components or other libraries.
Experiment Setup | Yes | We utilize a cosine annealing scheduler, setting the initial learning rate at 2e-5, and conduct training over three epochs. Optimization is performed using the Adam optimizer, with a batch size of 512 and a maximum sequence length of 4096.
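The hyperparameters quoted in the Experiment Setup row can be captured in a short stdlib-only sketch. The `CONFIG` dict and the `cosine_annealing_lr` helper are illustrative names, not from the paper, and the schedule below assumes annealing from the initial rate to zero with no warmup, since the paper does not report a minimum learning rate or warmup steps.

```python
import math

# Hyperparameters as reported in the paper's experiment setup
# (names in this dict are illustrative, not the authors' code).
CONFIG = {
    "learning_rate": 2e-5,   # initial learning rate
    "epochs": 3,             # training epochs
    "optimizer": "adam",     # Adam optimizer
    "batch_size": 512,
    "max_seq_length": 4096,
    "scheduler": "cosine",   # cosine annealing schedule
}

def cosine_annealing_lr(step, total_steps,
                        initial_lr=CONFIG["learning_rate"], min_lr=0.0):
    """Cosine-annealed learning rate at `step`, decaying from
    `initial_lr` down to `min_lr` over `total_steps` (min_lr=0 assumed)."""
    progress = step / total_steps
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_annealing_lr(0, 100))    # → 2e-05 (starts at the initial rate)
print(cosine_annealing_lr(100, 100))  # → 0.0 (fully annealed)
```

The schedule decreases monotonically, with the steepest decay in the middle of training, which matches the usual behavior of cosine annealing implementations such as PyTorch's `CosineAnnealingLR`.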