TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
Authors: Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xeron Du, Di Liang, Daixin Shu, Xianfu Cheng, Tianzhen Sun, Tongliang Li, Zhoujun Li, Guanglin Niu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Massive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands; even the most advanced model, GPT-4, achieves only a modest score compared to humans. |
| Researcher Affiliation | Collaboration | 1Beihang University 2M-A-P 3Fudan University 4Beijing Information Science and Technology University |
| Pseudocode | No | The paper describes reasoning methods (TCoT, SCoT, PoT) using formal definitions and descriptive steps (e.g., 'STEP-1: Analyzing the available information...', 'STEP-2: Generating instructions...', 'STEP-3: Simulating the outcomes...'), but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/TableBench/TableBench |
| Open Datasets | Yes | We collect raw tabular data from existing datasets, including typical datasets such as WTQ (Pasupat and Liang 2015), SQA (Iyyer, Yih, and Chang 2017), TabFact (Nan et al. 2022), FeTaQA (Nan et al. 2022), FinQA (Chen et al. 2021c), AIT-QA (Katsis et al. 2022), etc. ... TableBench, a comprehensive and complex benchmark consisting of 886 samples, and TableInstruct (20K samples in total), massive instruction corpora designed to instruct LLMs with various reasoning methods. |
| Dataset Splits | Yes | We create a massive Table QA instruction corpus, TableInstruct, covering three distinct reasoning methods. ... Finally, we propose two high-quality corpora: TableBench, a comprehensive and complex benchmark consisting of 886 samples, and TableInstruct (20K samples in total), massive instruction corpora designed to instruct LLMs with various reasoning methods. ... We conduct supervised finetuning of various open-source LLMs on the designated training set (TableInstruct). |
| Hardware Specification | Yes | For open-source models, we operate within the transformer environment on multiple A100 GPUs. |
| Software Dependencies | No | The paper mentions operating "within the transformer environment" and using "Python-based instruction" or a "language interpreter, like Python," but it does not specify any version numbers for these software components or other libraries. |
| Experiment Setup | Yes | We utilize a cosine annealing scheduler, setting the initial learning rate at 2e-5, and conduct training over three epochs. Optimization is performed using the Adam optimizer, with a batch size of 512 and a maximum sequence length of 4096. |
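The PoT (Program-of-Thought) method noted in the Pseudocode row has the model emit an executable program over the table, which a language interpreter (like Python) then runs to obtain the answer. A minimal sketch of that loop, where the mini-table and the "generated" program are purely illustrative (not a TableBench sample):

```python
# Illustrative mini-table as a list of rows; not a TableBench sample.
rows = [
    {"Country": "A", "Medals": 12},
    {"Country": "B", "Medals": 7},
    {"Country": "C", "Medals": 20},
]

# In PoT, the model emits a short program over the table instead of
# textual reasoning; the interpreter executes it to get the answer.
generated_program = "answer = max(rows, key=lambda r: r['Medals'])['Country']"

namespace = {"rows": rows}
exec(generated_program, namespace)  # run the model-generated code
print(namespace["answer"])  # -> C
```

The key property is that the final answer comes from executed code rather than from free-form generation, which is why PoT questions can be scored exactly.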
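The fine-tuning recipe in the Experiment Setup row can be collected into a single config; the helper below also derives the gradient-accumulation steps implied by a given per-device batch size. The per-device batch and GPU count are assumptions for illustration (the paper reports only "multiple A100 GPUs"):

```python
# Hyperparameters as reported in the paper; the wiring is a sketch.
training_config = {
    "lr_scheduler_type": "cosine",   # cosine annealing scheduler
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "optimizer": "adam",
    "global_batch_size": 512,
    "max_seq_length": 4096,
}

def grad_accum_steps(global_batch: int, per_device_batch: int, num_gpus: int) -> int:
    """Gradient-accumulation steps needed to reach the global batch size."""
    micro_batch = per_device_batch * num_gpus
    assert global_batch % micro_batch == 0, "global batch must divide evenly"
    return global_batch // micro_batch

# Example: 8 A100s with a per-device micro-batch of 8 (assumed values).
print(grad_accum_steps(training_config["global_batch_size"], 8, 8))  # -> 8
```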