TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Authors: Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xeron Du, Di Liang, Daixin Shu, Xianfu Cheng, Tianzhen Sun, Tongliang Li, Zhoujun Li, Guanglin Niu

AAAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Massive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, where the most advanced model, GPT-4, achieves only a modest score compared to humans.
Researcher Affiliation | Collaboration | 1 Beihang University; 2 M-A-P; 3 Fudan University; 4 Beijing Information Science and Technology University
Pseudocode | No | The paper describes reasoning methods (TCoT, SCoT, PoT) using formal definitions and descriptive steps (e.g., 'STEP-1: Analyzing the available information...', 'STEP-2: Generating instructions...', 'STEP-3: Simulating the outcomes...'), but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/TableBench/TableBench
Open Datasets | Yes | We collect raw tabular data from existing datasets, including typical datasets such as WTQ (Pasupat and Liang 2015), SQA (Iyyer, Yih, and Chang 2017), TabFact (Nan et al. 2022), FeTaQA (Nan et al. 2022), FinQA (Chen et al. 2021c), AIT-QA (Katsis et al. 2022), etc. ... TableBench, a comprehensive and complex benchmark consisting of 886 samples, and TableInstruct (20K samples in total), massive instruction corpora designed to instruct LLMs with various reasoning methods.
Dataset Splits | Yes | We create a massively Table QA instruction corpora TableInstruct, covering three distinct reasoning methods. ... Finally, we propose two high-quality corpora: TableBench, a comprehensive and complex benchmark consisting of 886 samples, and TableInstruct (20K samples in total), massive instruction corpora designed to instruct LLMs with various reasoning methods. ... We conduct supervised finetuning of various open-source LLMs on the designated training set (TableInstruct).
Hardware Specification | Yes | For open-source models, we operate within the transformer environment on multiple A100 GPUs.
Software Dependencies | No | The paper mentions operating "within the transformer environment" and using "Python-based instruction" or a "language interpreter, like Python," but it does not specify any version numbers for these software components or other libraries.
Experiment Setup | Yes | We utilize a cosine annealing scheduler, setting the initial learning rate at 2e-5, and conduct training over three epochs. Optimization is performed using the Adam optimizer, with a batch size of 512 and a maximum sequence length of 4096.
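The hyperparameters quoted in the Experiment Setup row can be captured in a short stdlib-only sketch. The `CONFIG` dict and the `cosine_annealing_lr` helper are illustrative names, not from the paper, and the schedule below assumes annealing from the initial rate to zero with no warmup, since the paper does not report a minimum learning rate or warmup steps.

```python
import math

# Hyperparameters as reported in the paper's experiment setup
# (names in this dict are illustrative, not the authors' code).
CONFIG = {
    "learning_rate": 2e-5,   # initial learning rate
    "epochs": 3,             # training epochs
    "optimizer": "adam",     # Adam optimizer
    "batch_size": 512,
    "max_seq_length": 4096,
    "scheduler": "cosine",   # cosine annealing schedule
}

def cosine_annealing_lr(step, total_steps,
                        initial_lr=CONFIG["learning_rate"], min_lr=0.0):
    """Cosine-annealed learning rate at `step`, decaying from
    `initial_lr` down to `min_lr` over `total_steps` (min_lr=0 assumed)."""
    progress = step / total_steps
    return min_lr + 0.5 * (initial_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_annealing_lr(0, 100))    # → 2e-05 (starts at the initial rate)
print(cosine_annealing_lr(100, 100))  # → 0.0 (fully annealed)
```

The schedule decreases monotonically, with the steepest decay in the middle of training, which matches the usual behavior of cosine annealing implementations such as PyTorch's `CosineAnnealingLR`.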