Interpretable LLM-based Table Question Answering
Authors: Giang Nguyen, Ivan Brugere, Shubham Sharma, Sanjay Kariyappa, Anh Totti Nguyen, Freddy Lecue
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we show that: First, POS generates the highest-quality explanations among compared methods, which markedly improves the users' ability to simulate and verify the model's decisions. Second, when evaluated on standard Table QA benchmarks (TabFact, WikiTQ, and FeTaQA), POS achieves QA accuracy that is competitive with existing methods, while also offering greater efficiency, requiring significantly fewer LLM calls and table database queries (up to 25× fewer) and more robust performance on large-sized tables. Finally, we observe high agreement (up to 90.59% in forward simulation) between LLMs and human users when making decisions based on the same explanations, suggesting that LLMs could serve as an effective proxy for humans in evaluating Table QA explanations. |
| Researcher Affiliation | Collaboration | Giang Nguyen EMAIL (Auburn University); Ivan Brugere EMAIL (J.P. Morgan); Shubham Sharma EMAIL (J.P. Morgan); Sanjay Kariyappa EMAIL (NVIDIA); Anh Totti Nguyen EMAIL (Auburn University); Freddy Lecue EMAIL (J.P. Morgan) |
| Pseudocode | No | The paper describes the Plan-of-SQLs (POS) method through a conceptual illustration in Figure 2 and details its components like Natural Language Planning and Step-to-SQL conversion in natural language text and prompt templates. It does not contain formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | Code and data available at: https://github.com/anguyen8/pos |
| Open Datasets | Yes | We conduct experiments using three popular and standard Table QA benchmarks: TabFact (Chen et al., 2020), WikiTQ (Pasupat & Liang, 2015), and FeTaQA (Nan et al., 2022). |
| Dataset Splits | Yes | TabFact is a fact verification dataset in which each statement associated with a table is labeled TRUE or FALSE. We use the cleaned TabFact dataset from Wang et al. (2024) and evaluate Table QA methods with binary classification accuracy on the 2,024-sample test-small set. WikiTQ is a question-answering dataset where the goal is to answer human-written questions using an input table. Using the dataset and evaluation scripts from Ye et al. (2023), we assess model denotation accuracy (whether the predicted answer is equal to the ground-truth answer) on the 4,344-sample standard test set. FeTaQA is a free-form Table QA dataset where the task is to generate free-form natural language responses based on information retrieved or inferred from a table. We evaluate models on the 2,003-sample standard test set using BLEU and ROUGE. |
| Hardware Specification | No | The paper mentions using OpenAI's LLMs (gpt-4-turbo-2024-04-09, gpt-4o, and gpt-4o-mini) and open-source LLMs (Qwen2.5-72B-Inst and Llama-3.1-405B-Inst hosted by SambaNova), but does not specify the hardware (e.g., GPU models, CPU types, memory) used for their experiments or model training. |
| Software Dependencies | No | The paper mentions using "SQLite3" and "Python sqlite3 and pandas". While SQLite3 is mentioned with a citation, specific version numbers for SQLite3, Python, or the pandas library are not provided. The phrase "Python sqlite3 and pandas" is part of a prompt constraint and not a declaration of dependencies with versions. |
| Experiment Setup | Yes | Unless otherwise noted, we use temperature = 0 and top-p = 1 for LLM generation. |
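
The paper describes POS as converting each natural-language plan step into a SQL query executed over the table (via SQLite3), with each step producing an inspectable intermediate table. A minimal sketch of that step-wise execution pattern, using only Python's standard-library `sqlite3` (the table, step, and query below are illustrative assumptions, not the paper's actual prompts or data):

```python
import sqlite3

# Hypothetical mini-table; in POS, the input table accompanies a question or statement.
rows = [("argentina", 2), ("brazil", 5), ("germany", 4)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (team TEXT, titles INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# One plan step translated to SQL (illustrative):
# Step: keep only teams with more than 3 titles.
step_sql = "SELECT team, titles FROM t WHERE titles > 3"
intermediate = conn.execute(step_sql).fetchall()
print(intermediate)
```

Because every step materializes an intermediate table, a user can follow the chain of transformations and verify how the final answer was reached, which is the interpretability property the paper evaluates.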