Interpretable LLM-based Table Question Answering
Authors: Giang Nguyen, Ivan Brugere, Shubham Sharma, Sanjay Kariyappa, Anh Totti Nguyen, Freddy Lecue
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we show that: First, POS generates the highest-quality explanations among compared methods, which markedly improves the users' ability to simulate and verify the model's decisions. Second, when evaluated on standard Table QA benchmarks (TabFact, WikiTQ, and FeTaQA), POS achieves QA accuracy that is competitive with existing methods, while also offering greater efficiency, requiring significantly fewer LLM calls and table database queries (up to 25× fewer) and more robust performance on large-sized tables. Finally, we observe high agreement (up to 90.59% in forward simulation) between LLMs and human users when making decisions based on the same explanations, suggesting that LLMs could serve as an effective proxy for humans in evaluating Table QA explanations. |
| Researcher Affiliation | Collaboration | Giang Nguyen EMAIL (Auburn University); Ivan Brugere EMAIL (J.P. Morgan); Shubham Sharma EMAIL (J.P. Morgan); Sanjay Kariyappa EMAIL (NVIDIA); Anh Totti Nguyen EMAIL (Auburn University); Freddy Lecue EMAIL (J.P. Morgan) |
| Pseudocode | No | The paper describes the Plan-of-SQLs (POS) method through a conceptual illustration in Figure 2 and details its components like Natural Language Planning and Step-to-SQL conversion in natural language text and prompt templates. It does not contain formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | Code and data available at: https://github.com/anguyen8/pos |
| Open Datasets | Yes | We conduct experiments using three popular and standard Table QA benchmarks: TabFact (Chen et al., 2020), WikiTQ (Pasupat & Liang, 2015), and FeTaQA (Nan et al., 2022). |
| Dataset Splits | Yes | TabFact is a fact verification dataset in which each statement associated with a table is labeled TRUE or FALSE. We use the cleaned TabFact dataset from Wang et al. (2024) and evaluate Table QA methods with binary classification accuracy on the 2,024-sample test-small set. WikiTQ is a question-answering dataset where the goal is to answer human-written questions using an input table. Using the dataset and evaluation scripts from Ye et al. (2023), we assess model denotation accuracy (whether the predicted answer is equal to the ground-truth answer) on the 4,344-sample standard test set. FeTaQA is a free-form Table QA dataset where the task is to generate free-form natural language responses based on information retrieved or inferred from a table. We evaluate models on the 2,003-sample standard test set using BLEU and ROUGE. |
| Hardware Specification | No | The paper mentions using OpenAI's LLMs (gpt-4-turbo-2024-04-09, gpt-4o, and gpt-4o-mini) and open-source LLMs (Qwen2.5-72B-Inst and Llama-3.1-405B-Inst hosted by SambaNova), but does not specify the hardware (e.g., GPU models, CPU types, memory) used for their experiments or model training. |
| Software Dependencies | No | The paper mentions using "SQLite3" and "Python sqlite3 and pandas". While SQLite3 is mentioned with a citation, specific version numbers for SQLite3, Python, or the pandas library are not provided. The phrase "Python sqlite3 and pandas" is part of a prompt constraint and not a declaration of dependencies with versions. |
| Experiment Setup | Yes | Unless otherwise noted, we use temperature = 0 and top-p = 1 for LLM generation. |
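
The paper describes POS as converting each natural-language plan step into a SQL query executed over the table (via SQLite3), with each step producing an inspectable intermediate table. A minimal sketch of that step-wise execution pattern, using only Python's standard-library `sqlite3` (the table, step, and query below are illustrative assumptions, not the paper's actual prompts or data):

```python
import sqlite3

# Hypothetical mini-table; in POS, the input table accompanies a question or statement.
rows = [("argentina", 2), ("brazil", 5), ("germany", 4)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (team TEXT, titles INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# One plan step translated to SQL (illustrative):
# Step: keep only teams with more than 3 titles.
step_sql = "SELECT team, titles FROM t WHERE titles > 3"
intermediate = conn.execute(step_sql).fetchall()
print(intermediate)
```

Because every step materializes an intermediate table, a user can follow the chain of transformations and verify how the final answer was reached, which is the interpretability property the paper evaluates.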