Are Large Language Models Ready for Multi-Turn Tabular Data Analysis?

Authors: Jinyang Li, Nan Huo, Yan Gao, Jiayi Shi, Yingxiu Zhao, Ge Qu, Bowen Qin, Yurong Wu, Xiaodong Li, Chenhao Ma, Jian-Guang Lou, Reynold Cheng

ICML 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. We evaluate popular and advanced LLMs on COTA, which highlights the challenges of conversational tabular data analysis. Furthermore, we propose Adaptive Conversation Reflection (ACR), a self-generated reflection strategy that guides LLMs to learn from successful histories. Experiments demonstrate that ACR can evolve LLMs into effective conversational tabular data analysis agents, achieving a relative performance improvement of up to 35.14%.
Researcher Affiliation: Collaboration. 1. School of Computing and Data Science, The University of Hong Kong; 2. Microsoft; 3. The Chinese University of Hong Kong, Shenzhen; 4. Alibaba Group; 5. Beijing Academy of Artificial Intelligence; 6. Xiamen University.
Pseudocode: Yes. Pseudocode Logic Generation. First, given the previous turn of history (u_{t-1}; a_{t-1}) when t > 1, we prompt the data analysis agent to reflect and generate its underlying logic m_{t-1} = f_θ(u_{t-1}; a_{t-1}), where f_θ denotes the LLM-based agent with parameters θ. Here, (x; y) denotes that elements x and y are concatenated in the prompt. In this work, we treat the pseudocode as m, since it serves as intermediate logic between natural-language queries and code.
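The reflection step above can be sketched in Python. This is a minimal, hedged illustration, not the authors' implementation: `call_llm` is a hypothetical stand-in for the LLM-backed agent f_θ, and the prompt wording and returned pseudocode are invented for the example.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for the LLM-backed agent f_theta; a real
    # agent would query an LLM API with this prompt instead.
    return "FILTER rows WHERE price > 100; AGGREGATE mean(rating)"


def generate_logic(user_query: str, agent_answer: str) -> str:
    """Reflect on the previous turn (u_{t-1}, a_{t-1}) and produce its
    underlying pseudocode logic m_{t-1} = f_theta(u_{t-1}; a_{t-1})."""
    # (x; y) denotes concatenation of the two elements in the prompt.
    prompt = (
        "Given the previous user query and the agent's answer, "
        "summarize the underlying analysis logic as pseudocode.\n"
        f"Query: {user_query}\n"
        f"Answer: {agent_answer}\n"
        "Logic:"
    )
    return call_llm(prompt)
```

The generated pseudocode m_{t-1} can then be appended to the conversation history so later turns condition on successful logic rather than raw answers alone.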
Open Source Code: Yes. Code can be found at https://tapilot-crossing.github.io/
Open Datasets: Yes. We collect open-source tables from Kaggle, a popular data science platform.
Dataset Splits: No. The paper does not explicitly provide information about reproducible training, testing, or validation dataset splits. It mentions categories of conversations (clear, action, private lib, private act) and sampling for human evaluation, but not standard splits for model training and evaluation.
Hardware Specification: No. The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory amounts) used for running the experiments. It refers to using various LLMs but not the underlying hardware for the authors' experimental setup.
Software Dependencies: No. The paper mentions several software components, such as Python 3 (for the Executor), the sqlite3 Python package, pandas, numpy, PIL, and the AST package. However, it does not consistently provide specific version numbers for these dependencies, which a reproducible description requires. The mention of "pandas version 2.0.3 and matplotlib version 3.7.4" appears within a prompt instructing the LLM on which versions to generate code for, not as a statement of the versions used in the authors' experimental setup.
Experiment Setup: Yes. The temperature is set to 0.0 and top_p to 1.0 for Claude, GPT-4, and GPT-4-Turbo. Specifically, we set MAX_STEP for Code Act reasoning to 5, with the Executor serving as the primary tool.
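The reported settings can be sketched as a small configuration plus a bounded reason-act loop. This is an illustrative sketch only: the dictionary keys mirror common provider SDK parameter names, and `run_code_act` is a hypothetical driver, not the paper's actual agent loop.

```python
# Decoding settings reported for Claude, GPT-4, and GPT-4-Turbo.
DECODING_CONFIG = {
    "temperature": 0.0,  # deterministic (greedy-like) decoding
    "top_p": 1.0,
}

# Upper bound on Code Act reasoning iterations.
MAX_STEP = 5


def run_code_act(initial_state, step_fn, is_done):
    """Run up to MAX_STEP reason-act iterations.

    step_fn stands in for one reason-act turn (e.g., generating code
    and running it through the Executor); is_done checks whether the
    task is solved so the loop can stop early.
    """
    state = initial_state
    for _ in range(MAX_STEP):
        state = step_fn(state)
        if is_done(state):
            break
    return state
```

Capping the loop at MAX_STEP = 5 bounds the cost of each conversation turn while still allowing the agent a few rounds of code execution and self-correction.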