Are Large Language Models Ready for Multi-Turn Tabular Data Analysis?

Authors: Jinyang Li, Nan Huo, Yan Gao, Jiayi Shi, Yingxiu Zhao, Ge Qu, Bowen Qin, Yurong Wu, Xiaodong Li, Chenhao Ma, Jian-Guang Lou, Reynold Cheng

ICML 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. We evaluate popular and advanced LLMs on COTA, which highlights the challenges of conversational tabular data analysis. Furthermore, we propose Adaptive Conversation Reflection (ACR), a self-generated reflection strategy that guides LLMs to learn from successful histories. Experiments demonstrate that ACR can evolve LLMs into effective conversational tabular data analysis agents, achieving a relative performance improvement of up to 35.14%.
Researcher Affiliation: Collaboration. 1. School of Computing and Data Science, The University of Hong Kong; 2. Microsoft; 3. The Chinese University of Hong Kong, Shenzhen; 4. Alibaba Group; 5. Beijing Academy of Artificial Intelligence; 6. Xiamen University.
Pseudocode: Yes. Pseudocode Logic Generation. First, given the previous turn of history (u_{t-1}; a_{t-1}) when t > 1, we prompt the data analysis agent to reflect and generate its underlying logic m_{t-1} = f_θ(u_{t-1}; a_{t-1}), where f_θ denotes the LLM-based agent with parameters θ. Here, (x; y) denotes that elements x and y are concatenated in the prompt. In this work, we treat the pseudocode as m, since it serves as intermediate logic between natural-language queries and code.
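The reflection step above can be sketched in Python. This is a minimal, hedged illustration, not the authors' implementation: `call_llm` is a hypothetical stand-in for the LLM-backed agent f_θ, and the prompt wording and returned pseudocode are invented for the example.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for the LLM-backed agent f_theta; a real
    # agent would query an LLM API with this prompt instead.
    return "FILTER rows WHERE price > 100; AGGREGATE mean(rating)"


def generate_logic(user_query: str, agent_answer: str) -> str:
    """Reflect on the previous turn (u_{t-1}, a_{t-1}) and produce its
    underlying pseudocode logic m_{t-1} = f_theta(u_{t-1}; a_{t-1})."""
    # (x; y) denotes concatenation of the two elements in the prompt.
    prompt = (
        "Given the previous user query and the agent's answer, "
        "summarize the underlying analysis logic as pseudocode.\n"
        f"Query: {user_query}\n"
        f"Answer: {agent_answer}\n"
        "Logic:"
    )
    return call_llm(prompt)
```

The generated pseudocode m_{t-1} can then be appended to the conversation history so later turns condition on successful logic rather than raw answers alone.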
Open Source Code: Yes. Code can be found at https://tapilot-crossing.github.io/
Open Datasets: Yes. We collect open-source tables from Kaggle, a popular data science platform.
Dataset Splits: No. The paper does not explicitly provide information about reproducible training, testing, or validation dataset splits. It mentions categories of conversations (clear, action, private lib, private act) and sampling for human evaluation, but not standard splits for model training and evaluation.
Hardware Specification: No. The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory amounts) used for running the experiments. It refers to using various LLMs but not the underlying hardware for the authors' experimental setup.
Software Dependencies: No. The paper mentions several software components, such as Python 3 (for the Executor), the sqlite3 Python package, pandas, numpy, PIL, and the AST package. However, it does not consistently provide specific version numbers for these dependencies, which a reproducible description requires. The mention of "pandas version 2.0.3 and matplotlib version 3.7.4" appears within a prompt instructing the LLM on which versions to generate code for, not as a statement of the versions used in the authors' experimental setup.
Experiment Setup: Yes. The temperature is set to 0.0 and top_p to 1.0 for Claude, GPT-4, and GPT-4-Turbo. Specifically, we set MAX_STEP for Code Act reasoning to 5, with the Executor serving as the primary tool.
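The reported settings can be sketched as a small configuration plus a bounded reason-act loop. This is an illustrative sketch only: the dictionary keys mirror common provider SDK parameter names, and `run_code_act` is a hypothetical driver, not the paper's actual agent loop.

```python
# Decoding settings reported for Claude, GPT-4, and GPT-4-Turbo.
DECODING_CONFIG = {
    "temperature": 0.0,  # deterministic (greedy-like) decoding
    "top_p": 1.0,
}

# Upper bound on Code Act reasoning iterations.
MAX_STEP = 5


def run_code_act(initial_state, step_fn, is_done):
    """Run up to MAX_STEP reason-act iterations.

    step_fn stands in for one reason-act turn (e.g., generating code
    and running it through the Executor); is_done checks whether the
    task is solved so the loop can stop early.
    """
    state = initial_state
    for _ in range(MAX_STEP):
        state = step_fn(state)
        if is_done(state):
            break
    return state
```

Capping the loop at MAX_STEP = 5 bounds the cost of each conversation turn while still allowing the agent a few rounds of code execution and self-correction.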