InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation

Authors: Gaurav Sahu, Abhay Puri, Juan A. Rodriguez, Amirhossein Abaskohi, Mohammad Chegini, Alexandre Drouin, Perouz Taslakian, Valentina Zantedeschi, Alexandre Lacoste, David Vazquez, Nicolas Chapados, Christopher Pal, Sai Rajeswar, Issam Laradji

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We introduce InsightBench, a benchmark dataset with three key features. ... Furthermore, we implement a two-way evaluation mechanism using LLaMA-3 as an effective, open-source evaluator to assess agents' ability to extract insights. ... Our evaluation on InsightBench shows that AgentPoirot outperforms existing approaches (such as Pandas Agent) that focus on resolving single queries. We also compare the performance of open- and closed-source LLMs and various evaluation strategies. Overall, this benchmark serves as a testbed to motivate further development in comprehensive automated data analytics... We conduct the following ablation studies to understand the importance of different parameters of InsightBench and their effect on agent performance. Table 1 shows the performance of different data analytics agents on InsightBench.
Researcher Affiliation Collaboration 1ServiceNow Research, 2Mila – Quebec AI Institute, 3Canada CIFAR AI Chair, 4École de Technologie Supérieure, 5University of British Columbia, 6University of Waterloo, 7University of Victoria
Pseudocode No The paper describes the steps of AgentPoirot in Section 3.1(b) and visually in Figure 7. Appendix D also provides detailed prompts. However, none of these are presented as formal pseudocode or an algorithm block with structured control-flow statements.
Open Source Code Yes Overall, this benchmark serves as a testbed to motivate further development in comprehensive automated data analytics and can be accessed here: https://github.com/ServiceNow/insight-bench. ... All our code, including data used for model implementation, training, and evaluation, is available in the supplementary material.
Open Datasets Yes We introduce InsightBench, a benchmark dataset... can be accessed here: https://github.com/ServiceNow/insight-bench.
Dataset Splits No The paper describes the InsightBench dataset as consisting of 100 datasets and evaluates agents on all of them. It mentions grouping by difficulty and dataset category, but it does not specify traditional training, validation, and test splits for model training or for evaluating new models. Evaluation is performed on the entire set of 100 datasets.
Hardware Specification Yes We use Python's openai package to access the family of GPT models for our experiments and vllm to host LLaMA-3-70b model on 4 A100 GPUs. ... All LLaMA-3 experiments were conducted on 2 x 80G A100 GPUs.
Software Dependencies Yes We use Python's openai package to access the family of GPT models for our experiments and vllm to host LLaMA-3-70b model on 4 A100 GPUs. ... We use the evaluate Python package to compute ROUGE-1 scores... Dependencies: A complete list of software dependencies is provided in the requirements.txt file of the supplementary material.
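The quoted setup computes ROUGE-1 with the `evaluate` package. As a rough illustration of what that metric measures, here is a minimal from-scratch unigram-overlap F1; the function name and whitespace tokenization are our own simplifications, not the paper's code:

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Minimal ROUGE-1 F1: clipped unigram overlap, lowercased whitespace tokens."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # each unigram counted at most min(pred, ref) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Identical strings score 1.0 and disjoint strings score 0.0; the packaged implementation additionally applies stemming and other normalization, so scores will not match this sketch exactly.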
Experiment Setup Yes All results are reported for the sampling temperature of 0.0, unless otherwise stated. ... We repeat all our experiments for 5 seeds and report the mean and standard deviation in our results.
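The quoted protocol (5 seeds, mean and standard deviation) can be sketched as follows; `aggregate_over_seeds` and `run_fn` are hypothetical names of our own, where `run_fn` would wrap a full agent evaluation at sampling temperature 0.0:

```python
import statistics

def aggregate_over_seeds(run_fn, seeds=(0, 1, 2, 3, 4)):
    """Run one evaluation per seed; report mean and sample standard deviation."""
    scores = [run_fn(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

# A toy scoring function stands in for a real agent run here.
mean, std = aggregate_over_seeds(lambda seed: 0.50 + 0.01 * seed)
```

Reporting mean ± standard deviation across seeds, as the paper does, separates the agent's average quality from run-to-run variance.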