InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation

Authors: Gaurav Sahu, Abhay Puri, Juan A. Rodriguez, Amirhossein Abaskohi, Mohammad Chegini, Alexandre Drouin, Perouz Taslakian, Valentina Zantedeschi, Alexandre Lacoste, David Vazquez, Nicolas Chapados, Christopher Pal, Sai Rajeswar, Issam Laradji

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We introduce InsightBench, a benchmark dataset with three key features. ... Furthermore, we implement a two-way evaluation mechanism using LLaMA-3 as an effective, open-source evaluator to assess agents' ability to extract insights. ... Our evaluation on InsightBench shows that AgentPoirot outperforms existing approaches (such as Pandas Agent) that focus on resolving single queries. We also compare the performance of open- and closed-source LLMs and various evaluation strategies. Overall, this benchmark serves as a testbed to motivate further development in comprehensive automated data analytics... We conduct the following ablation studies to understand the importance of different parameters of InsightBench and their effect on agent performance. Table 1 shows the performance of different data analytics agents on InsightBench.
Researcher Affiliation Collaboration 1ServiceNow Research, 2Mila – Quebec AI Institute, 3Canada CIFAR AI Chair, 4École de Technologie Supérieure, 5University of British Columbia, 6University of Waterloo, 7University of Victoria
Pseudocode No The paper describes the steps of AgentPoirot in Section 3.1(b) and visually in Figure 7. Appendix D also provides detailed prompts. However, none of these are presented as formal pseudocode or an algorithm block with structured control-flow statements.
Open Source Code Yes Overall, this benchmark serves as a testbed to motivate further development in comprehensive automated data analytics and can be accessed here: https://github.com/ServiceNow/insight-bench. ... All our code, including data used for model implementation, training, and evaluation, is available in the supplementary material.
Open Datasets Yes We introduce InsightBench, a benchmark dataset... can be accessed here: https://github.com/ServiceNow/insight-bench.
Dataset Splits No The paper describes the InsightBench dataset as consisting of 100 datasets and evaluates agents on all of them. It mentions grouping by difficulty and dataset category, but it does not specify traditional training, validation, and test splits for model training or for evaluating new models. Evaluation is performed on the entire set of 100 datasets.
Hardware Specification Yes We use Python's openai package to access the family of GPT models for our experiments and vllm to host LLaMA-3-70b model on 4 A100 GPUs. ... All LLaMA-3 experiments were conducted on 2 x 80G A100 GPUs.
Software Dependencies Yes We use Python's openai package to access the family of GPT models for our experiments and vllm to host LLaMA-3-70b model on 4 A100 GPUs. ... We use the evaluate Python package to compute ROUGE-1 scores... Dependencies: A complete list of software dependencies is provided in the requirements.txt file of the supplementary material.
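The quoted setup computes ROUGE-1 with the `evaluate` package. As a rough illustration of what that metric measures, here is a minimal from-scratch unigram-overlap F1; the function name and whitespace tokenization are our own simplifications, not the paper's code:

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Minimal ROUGE-1 F1: clipped unigram overlap, lowercased whitespace tokens."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # each unigram counted at most min(pred, ref) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Identical strings score 1.0 and disjoint strings score 0.0; the packaged implementation additionally applies stemming and other normalization, so scores will not match this sketch exactly.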
Experiment Setup Yes All results are reported for the sampling temperature of 0.0, unless otherwise stated. ... We repeat all our experiments for 5 seeds and report the mean and standard deviation in our results.
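The quoted protocol (5 seeds, mean and standard deviation) can be sketched as follows; `aggregate_over_seeds` and `run_fn` are hypothetical names of our own, where `run_fn` would wrap a full agent evaluation at sampling temperature 0.0:

```python
import statistics

def aggregate_over_seeds(run_fn, seeds=(0, 1, 2, 3, 4)):
    """Run one evaluation per seed; report mean and sample standard deviation."""
    scores = [run_fn(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

# A toy scoring function stands in for a real agent run here.
mean, std = aggregate_over_seeds(lambda seed: 0.50 + 0.01 * seed)
```

Reporting mean ± standard deviation across seeds, as the paper does, separates the agent's average quality from run-to-run variance.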