BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Authors: Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen GONG, James Hoang, Armel Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To assess how well LLMs can solve challenging and practical tasks via programs, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. To evaluate LLMs rigorously, each task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.
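To give a concrete sense of the task format described above, each BigCodeBench task pairs a docstring-style instruction with a function whose solution must compose calls across libraries, verified by unit tests. The sketch below is a hypothetical task in that style (not drawn from the benchmark), restricted to the standard library for illustration:

```python
import re
from collections import Counter

def task_func(text):
    """Count word frequencies in `text`, case-insensitively.

    Illustrates the BigCodeBench task shape: a docstring instruction
    plus a solution composing calls from multiple libraries (here
    only the standard library's `re` and `collections`).
    """
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return Counter(words)
```

In the benchmark itself, each such task additionally ships test cases (5.6 per task on average) that exercise edge cases and check functional correctness via execution.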
Researcher Affiliation | Collaboration | 1 Monash University; 2 CSIRO's Data61; 3 Singapore Management University; 4 Detomo Inc., Japan; 5 Queen Mary University of London; 6 University of Notre Dame; 7 TU Darmstadt; 8 Independent; 9 University of Virginia; 10 Inria; 11 Intel; 12 Tencent AI Lab; 13 Cornell University; 14 University College London; 15 UNC Chapel Hill; 16 UC Berkeley; 17 MIT; 18 Shanghai Jiaotong University; 19 Uber; 20 UIUC; 21 Sea AI Lab; 22 AWS AI Labs; 23 Contextual AI; 24 Carnegie Mellon University; 25 ServiceNow Research; 26 Hugging Face
Pseudocode | No | The paper includes figures (e.g., Figure 1) showing Python code examples and task descriptions, but does not contain explicitly labeled pseudocode or algorithm blocks. The methods are described using text and code snippets, not abstracted pseudocode.
Open Source Code | Yes | Table 4: Artifacts for reproducibility. Annotation Framework: GitHub https://github.com/bigcode-project/bigcodebench-annotation. Evaluation Framework: GitHub https://github.com/bigcode-project/bigcodebench, PyPI https://pypi.org/project/bigcodebench/.
Open Datasets | Yes | The evaluation dataset will be released to the public, and hosted on Hugging Face. Our dataset is distributed under the Apache-2.0 license. Table 4: Artifacts for reproducibility. BigCodeBench (v0.2.4): Hugging Face https://huggingface.co/datasets/bigcode/bigcodebench.
Dataset Splits | No | The paper introduces BigCodeBench as a benchmark with 1,140 tasks, each with an average of 5.6 test cases for evaluation. It differentiates between 'BigCodeBench-Complete' and 'BigCodeBench-Instruct' for different prompting scenarios. However, it does not describe standard machine learning dataset splits (e.g., train/validation/test sets with percentages or counts), as BigCodeBench is an evaluation benchmark for LLMs rather than a dataset for training models with internal splits.
Hardware Specification | Yes | We perform all the model inference on A100 GPUs, except for the closed ones. For the closed models, we rely on their official APIs provided in the documents. We conduct the execution mainly on the Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz, composed of 2 sockets, with 18 cores per socket.
Software Dependencies | Yes | You are given a file named requirements.txt, please set up a Python environment with version 3.8.10. ... pandas==2.0.3 scikit-learn==1.3.1 requests==2.31.0 matplotlib==3.7.0 seaborn==0.13.2 numpy==1.21.2 numba==0.55.0 cryptography==38.0.0 scipy==1.7.2 nltk==3.8 pytz==2023.3.post1 networkx==2.6.3 statsmodels==0.14.0 lxml==4.9.3 psutil==5.9.5 Django==4.2.7 selenium==4.15. Pillow==10.3.0 beautifulsoup4==4.8.2 datetime==5.5 python-docx==1.1.0 openpyxl==3.1.2 Levenshtein==0.25.0 PyYAML==6.0.1 wordninja==2.0.0 Faker==20.1.0 tensorflow==2.11.1 wordcloud==1.9.3 pytesseract==0.3.10 chardet==5.2.0 python-dateutil==2.9.0 blake3==0.4.1 dnspython==2.6.1 flask==3.0.3 Flask-Mail==0.9.1 flask_login==0.6.3 flask_restful==0.3.10 flask_wtf==1.2.1 folium==0.16.0 geopy==2.4.1 keras==2.11.0 librosa==0.10.1 mechanize==0.4.9 prettytable==3.10.0 pycryptodome==3.14.1 python_http_client==3.3.7 Requests==2.31.0 requests_mock==1.11.0 rsa==4.9 sendgrid==6.11.0 soundfile==0.12.1 texttable==1.7.0 Werkzeug==3.0.1 WTForms==3.1.2 xlrd==2.0.1 xlwt==1.3.0 xmltodict==0.13.0 python-Levenshtein-wheels gensim==4.3.2 sympy==1.12 pyfakefs==5.4.1 textblob==0.18.0 docxtpl==0.11.5 statsmodels==0.14.0 pyquery==1.4.3 holidays==0.29 scikit-image==0.18.0 natsort==7.1.1 shapely==2.0.4 geopandas==0.13.2 opencv-python-headless==4.9.0.80 xlrd==2.0.1 pytest==8.2.0 wikipedia==1.4.0
Experiment Setup | Yes | Our evaluation uses the unbiased version of Pass@K (Chen et al., 2021) to accurately assess the functional correctness of generated code snippets by LLMs. To make general observations, we extensively evaluate 60 state-of-the-art LLMs on BigCodeBench-Complete and 35 instruction-tuned LLMs on BigCodeBench-Instruct. Specifically, following prior works (Roziere et al., 2023; Liu et al., 2024; Lai et al., 2023), we report Pass@1 with greedy decoding for the main experiments in the zero-shot setting. To investigate more thoroughly, we compute Pass@1 and Pass@5 results with random sampling to generate N (N=5) samples with a temperature of 0.8 and top-p of 0.95 in Appendix L. We use the same prompts for code generation from (Liu et al., 2024), given in Appendix K.
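The unbiased Pass@K estimator from Chen et al. (2021) referenced above can be sketched as follows; this is a minimal re-implementation for illustration (the function name `pass_at_k` is our own, not from the paper's code):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@K estimator (Chen et al., 2021).

    n: total samples generated per task
    c: number of those samples that pass all test cases
    k: the K in Pass@K
    Returns the probability that at least one of k samples drawn
    uniformly without replacement is correct:
        1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # any draw of k samples must include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# With N = 5 samples per task (temperature 0.8, as in the setup above),
# a task where 2 of 5 samples pass gives:
#   Pass@1 = 1 - C(3, 1) / C(5, 1) = 1 - 3/5 = 0.4
#   Pass@5 = 1.0 (all 5 samples are drawn, at least one passes)
```

Averaging `pass_at_k` over all 1,140 tasks yields the benchmark-level Pass@K score; greedy decoding corresponds to the special case n = k = 1.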