McEval: Massively Multilingual Code Evaluation
Authors: Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Jinke, Jiaheng Liu, Tao Sun, Ge Zhang, Changyu Ren, Hongcheng Guo, Zekun Wang, Boyang Wang, Xianjie Wu, Bing Wang, Tongliang Li, Liqun Yang, Sufeng Duan, Zhaoxiang Zhang, Zhoujun Li
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on MCEVAL show that there is still a difficult journey between open-source models and closed-source LLMs in numerous languages. 4 EXPERIMENTS |
| Researcher Affiliation | Academia | 1CCSE, Beihang University, 2University of British Columbia, 3University of Waterloo 4Beijing Information Science and Technology University, 5Shanghai Jiao Tong University |
| Pseudocode | No | The paper includes equations and code examples (Figures 3, 10, 11, 12) but no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The instruction corpora and evaluation benchmark are available at https://github.com/MCEVAL/McEval. |
| Open Datasets | Yes | The instruction corpora and evaluation benchmark are available at https://github.com/MCEVAL/McEval. |
| Dataset Splits | Yes | The resulting dataset, MCEVAL-INSTRUCT (110K samples), is comprised of created question-answer pairs and open-source collection (Wei et al., 2023). We apply data decontamination before training our MCODER. Following Li et al. (2023); Wei et al. (2023), we adopt the N-gram exact match decontamination method with MCEVAL, HumanEval (Chen et al., 2021), MultiPL-E (Cassano et al., 2023), MBPP (Austin et al., 2021). For supervised fine-tuning (SFT), we utilize CodeQwen-1.5 as the foundational code LLMs. Specifically, we select all Python data from MCEVAL-INSTRUCT, comprising 50K training samples, for MCODER-Python training. |
| Hardware Specification | Yes | All MCODER models are fine-tuned using 8 NVIDIA A800-80GB GPUs. |
| Software Dependencies | Yes | We adopt the greedy Pass@1 (%) metric (Kulal et al., 2019; Chen et al., 2021) for our evaluations. For closed-source models, we generate answers through the official API service. For open-source models, we prioritize using vLLM (Kwon et al., 2023) for faster inference if the model is supported by vLLM. Otherwise, we perform inference with the Distributed Data Parallel (DDP) module from PyTorch. For the code generation and code completion tasks, we extract the functional part of the code from the model outputs and combine it with corresponding test cases to form compilable and executable code. For the code explanation task, we adopt a two-pass generation approach (Code-to-Natural-Language and Natural-Language-to-Code). The extraction and execution process for this task is consistent with the previous two tasks. We conduct all evaluations in a Docker environment. Detailed information on the code compilation and execution environment is displayed in Table 6. We have uploaded the Docker image to Docker Hub to facilitate the reproduction of results and the evaluation of new models. Table 6 (Runtime environments for different programming languages): AWK: GNU bash 4.4.20(1)-release; C: gcc 7.5.0; C#: dotnet 8.0.100; CPP: g++ 7.5.0; CoffeeScript: 1.12.7; Common Lisp: SBCL 1.4.5; Dart: SDK 3.3.1; Elixir: 1.3.3; Emacs Lisp: GNU Emacs 25.2.2; Erlang: OTP 20 (erts-9.2); F#: dotnet 8.0.100; Fortran: GNU Fortran 7.5.0; Go: 1.18.4; Groovy: 4.0.16 (JVM 17.0.9); Haskell: GHC 9.4.7; Java: javac 11.0.19; JavaScript: Node.js v16.14.0; Julia: 1.9.4; Kotlin: kotlinc-jvm 1.9.21 (JRE 17.0.9); Lua: 5.4.6; PHP: 7.2.24; Pascal: Free Pascal 3.2.2; Perl: 5.26.1; PowerShell: 7.4.0; Python: 3.8.12; R: 3.4.4; Racket: v6.11; Ruby: 2.5.1p57; Rust: rustc 1.74.0; Scala: 3.3.1; Scheme: Racket v6.11; Shell: GNU bash 4.4.20(1)-release; Swift: 5.9.2; Tcl: tclsh 8.6.11; TypeScript: tsc 5.3.3; Vim Script: Vim 9.0; Visual Basic: dotnet 8.0.100 |
| Experiment Setup | Yes | All MCODER models are fine-tuned using 8 NVIDIA A800-80GB GPUs. The models are trained for 2 epochs with a cosine scheduler, starting at a learning rate of 2e-5 and incorporating a 3% warmup phase. Training a model takes about 5 hours. We used AdamW (Loshchilov & Hutter, 2017) as the optimizer and a batch size of 512 with a sequence truncation length of 4096. |
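The N-gram exact-match decontamination cited under Dataset Splits can be sketched as below. This is a minimal illustration, not the paper's implementation: the n-gram size, whitespace tokenization, and helper names (`ngrams`, `is_contaminated`) are all assumptions.

```python
# Sketch of N-gram exact-match decontamination: drop any training sample
# that shares an exact n-gram with any benchmark text.
# Assumed details: whitespace tokenization, n=5; the paper does not
# specify its tokenizer or n here.
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-grams of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_sample: str, benchmark_texts: list[str], n: int = 5) -> bool:
    """True if the sample shares any exact n-gram with a benchmark text."""
    sample_ngrams = ngrams(train_sample.split(), n)
    return any(sample_ngrams & ngrams(bench.split(), n) for bench in benchmark_texts)

corpus = ["def add(a, b): return a + b  # duplicated benchmark solution"]
bench = ["def add(a, b): return a + b  # duplicated benchmark solution"]
clean = [s for s in corpus if not is_contaminated(s, bench)]  # clean == []
```

The filtered corpus would then be used for SFT; real pipelines typically normalize whitespace and casing before hashing n-grams for scalability.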
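The evaluation loop described under Software Dependencies (extract the functional code, append test cases, execute, score greedy Pass@1) can be sketched as follows. This is an assumed minimal harness: the function names and single-file execution model are illustrative, and the real benchmark runs inside per-language Docker environments.

```python
# Sketch of greedy Pass@1 scoring: one greedy sample per problem,
# solution + test cases executed in an isolated subprocess.
import subprocess
import sys
import tempfile

def run_solution(solution_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Return True if the solution plus its tests runs without error."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def pass_at_1(results: list[bool]) -> float:
    """Greedy Pass@1 (%): fraction of problems whose single sample passes."""
    return 100.0 * sum(results) / len(results)

ok = run_solution("def square(x):\n    return x * x", "assert square(3) == 9")
print(pass_at_1([ok, False]))  # prints 50.0: one of two problems passes
```

Running untrusted model output in a sandboxed container, as the paper does with Docker, is essential in practice; a bare subprocess is only adequate for illustration.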
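The schedule stated under Experiment Setup (peak learning rate 2e-5, cosine decay, 3% warmup) can be written out as a small closed-form function; the linear warmup shape and the total step count below are assumptions for illustration, not values from the paper.

```python
# Sketch of a cosine learning-rate schedule with 3% linear warmup,
# matching the reported peak lr of 2e-5. Step counts are illustrative.
import math

def lr_at(step: int, total_steps: int,
          peak_lr: float = 2e-5, warmup_frac: float = 0.03) -> float:
    """Learning rate at a given optimizer step."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps              # linear warmup to peak
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))  # cosine decay to 0

total = 1000  # hypothetical run length
print(lr_at(0, total), lr_at(30, total), lr_at(total, total))
```

Training frameworks expose equivalents (e.g. a cosine scheduler with warmup in Hugging Face Transformers), so in practice this is a configuration choice rather than hand-written code.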