StarCoder: may the source be with you!
Authors: Raymond Li, Loubna Ben allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia LI, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Joel Lamy-Poirier, Joao Monteiro, Nicolas Gontier, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Ben Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason T Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Urvashi Bhattacharyya, Wenhao Yu, Sasha Luccioni, Paulo Villegas, Fedor Zhdanov, Tony Lee, Nadav Timor, Jennifer Ding, Claire S Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro Von Werra, Harm de Vries
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. |
| Researcher Affiliation | Collaboration | 1Hugging Face 2ServiceNow Research 3Northeastern University 5Independent 6Mila 7Carnegie Mellon University 8Johns Hopkins University 9Leipzig University 11Queen Mary University of London 13Sea AI Lab 14Technion Israel Institute of Technology 15Monash University 18Saama AI Research Lab 19University of British Columbia 20MIT 21Technical University of Munich 22IBM Research 23University of Vermont 24UnfoldML 25SAP 26University of Notre Dame 27Columbia University 28Discover Dollar Pvt Ltd 29NYU 30University of Allahabad 31Telefonica I+D 32Toloka 33Stanford University 34Weizmann Institute of Science 35The Alan Turing Institute 36Wellesley College 37EleutherAI 38Forschungszentrum Jülich |
| Pseudocode | No | The paper describes methods and procedures in narrative form, without explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | By releasing the StarCoder models with an Open Responsible AI Model license, and by open-sourcing all code repositories for building the model on GitHub, we aim to increase access, reproducibility, and transparency of Code LLMs in the research and developer communities. |
| Open Datasets | Yes | StarCoderBase is trained on 1 trillion tokens sourced from The Stack (Kocetkov et al., 2022), a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. |
| Dataset Splits | Yes | We split the dataset into a training set of 7,878 examples and a test set of 4,000 examples, ensuring that both splits have a balanced representation of the different PII types. See Table 7. |
| Hardware Specification | Yes | We trained our model on a GPU cluster with 512 A100 80 GB GPUs distributed across 64 nodes. |
| Software Dependencies | No | The paper mentions several software tools and libraries, such as the 'Hugging Face Tokenizers library', 'Flash Attention', 'Megatron-LM', 'Jupytext', 'Guesslang', the 'fasttext library', the 'detect-secrets tool', and the 'ipaddress Python package', but does not provide specific version numbers for these key software components required for reproduction. |
| Experiment Setup | Yes | The model was trained for 250k iterations, with a batch size of 4M tokens, for a total of one trillion tokens. We used Adam (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.95, ε = 10^-8 and a weight decay of 0.1. The learning rate followed a cosine decay from 3 × 10^-4 to 3 × 10^-5 after a linear warmup of 2,000 iterations. |
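The learning-rate schedule described in the Experiment Setup row (linear warmup to 3 × 10^-4 over 2,000 iterations, then cosine decay to 3 × 10^-5 by iteration 250k) can be sketched as a small standalone function. This is a minimal illustration of the stated schedule, not code from the paper's training stack; the function name and defaults are our own.

```python
import math

def lr_schedule(step, warmup=2_000, total=250_000, lr_max=3e-4, lr_min=3e-5):
    """Linear warmup to lr_max, then cosine decay to lr_min by `total` steps.

    Mirrors the schedule described in the paper's experiment setup
    (2,000 warmup iterations, 250k total iterations).
    """
    if step < warmup:
        # Linear ramp from 0 to lr_max over the warmup iterations.
        return lr_max * step / warmup
    # Cosine decay: progress goes 0 -> 1 between end of warmup and `total`.
    progress = (step - warmup) / (total - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

At step 2,000 the schedule reaches its peak of 3e-4, and at step 250,000 it bottoms out at 3e-5; in practice the same shape is available off the shelf, e.g. via a PyTorch LR scheduler.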