The Stack: 3 TB of permissively licensed source code

Authors: Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Yacine Jernite, Margaret Mitchell, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We train 350M decoder-only transformers on several Python subsets of the data and find that removing near-duplicates significantly boosts performance in all experiments. We show it is possible to reproduce the text2code performance of Codex (Chen et al., 2021) and CodeGen (Nijkamp et al., 2022) by using only permissively licensed data. ... We report the HumanEval and MBPP results in Tables 5 and 6, respectively."
Researcher Affiliation | Industry | Denis Kocetkov (ServiceNow Research); Loubna Ben Allal (Hugging Face); Jia Li (Independent Researcher); Chenghao Mou (Independent Researcher); Carlos Muñoz Ferrandis (Hugging Face); Yacine Jernite (Hugging Face); Margaret Mitchell (Hugging Face); Sean Hughes (ServiceNow); Thomas Wolf (Hugging Face); Dzmitry Bahdanau (ServiceNow Research); Leandro von Werra (Hugging Face); Harm de Vries (ServiceNow Research)
Pseudocode | No | The paper describes methods for dataset creation, licensing, deduplication, and experimental setup, but does not include any explicitly labeled pseudocode or algorithm blocks; the procedural steps are described in paragraph form.
Open Source Code | No | The paper mentions: "We release this dataset along with a near-deduplicated version at https://hf.co/BigCode." and "We use a fork of Megatron-LM (Shoeybi et al., 2019) for training.", with footnote 14 linking to https://github.com/bigcode-project/Megatron-LM. The first is the dataset, not the methodology's code; the second is a framework they build on, not the specific code for their contributions or experiments.
Open Datasets | Yes | "We present The Stack, a large dataset with 3.1 TB of permissively licensed source code in 30 programming languages. We release this dataset along with a near-deduplicated version at https://hf.co/BigCode."
Dataset Splits | No | The paper mentions evaluating on existing benchmarks such as HumanEval and MBPP, describing their characteristics (e.g., HumanEval has 164 programming problems; the MBPP test set has 500 examples). However, it does not specify explicit training/validation/test splits (e.g., percentages or counts) for its own dataset, The Stack, used to train the models.
Hardware Specification | No | The paper states: "Lastly, we are grateful to ServiceNow and Hugging Face for the provided compute resources." This acknowledges compute resources but does not give specific hardware details such as GPU or CPU models.
Software Dependencies | No | The paper mentions: "We use a fork of Megatron-LM (Shoeybi et al., 2019) for training." (footnote 14) and "The Byte-Pair Encoding tokenizer was trained on a 50-50 mixture of the Pile (Gao et al., 2020) and Python files from The Stack." While Megatron-LM is named, no version numbers are given for it or for any other key software dependencies.
Experiment Setup | Yes | "We opt for a 350M parameter model with 24 layers, a hidden dimension of 1024, 16 attention heads, and a sequence length of 2048. The model is trained for 300K iterations with a global batch size of 384 using Adam (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.95, ϵ = 10⁻⁸ and a weight decay of 0.1. The learning rate, set to 3 × 10⁻⁴, is warmed up for 175 steps, then follows a cosine decay. The model processes 235.9B tokens during training."
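The near-deduplication step discussed in the rows above (the paper reports it "significantly boosts performance in all experiments") can be illustrated with a minimal sketch. The paper's pipeline uses MinHash-based approximate matching to scale to terabytes; the exact-Jaccard version below, over character 5-gram shingles, only illustrates the underlying idea, and the 0.7 threshold is illustrative, not the paper's setting.

```python
# Minimal sketch of near-deduplication by content similarity.
# Assumptions: exact Jaccard over character 5-gram shingles stands in
# for the paper's MinHash-based pipeline; threshold 0.7 is illustrative.

def shingles(text: str, n: int = 5) -> set:
    """Distinct character n-grams of a source file."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_dedup(files: list[str], threshold: float = 0.7) -> list[str]:
    """Keep each file only if no earlier kept file is a near-duplicate."""
    kept, kept_shingles = [], []
    for f in files:
        s = shingles(f)
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(f)
            kept_shingles.append(s)
    return kept
```

For example, two copies of the same function that differ only by a trailing comment collapse to one entry, while an unrelated file is kept. At The Stack's scale this pairwise comparison is infeasible, which is why MinHash signatures with locality-sensitive hashing are used instead.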
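The learning-rate schedule quoted in the Experiment Setup row (peak 3 × 10⁻⁴, 175 warmup steps, cosine decay over 300K iterations) can be sketched as a standalone function. The minimum learning rate is an assumption on my part; the paper does not state a floor, so this sketch decays to 10% of the peak, a common Megatron-LM-style default.

```python
# Sketch of the warmup + cosine-decay schedule described in the paper.
# MIN_LR is an assumed floor (10% of peak); the paper does not specify one.
import math

PEAK_LR = 3e-4
WARMUP_STEPS = 175
TOTAL_STEPS = 300_000
MIN_LR = PEAK_LR * 0.1  # assumption, not from the paper

def lr_at(step: int) -> float:
    """Learning rate at a given training step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup to the peak learning rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

The rate rises linearly for the first 175 steps, peaks at 3 × 10⁻⁴, and then follows a half-cosine down toward the floor by step 300K.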