The Stack: 3 TB of permissively licensed source code
Authors: Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Yacine Jernite, Margaret Mitchell, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train 350M decoder-only transformers on several Python subsets of the data and find that removing near-duplicates significantly boosts performance in all experiments. We show it is possible to reproduce text2code performance of Codex (Chen et al., 2021) and CodeGen (Nijkamp et al., 2022) by only using permissively licensed data. ... We report the HumanEval and MBPP results in Tables 5 and 6, respectively. |
| Researcher Affiliation | Industry | Denis Kocetkov (ServiceNow Research); Loubna Ben Allal (Hugging Face); Jia Li (Independent Researcher); Chenghao Mou (Independent Researcher); Carlos Muñoz Ferrandis (Hugging Face); Yacine Jernite (Hugging Face); Margaret Mitchell (Hugging Face); Sean Hughes (ServiceNow); Thomas Wolf (Hugging Face); Dzmitry Bahdanau (ServiceNow Research); Leandro von Werra (Hugging Face); Harm de Vries (ServiceNow Research) |
| Pseudocode | No | The paper describes methods for dataset creation, licensing, deduplication, and experimental setup but does not include any explicitly labeled pseudocode or algorithm blocks. The procedural steps are described in paragraph form. |
| Open Source Code | No | The paper mentions: "We release this dataset along with a near-deduplicated version at https://hf.co/BigCode." and "We use a fork of Megatron-LM (Shoeybi et al., 2019) for training." with footnote 14 linking to https://github.com/bigcode-project/Megatron-LM. The first is for the dataset, not the methodology's code. The second is a framework used, not the specific code for their contributions or experiments. |
| Open Datasets | Yes | We present The Stack, a large dataset with 3.1 TB of permissively licensed source code in 30 programming languages. We release this dataset along with a near-deduplicated version at https://hf.co/BigCode. |
| Dataset Splits | No | The paper mentions evaluating on existing benchmarks like HumanEval and MBPP, describing their characteristics (e.g., HumanEval with 164 programming problems, the MBPP test set of 500 examples). However, it does not specify explicit training/test/validation splits (e.g., percentages or counts) for its own dataset, The Stack, used for training models. |
| Hardware Specification | No | The paper states: "Lastly, we are grateful to ServiceNow and Hugging Face for the provided compute resources." This acknowledges compute resources but does not provide specific hardware details such as GPU models, GPU counts, or CPU types. |
| Software Dependencies | No | The paper mentions: "We use a fork of Megatron-LM (Shoeybi et al., 2019) for training." and "The Byte-Pair Encoding tokenizer was trained on a 50-50 mixture of the Pile (Gao et al., 2020) and Python files from The Stack." While Megatron-LM is named, no specific version number is provided for it or any other key software dependencies or libraries. |
| Experiment Setup | Yes | We opt for a 350M parameter model with 24 layers, a hidden dimension of 1024, 16 attention heads, and a sequence length of 2048. The model is trained for 300K iterations with a global batch size of 384 using Adam (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.95, ε = 10⁻⁸, and a weight decay of 0.1. The learning rate, set to 3 × 10⁻⁴, is warmed up for 175 steps and then follows a cosine decay. The model processes 235.9B tokens during training. |
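The near-deduplication noted in the Research Type row (removing near-duplicates "significantly boosts performance") is commonly implemented with MinHash over token shingles, which is the family of technique the paper describes. The following is a minimal pure-Python sketch, not the authors' pipeline: the function names, shingle size, and number of permutations are illustrative assumptions.

```python
import hashlib
import re

def shingles(code: str, k: int = 5) -> set:
    """Tokenize source code on whitespace and return k-token shingles."""
    tokens = re.split(r"\s+", code.strip())
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def minhash_signature(items: set, num_perm: int = 128) -> list:
    """MinHash signature: for each of num_perm seeded hash functions,
    keep the minimum hash value over all shingles."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big")
            for s in items))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity;
    pairs above a chosen threshold are treated as near-duplicates."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two nearly identical files should score high; unrelated files score low.
a = "def add(x, y):\n    return x + y"
b = "def add(x, y):\n    return x + y  # sum"
sim = estimated_jaccard(minhash_signature(shingles(a)), minhash_signature(shingles(b)))
```

At corpus scale, pairwise comparison is replaced with locality-sensitive hashing over the signatures so that only candidate pairs are compared.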