SWEb: A Large Web Dataset for the Scandinavian Languages
Authors: Tobias Norlund, Tim Isbister, Amaru Cuba Gyllensten, Paul dos Santos, Danila Petrelli, Ariel Ekgren, Magnus Sahlgren
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To assess this question we conduct experiments against the recently proposed FineWeb pipeline (Penedo et al., 2024). We do this by performing a data ablation experiment. Here, we train two language models on data produced by 1) our pipeline and 2) the FineWeb pipeline respectively. We then evaluate the language models as a proxy for evaluating the datasets and, in turn, the pipelines. (...) In our experiments, we see early and consistently increased performance as we train on successively more data, which speaks for it being a suitable indicator for performance at larger scales. (...) Next, we evaluate MSW and MFW on HP-MEK, and plot learning curves in Figure 8. |
| Researcher Affiliation | Industry | Tobias Norlund, Tim Isbister, Amaru Cuba Gyllensten, Paul Dos Santos, Danila Petrelli, Ariel Ekgren, Magnus Sahlgren (AI Sweden) |
| Pseudocode | No | The paper describes the pipeline steps in text and flowcharts (Figure 1), and uses mathematical formulas (equations 1 and 2), but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | All data, models and code are shared openly. (...) Code and extractor model is available here: https://github.com/aidotse/SWEb |
| Open Datasets | Yes | We release the largest to date pretraining dataset for the Scandinavian languages: Scandinavian WEb (SWEb). (...) Data available here: https://huggingface.co/datasets/AI-Sweden-Models/SWEb |
| Dataset Splits | Yes | We split the two datasets in 90/10 train/test splits and tokenize using the GPT-SW3 tokenizer (Ekgren et al., 2024). Then, we train small language models on each training set respectively (MSW for SWEb and MFW for FineWeb)... |
| Hardware Specification | Yes | In extracting SWEb, we consumed 20k AMD MI250X GPU-hours, which is a significant amount, but compared to the budgets required for training the downstream LLMs it is still negligible. |
| Software Dependencies | No | The paper mentions software like Pandoc, the Longformer architecture, the Adam optimizer, ftfy (fixes text for you), and the GPT-SW3 tokenizer. However, specific version numbers for these software components are not provided within the main text. |
| Experiment Setup | Yes | We use the Adam optimizer with a constant learning rate of 1e-5. (...) We split the two datasets in 90/10 train/test splits and tokenize using the GPT-SW3 tokenizer (Ekgren et al., 2024). Then, we train small language models on each training set respectively (MSW for SWEb and MFW for FineWeb), and use the Llama architecture with 1.82B parameters (including embeddings), a 2048 sequence length, a global batch size of 2 million tokens, and a cosine decay learning rate schedule. Each model is trained for 10,811 steps, which corresponds to one full epoch for SWEb and 1.6 epochs for FineWeb. We checkpoint every 250 steps to evaluate progression throughout training. |
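
For the Open Datasets row above, the following is a minimal sketch of how the released SWEb dataset could be inspected from the Hugging Face Hub. The repository id comes from the paper's link; the split name, streaming access, and record layout are assumptions rather than documented details of the release.

```python
# Minimal sketch: stream a few records from the released SWEb dataset.
# Repository id is from the paper's link; split name and field layout are assumptions.
from datasets import load_dataset

# Streaming avoids downloading the full (very large) dataset up front.
sweb = load_dataset("AI-Sweden-Models/SWEb", split="train", streaming=True)

for i, record in enumerate(sweb):
    print(record)  # assumed to contain the extracted markdown text plus source metadata
    if i >= 2:
        break
```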
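
For the Dataset Splits and Experiment Setup rows, here is a hedged sketch of the 90/10 split, GPT-SW3 tokenization, and the quoted ablation hyperparameters. The tokenizer repository id, the text column name, and the split seed are assumptions, and the model width/depth behind the 1.82B-parameter Llama configuration is not stated in the quotes, so only the quoted values are recorded.

```python
# Hypothetical sketch of the data-ablation setup; values not quoted in the table
# (tokenizer repo id, column name, seed) are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/gpt-sw3-126m")  # GPT-SW3 tokenizer (assumed repo id)

# 90/10 train/test split, as quoted above (in practice one would work with a shard or subset).
data = load_dataset("AI-Sweden-Models/SWEb", split="train")
split = data.train_test_split(test_size=0.1, seed=42)  # seed is an assumption

def tokenize(batch):
    # 2048-token sequences, matching the quoted sequence length
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_tokens = split["train"].map(tokenize, batched=True)

# Quoted ablation hyperparameters (Llama architecture, 1.82B parameters incl. embeddings)
GLOBAL_BATCH_TOKENS = 2_000_000  # global batch size of ~2 million tokens
TOTAL_STEPS = 10_811             # one full SWEb epoch (about 1.6 FineWeb epochs)
CHECKPOINT_EVERY = 250           # checkpoint interval used to track progression
LR_SCHEDULE = "cosine"           # cosine decay learning rate schedule
```

Per the Research Type row, each checkpoint is then evaluated on HP-MEK to produce the learning curves the paper reports in its Figure 8.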