SWEb: A Large Web Dataset for the Scandinavian Languages

Authors: Tobias Norlund, Tim Isbister, Amaru Cuba Gyllensten, Paul dos Santos, Danila Petrelli, Ariel Ekgren, Magnus Sahlgren

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To assess this question we conduct experiments against the recently proposed FineWeb pipeline (Penedo et al., 2024). We do this by performing a data ablation experiment. Here, we train two language models on data produced by 1) our pipeline and 2) the FineWeb pipeline, respectively. We then evaluate the language models as a proxy for evaluating the datasets and, in turn, the pipelines. (...) In our experiments, we see early and consistently increased performance as we train on successively more data, which speaks for it being a suitable indicator for performance at larger scales. (...) Next, we evaluate MSW and MFW on HP-MEK, and plot learning curves in Figure 8.
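The checkpoint-level comparison implied by the ablation (evaluating both models' learning curves on a shared benchmark) can be sketched as follows. This is an illustrative helper, not the paper's evaluation code; the function name and the example scores are assumptions:

```python
def compare_learning_curves(scores_a, scores_b):
    # Count at how many shared checkpoints each model leads on the
    # benchmark (illustrative sketch of the MSW-vs-MFW comparison).
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    return wins_a, wins_b

# Hypothetical per-checkpoint accuracies for two models
result = compare_learning_curves([0.50, 0.60, 0.70], [0.40, 0.65, 0.60])
```

Comparing whole learning curves, rather than a single final score, is what lets checkpoint-level evaluation act as a proxy for dataset quality.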
Researcher Affiliation | Industry | Tobias Norlund, Tim Isbister, Amaru Cuba Gyllensten, Paul dos Santos, Danila Petrelli, Ariel Ekgren, Magnus Sahlgren (AI Sweden)
Pseudocode | No | The paper describes the pipeline steps in text and flowcharts (Figure 1) and uses mathematical formulas (Equations 1 and 2), but does not include any explicit pseudocode blocks or algorithms.
Open Source Code | Yes | All data, models and code are shared openly. (...) Code and extractor model are available here: https://github.com/aidotse/SWEb
Open Datasets | Yes | We release the largest pretraining dataset to date for the Scandinavian languages: Scandinavian WEb (SWEb). (...) Data available here: https://huggingface.co/datasets/AI-Sweden-Models/SWEb
Dataset Splits | Yes | We split the two datasets into 90/10 train/test splits and tokenize using the GPT-SW3 tokenizer (Ekgren et al., 2024). Then, we train small language models on each training set respectively (MSW for SWEb and MFW for FineWeb)...
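The 90/10 split described above can be sketched as follows. This is a minimal illustration only: the paper does not specify the shuffling or seeding procedure, so those details are assumptions:

```python
import random

def train_test_split(docs, test_frac=0.10, seed=0):
    # Shuffle documents and hold out a test fraction (the seed and the
    # shuffle strategy are illustrative assumptions, not from the paper).
    rng = random.Random(seed)
    docs = list(docs)
    rng.shuffle(docs)
    n_test = int(len(docs) * test_frac)
    return docs[n_test:], docs[:n_test]

train, test = train_test_split([f"doc{i}" for i in range(1000)])
# 900 training documents, 100 held-out test documents
```

Splitting at the document level (rather than at the token level) keeps each held-out document intact, which avoids leaking test text into training.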
Hardware Specification | Yes | In extracting SWEb, we consumed 20k AMD MI250X GPU-hours, which is a significant amount, but compared to the budgets required for training the downstream LLMs it is still negligible.
Software Dependencies | No | The paper mentions software such as Pandoc, the Longformer architecture, the Adam optimizer, Fix Text For You (ftfy), and the GPT-SW3 tokenizer, but specific version numbers for these components are not provided in the main text.
Experiment Setup | Yes | We use the Adam optimizer with a constant learning rate of 1e-5. (...) We split the two datasets into 90/10 train/test splits and tokenize using the GPT-SW3 tokenizer (Ekgren et al., 2024). Then, we train small language models on each training set respectively (MSW for SWEb and MFW for FineWeb), and use the Llama architecture with 1.82B parameters (including embeddings), a 2048 sequence length, a global batch size of 2 million tokens, and a cosine-decay learning-rate schedule. Each model is trained for 10,811 steps, which corresponds to one full epoch for SWEb and 1.6 epochs for FineWeb. We checkpoint every 250 steps to evaluate progression throughout training.
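The cosine-decay schedule quoted for the language-model training can be sketched as below. Note that the quoted constant 1e-5 learning rate belongs to the extractor's Adam optimizer; the peak and minimum rates for the LM runs are not stated, so the values here are placeholders:

```python
import math

def cosine_decay_lr(step, total_steps, peak_lr, min_lr=0.0):
    # Standard cosine decay from peak_lr at step 0 to min_lr at
    # total_steps (no warmup; the paper does not describe one).
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Placeholder peak LR of 3e-4; 10,811 steps matches the reported run length.
schedule = [cosine_decay_lr(s, 10811, 3e-4) for s in (0, 5405, 10811)]
```

Frameworks usually provide this schedule directly (e.g. a cosine-annealing scheduler in PyTorch), so the closed-form version above is mainly useful for checking the intended curve.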