SWEb: A Large Web Dataset for the Scandinavian Languages

Authors: Tobias Norlund, Tim Isbister, Amaru Cuba Gyllensten, Paul dos Santos, Danila Petrelli, Ariel Ekgren, Magnus Sahlgren

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To assess this question we conduct experiments against the recently proposed FineWeb pipeline (Penedo et al., 2024). We do this by performing a data ablation experiment. Here, we train two language models on data produced by 1) our pipeline and 2) the FineWeb pipeline, respectively. We then evaluate the language models as a proxy for evaluating the datasets and, in turn, the pipelines. (...) In our experiments, we see early and consistently increased performance as we train on successively more data, which speaks for it being a suitable indicator for performance at larger scales. (...) Next, we evaluate MSW and MFW on HP-MEK, and plot learning curves in Figure 8.
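The checkpoint-level comparison implied by the ablation (evaluating both models' learning curves on a shared benchmark) can be sketched as follows. This is an illustrative helper, not the paper's evaluation code; the function name and the example scores are assumptions:

```python
def compare_learning_curves(scores_a, scores_b):
    # Count at how many shared checkpoints each model leads on the
    # benchmark (illustrative sketch of the MSW-vs-MFW comparison).
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    return wins_a, wins_b

# Hypothetical per-checkpoint accuracies for two models
result = compare_learning_curves([0.50, 0.60, 0.70], [0.40, 0.65, 0.60])
```

Comparing whole learning curves, rather than a single final score, is what lets checkpoint-level evaluation act as a proxy for dataset quality.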
Researcher Affiliation | Industry | Tobias Norlund, Tim Isbister, Amaru Cuba Gyllensten, Paul dos Santos, Danila Petrelli, Ariel Ekgren, Magnus Sahlgren (AI Sweden)
Pseudocode | No | The paper describes the pipeline steps in text and flowcharts (Figure 1) and uses mathematical formulas (Equations 1 and 2), but does not include any explicit pseudocode blocks or algorithms.
Open Source Code | Yes | All data, models and code are shared openly. (...) Code and extractor model are available here: https://github.com/aidotse/SWEb
Open Datasets | Yes | We release the largest pretraining dataset to date for the Scandinavian languages: Scandinavian WEb (SWEb). (...) Data available here: https://huggingface.co/datasets/AI-Sweden-Models/SWEb
Dataset Splits | Yes | We split the two datasets into 90/10 train/test splits and tokenize using the GPT-SW3 tokenizer (Ekgren et al., 2024). Then, we train small language models on each training set respectively (MSW for SWEb and MFW for FineWeb)...
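The 90/10 split described above can be sketched as follows. This is a minimal illustration only: the paper does not specify the shuffling or seeding procedure, so those details are assumptions:

```python
import random

def train_test_split(docs, test_frac=0.10, seed=0):
    # Shuffle documents and hold out a test fraction (the seed and the
    # shuffle strategy are illustrative assumptions, not from the paper).
    rng = random.Random(seed)
    docs = list(docs)
    rng.shuffle(docs)
    n_test = int(len(docs) * test_frac)
    return docs[n_test:], docs[:n_test]

train, test = train_test_split([f"doc{i}" for i in range(1000)])
# 900 training documents, 100 held-out test documents
```

Splitting at the document level (rather than at the token level) keeps each held-out document intact, which avoids leaking test text into training.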
Hardware Specification | Yes | In extracting SWEb, we consumed 20k AMD MI250X GPU-hours, which is a significant amount, but compared to the budgets required for training the downstream LLMs it is still negligible.
Software Dependencies | No | The paper mentions software such as Pandoc, the Longformer architecture, the Adam optimizer, Fix Text For You (ftfy), and the GPT-SW3 tokenizer, but specific version numbers for these components are not provided in the main text.
Experiment Setup | Yes | We use the Adam optimizer with a constant learning rate of 1e-5. (...) We split the two datasets into 90/10 train/test splits and tokenize using the GPT-SW3 tokenizer (Ekgren et al., 2024). Then, we train small language models on each training set respectively (MSW for SWEb and MFW for FineWeb), and use the Llama architecture with 1.82B parameters (including embeddings), a 2048 sequence length, a global batch size of 2 million tokens, and a cosine-decay learning-rate schedule. Each model is trained for 10,811 steps, which corresponds to one full epoch for SWEb and 1.6 epochs for FineWeb. We checkpoint every 250 steps to evaluate progression throughout training.
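The cosine-decay schedule quoted for the language-model training can be sketched as below. Note that the quoted constant 1e-5 learning rate belongs to the extractor's Adam optimizer; the peak and minimum rates for the LM runs are not stated, so the values here are placeholders:

```python
import math

def cosine_decay_lr(step, total_steps, peak_lr, min_lr=0.0):
    # Standard cosine decay from peak_lr at step 0 to min_lr at
    # total_steps (no warmup; the paper does not describe one).
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Placeholder peak LR of 3e-4; 10,811 steps matches the reported run length.
schedule = [cosine_decay_lr(s, 10811, 3e-4) for s in (0, 5405, 10811)]
```

Frameworks usually provide this schedule directly (e.g. a cosine-annealing scheduler in PyTorch), so the closed-form version above is mainly useful for checking the intended curve.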