Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ALLaM: Large Language Models for Arabic and English

Authors: M Saiful Bari, Yazeed Alnumay, Norah Alzahrani, Nouf Alotaibi, Hisham Alyahya, Sultan AlRashed, Faisal Mirza, Shaykhah Alsubaie, Hassan Alahmed, Ghadah Alabduljabbar, Raghad Alkhathran, Yousef Almushayqih, Raneem Alnajim, Salman I Alsubaihi, Maryam Al Mansour, Saad Hassan, Majed Alrubaian, Ali Alammari, Zaki Alawami, Abdulmohsen Al-Thubaity, Ahmed Abdelali, Jeril Kuriakose, Abdalghani Abujabal, Nora Al-Twairesh, Areeb Alowisheq, Haidar Khan

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | ALLaM achieves state-of-the-art performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve both in Arabic and English from their base aligned models. Arabic assets are released in Hugging Face. ... Figure 1: Performance on Arabic (Koto et al., 2024) and English (Hendrycks et al., 2020) MMLU Benchmarks. ALLaM (red line) shows impressive improvement from its base model Llama-2 (yellow line). All evaluations were done on the latest version of the fine-tuned (chat or instruct) models. The ALLaM 7B from scratch model also shows significant improvement over the ALLaM 7B continued pretraining model. ... Figure 3: Measuring the effect of adding machine translated Arabic data to pretraining. ... Figure 4: We determine the optimal Arabic/English language mixture ... by conducting ablations over 6 Arabic/English ratios. ... Figure 8: Selected benchmark evaluated through ALLaM's training.
Researcher Affiliation | Industry | M Saiful Bari, Yazeed Alnumay, Norah A. Alzahrani, Nouf M. Alotaibi, Hisham A. Alyahya, Sultan Al Rashed, Faisal A. Mirza, Shaykhah Z. Alsubaie, Hassan A. Alahmed, Ghadah Alabduljabbar, Raghad Alkhathran, Yousef Almushayqih, Raneem Alnajim, Salman Alsubaihi, Maryam Al Mansour, Majed Alrubaian, Ali Alammari, Zaki Alawami, Abdulmohsen Al-Thubaity, Ahmed Abdelali, Jeril Kuriakose, Abdalghani Abujabal, Nora Al-Twairesh, Areeb Alowisheq, Haidar Khan. National Center for AI (NCAI), Saudi Data and AI Authority (SDAIA), Riyadh, Saudi Arabia.
Pseudocode | No | The paper describes its methodology in detailed prose and illustrates concepts with figures, such as Figure 5 showing an embedding initialization overview and Figure 12 explaining the augmentation process for conversations. However, it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Arabic assets are released in Hugging Face: https://huggingface.co/ALLaM-AI ... Our 13B model [was released] on the IBM Watson X platform in May 2024. Our 7B model, pretrained from scratch, is now available on Microsoft Azure. (A minimal loading sketch is given after the table.)
Open Datasets | Yes | For English, many high quality and large scale datasets are available for pretraining (Together Computer, 2023; Soldaini et al., 2024; Gao et al., 2021; Penedo et al., 2023). We harnessed subsets from RedPajama (Together Computer, 2023), FineWeb (Penedo et al., 2024), Dolma-v1 (Soldaini et al., 2024) and Pile (Gao et al., 2021) datasets, e.g., Dolma-CC, The Stack (Kocetkov et al., 2022), peS2o, PubMed, DM-Math (Saxton et al., 2019) and Stack Exchange (Soboleva et al., 2023).
Dataset Splits | Yes | To build a performant model in both Arabic and English, we conducted experiments to determine the optimal language mix. Figure 4 gives an overview of data-mixture experiments on our curated Arabic-English corpus. We conducted the experiments with the same sampling ratio (Table 1) and data order. ... Table 1 shows the language and category mixing distributions for English, Arabic natural, Arabic translated and final mix. ... Our SFT data is curated from a diverse array of sources. ... Ultra-Instinct includes 12M samples evenly split between English and Arabic, while the second version (v2) is a reduced version with half the samples. (A ratio-sampling sketch is given after the table.)
Hardware Specification | Yes | Over the course of our development of ALLaM, we had access to 128-1024 A100 GPUs. Our GPU cluster was equipped with InfiniBand connections to enable high-speed communication between nodes.
Software Dependencies | No | The paper mentions using "Megatron-LM", "Flash Attention", and "sentencepiece" for various components of the training and tokenization process. However, specific version numbers for these software dependencies are not provided in the text. (A version-logging sketch is given after the table.)
Experiment Setup | Yes | In all of our continued pretraining experiments, we used the final learning rate of the pretrained language model (usually 3 x 10^-5). ... We match hyperparameters and architecture for pretraining from scratch with Touvron et al. (2023a), including 4M tokens per batch and max LR 3 x 10^-4 decayed to 3 x 10^-5 with a cosine schedule. ... We fine-tune our base model ... for 3 epochs using Ultra-Instinct-v2 with a learning rate of 5 x 10^-6 and a batch size of 1024. ... For DPO, we used a batch size of 512 with a KL penalty of 0.1 and a learning rate of 9 x 10^-7 decayed to 5 x 10^-7 using a cosine annealing learning rate schedule. We train ALLaM for a single epoch using all the preference data. (A cosine-schedule sketch is given after the table.)
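
Since the Open Source Code row reports that Arabic assets are released on Hugging Face, a minimal sketch of loading one of the published checkpoints with the transformers library might look like the following. The repository id used here (ALLaM-AI/ALLaM-7B-Instruct-preview) is an assumption and should be checked against https://huggingface.co/ALLaM-AI.

```python
# Hypothetical example: loading a released ALLaM checkpoint from Hugging Face.
# The repository id below is an assumption; verify the actual model names on
# https://huggingface.co/ALLaM-AI before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ALLaM-AI/ALLaM-7B-Instruct-preview"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "ما هي عاصمة المملكة العربية السعودية؟"  # "What is the capital of Saudi Arabia?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```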
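The Arabic/English mixture ablations quoted under Dataset Splits amount to sampling pretraining data according to fixed per-language ratios. A minimal sketch of such ratio-based sampling is shown below; the ratios and bucket names are placeholders, not the values from Table 1 of the paper.

```python
import random

# Sketch of fixed-ratio language mixing for pretraining batches.
# The weights below are placeholders, NOT the distributions from Table 1.
mixture = {"arabic_natural": 0.30, "arabic_translated": 0.15, "english": 0.55}

def sample_language(rng: random.Random) -> str:
    """Pick a language bucket according to the mixture weights."""
    r = rng.random()
    cumulative = 0.0
    for name, weight in mixture.items():
        cumulative += weight
        if r < cumulative:
            return name
    return name  # fall through on floating-point edge cases

rng = random.Random(0)
counts = {name: 0 for name in mixture}
for _ in range(10_000):
    counts[sample_language(rng)] += 1
print(counts)  # empirical counts should track the mixture weights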
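Because the Software Dependencies row flags missing version numbers for Megatron-LM, Flash Attention, and sentencepiece, one lightweight way to record the versions actually installed in a training environment is importlib.metadata. The package names below are assumptions about the usual PyPI distributions (flash-attn, sentencepiece, megatron-core); Megatron-LM is often used from a source checkout rather than installed as a package.

```python
# Sketch: log installed versions of key dependencies for a reproducibility report.
# Package names are assumptions about the usual PyPI distributions; Megatron-LM
# is frequently vendored from source and may not be pip-installed at all.
from importlib import metadata

for package in ("flash-attn", "sentencepiece", "megatron-core"):
    try:
        print(f"{package}=={metadata.version(package)}")
    except metadata.PackageNotFoundError:
        print(f"{package}: not installed as a package (possibly a source checkout)")
```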
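The Experiment Setup row quotes a cosine decay from a peak learning rate of 3 x 10^-4 down to 3 x 10^-5 for pretraining from scratch. A minimal sketch of that schedule follows; the warmup and total step counts are placeholders, not values reported in the paper.

```python
import math

# Sketch of a cosine learning-rate schedule decaying from max_lr to min_lr,
# matching the 3e-4 -> 3e-5 pretraining decay quoted above. Warmup and total
# step counts are placeholders, not values from the paper.
MAX_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 2_000, 100_000  # placeholder values

def learning_rate(step: int) -> float:
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS              # linear warmup
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + (MAX_LR - MIN_LR) * cosine

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{learning_rate(step):.2e}")
```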