CroissantLLM: A Truly Bilingual French-English Language Model

Authors: Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte Miguel Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro Henrique Martins, Antoni Bigata Casademunt, François Yvon, André Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo

TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We introduce CroissantLLM, a 1.3B language model pre-trained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. We evaluate our model through the FMTI framework (Bommasani et al., 2023) and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives.
Researcher Affiliation Collaboration Manuel Faysse1,5 Patrick Fernandes6,8,11 Nuno M. Guerreiro2,5,6,8 António Loison1 Duarte M. Alves6,8 Caio Corro9 Nicolas Boizard4,5 João Alves2 Ricardo Rei2,7,8 Pedro H. Martins2 Antoni Bigata Casademunt10 François Yvon9 André F.T. Martins2,6,8 Gautier Viaud1 Céline Hudelot5 Pierre Colombo3,5 1Illuin Technology 2Unbabel 3Equall 4Diabolocom 5MICS, CentraleSupélec 6Instituto de Telecomunicações, Lisboa 7INESC-ID, Lisboa 8Instituto Superior Técnico, Universidade de Lisboa 9Sorbonne Université, CNRS, ISIR, Paris 10Imperial College London 11Language Technologies Institute, Carnegie Mellon University This work is a collaboration of academic and industrial partners. On the academic side, core authors are affiliated with CentraleSupélec (Université Paris-Saclay) and Instituto Superior Técnico de Lisboa, and other contributors are linked to Sorbonne Université and Imperial College London. On the industrial side, core authors receive funding from, respectively, Illuin Technology (Paris), Unbabel (Lisboa), and Equall (New York, Lisboa, Paris).
Pseudocode No The paper includes a code example for Depth-First Search in Appendix C.4, but this is an illustrative example, not structured pseudocode or an algorithm block for the main methodology of CroissantLLM training. The core methodology of the LLM itself is a transformer architecture, which is described textually rather than through pseudocode.
Open Source Code Yes Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. Code for dataset collection and filtering is available at https://github.com/ManuelFay/llm-data-hub. Code for model training is hosted at https://github.com/CoderPat/croissant-llm-training. Datasets and model checkpoints are available at https://huggingface.co/CroissantLLM.
Open Datasets Yes We introduce CroissantLLM, a 1.3B language model pre-trained on a set of 3T English and French tokens... We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. Datasets and model checkpoints are available at https://huggingface.co/CroissantLLM. Our English data is primarily drawn from the SlimPajama corpus (Soboleva et al., 2023)... We extract subsets of sentence pairs spanning multiple domains from the OPUS corpus (Tiedemann, 2012).
Dataset Splits Yes FQuAD (d'Hoffschmidt et al., 2020) is a French Question Answering dataset... we rely on its public evaluation split for 4 of the FrenchBench tasks. Translation capabilities are evaluated through the test set of the 2014 WMT French-English and English-French tasks (Alves et al., 2023). Belebele is a challenging reading comprehension dataset, with multiple choices, released across 122 languages in parallel format (Bandarkar et al., 2023). We leverage the English and French splits.
Hardware Specification Yes Training is done on a dedicated Nvidia A100 SXM4 80 GB supercomputer partition with 30 octo-GPU nodes.
Software Dependencies No We train our models on a modified version of Megatron-DeepSpeed, a training framework built on top of PyTorch. ... We rely on the Hugging Face Transformers and Datasets libraries for model and data manipulation. We use SentencePiece to train a Byte-Pair Encoding tokenizer. While several software components are mentioned, specific version numbers for PyTorch, Megatron-DeepSpeed, Hugging Face Transformers/Datasets, or SentencePiece are not provided.
Experiment Setup Yes We set the micro-batch size per device to 8 sequences of length 2048, and use 4 gradient accumulation steps, resulting in a total batch size of 8 * 4 * 30 * 8 = 7680 samples, or 7680 * 2048 = 15,728,640 tokens. Standard Cross-Entropy losses are used on a Causal Language Modeling objective. We train with a max learning rate of 3e-4, 1000 warmup steps, and a cosine learning rate schedule decaying to a minimum value of 1e-5.
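The batch-size arithmetic and learning-rate schedule quoted above can be sketched in a few lines of Python. This is an illustrative reconstruction from the reported hyperparameters, not code from the CroissantLLM training repository; the function and variable names are hypothetical, and the linear-warmup-then-cosine-decay shape is the standard interpretation of "1000 warmup steps and a cosine learning rate with a minimum value of 1e-5".

```python
import math

# Reported hyperparameters (from the paper's experiment setup).
MICRO_BATCH = 8       # sequences per device
GRAD_ACCUM = 4        # gradient accumulation steps
NODES = 30            # octo-GPU nodes
GPUS_PER_NODE = 8
SEQ_LEN = 2048        # tokens per sequence

# Total batch size per optimizer step, in samples and in tokens.
samples_per_step = MICRO_BATCH * GRAD_ACCUM * NODES * GPUS_PER_NODE
tokens_per_step = samples_per_step * SEQ_LEN

MAX_LR = 3e-4
MIN_LR = 1e-5
WARMUP_STEPS = 1000

def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(samples_per_step)  # 7680
print(tokens_per_step)   # 15728640
```

The totals reproduce the paper's figures of 7,680 samples and 15,728,640 tokens per step; `lr_at` peaks at 3e-4 right after warmup and decays to 1e-5 at the final step.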