CroissantLLM: A Truly Bilingual French-English Language Model

Authors: Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte Miguel Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro Henrique Martins, Antoni Bigata Casademunt, François Yvon, André Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo

TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We introduce CroissantLLM, a 1.3B language model pre-trained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. We evaluate our model through the FMTI framework (Bommasani et al., 2023) and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives.
Researcher Affiliation Collaboration Manuel Faysse1,5 Patrick Fernandes6,8,11 Nuno M. Guerreiro2,5,6,8 António Loison1 Duarte M. Alves6,8 Caio Corro9 Nicolas Boizard4,5 João Alves2 Ricardo Rei2,7,8 Pedro H. Martins2 Antoni Bigata Casademunt10 François Yvon9 André F.T. Martins2,6,8 Gautier Viaud1 Céline Hudelot5 Pierre Colombo3,5 1Illuin Technology 2Unbabel 3Equall 4Diabolocom 5MICS, CentraleSupélec 6Instituto de Telecomunicações, Lisboa 7INESC-ID, Lisboa 8Instituto Superior Técnico, Universidade de Lisboa 9Sorbonne Université, CNRS, ISIR, Paris 10Imperial College London 11Language Technologies Institute, Carnegie Mellon University This work is a collaboration of academic and industrial partners. On the academic side, core authors are affiliated with CentraleSupélec (Université Paris-Saclay) and Instituto Superior Técnico de Lisboa, and other contributors are linked to Sorbonne Université and Imperial College London. On the industrial side, core authors receive funding from, respectively, Illuin Technology (Paris), Unbabel (Lisboa), and Equall (New York, Lisboa, Paris).
Pseudocode No The paper includes a code example for Depth-First Search in Appendix C.4, but this is an illustrative example, not structured pseudocode or an algorithm block for the main methodology of CroissantLLM training. The core methodology of the LLM itself is a transformer architecture, which is described textually rather than through pseudocode.
Open Source Code Yes Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. Code for dataset collection and filtering is available at https://github.com/ManuelFay/llm-data-hub. Code for model training is hosted at https://github.com/CoderPat/croissant-llm-training. Datasets and model checkpoints are available at https://huggingface.co/CroissantLLM.
Open Datasets Yes We introduce CroissantLLM, a 1.3B language model pre-trained on a set of 3T English and French tokens... We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. Datasets and model checkpoints are available at https://huggingface.co/CroissantLLM. Our English data is primarily drawn from the SlimPajama corpus (Soboleva et al., 2023)... We extract subsets of sentence pairs spanning multiple domains from the OPUS corpus (Tiedemann, 2012).
Dataset Splits Yes FQuAD (d'Hoffschmidt et al., 2020) is a French Question Answering dataset... we rely on its public evaluation split for 4 of the FrenchBench tasks. Translation capabilities are evaluated through the test set of the 2014 WMT French-English and English-French tasks (Alves et al., 2023). Belebele is a challenging reading comprehension dataset, with multiple choices, released across 122 languages in parallel format (Bandarkar et al., 2023). We leverage the English and French splits.
Hardware Specification Yes Training is done on a dedicated Nvidia A100 SXM4 80 GB supercomputer partition with 30 octo-GPU nodes.
Software Dependencies No We train our models on a modified version of Megatron-DeepSpeed, a training framework built on top of PyTorch. ... We rely on the Hugging Face Transformers and Datasets libraries for model and data manipulation. We use SentencePiece to train a Byte-Pair Encoding tokenizer. While several software components are mentioned, specific version numbers for PyTorch, Megatron-DeepSpeed, Hugging Face Transformers/Datasets, or SentencePiece are not provided.
Experiment Setup Yes We set the micro-batch size per device to 8 sequences of length 2048, and use 4 gradient accumulation steps, resulting in a total batch size of 8 * 4 * 30 * 8 = 7680 samples, or 7680 * 2048 = 15,728,640 tokens. Standard Cross-Entropy losses are used on a Causal Language Modeling objective. We train with a max learning rate of 3e-4, 1000 warmup steps, and a cosine learning rate schedule decaying to a minimum value of 1e-5.
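The batch-size arithmetic and learning-rate schedule quoted above can be sketched in a few lines of Python. This is an illustrative reconstruction from the reported hyperparameters, not code from the CroissantLLM training repository; the function and variable names are hypothetical, and the linear-warmup-then-cosine-decay shape is the standard interpretation of "1000 warmup steps and a cosine learning rate with a minimum value of 1e-5".

```python
import math

# Reported hyperparameters (from the paper's experiment setup).
MICRO_BATCH = 8       # sequences per device
GRAD_ACCUM = 4        # gradient accumulation steps
NODES = 30            # octo-GPU nodes
GPUS_PER_NODE = 8
SEQ_LEN = 2048        # tokens per sequence

# Total batch size per optimizer step, in samples and in tokens.
samples_per_step = MICRO_BATCH * GRAD_ACCUM * NODES * GPUS_PER_NODE
tokens_per_step = samples_per_step * SEQ_LEN

MAX_LR = 3e-4
MIN_LR = 1e-5
WARMUP_STEPS = 1000

def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(samples_per_step)  # 7680
print(tokens_per_step)   # 15728640
```

The totals reproduce the paper's figures of 7,680 samples and 15,728,640 tokens per step; `lr_at` peaks at 3e-4 right after warmup and decays to 1e-5 at the final step.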