NeoBERT: A Next Generation BERT
Authors: Lola Le Breton, Quentin Fournier, John Xavier Morris, Mariam El Mezouar, Sarath Chandar
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | NeoBERT... achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT-large, RoBERTa-large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. ... We conduct a series of ablations in controlled settings to evaluate our improvements to the original BERT architecture. ... Section 5 Experiments |
| Researcher Affiliation | Collaboration | Lola Le Breton (1,2,3), Quentin Fournier (2), John X. Morris (4), Mariam El Mezouar (5), Sarath Chandar (1,2,3,6); (1) Chandar Research Lab, (2) Mila - Quebec AI Institute, (3) Polytechnique Montréal, (4) Cornell University, (5) Royal Military College of Canada, (6) Canada CIFAR AI Chair |
| Pseudocode | No | The paper describes methods and procedures in narrative text, but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption. ... https://huggingface.co/chandar-lab/NeoBERT ... https://github.com/chandar-lab/NeoBERT |
| Open Datasets | Yes | Following the same trend, we pre-trained NeoBERT on RefinedWeb (Penedo et al., 2023), a massive dataset containing 600B tokens... The GLUE benchmark (Wang et al., 2019) is a cornerstone of language modeling evaluations... we consider the more recent and challenging MTEB benchmark (Muennighoff et al., 2023)... We trained on the following fully-open datasets: AG-News (Zhang et al., 2016), All-NLI (Bowman et al., 2015; Williams et al., 2018), AmazonQA (Gupta et al., 2019), ConcurrentQA (Arora et al., 2022), GitHub Issues (Li & Li, 2023), GooAQ (Khashabi et al., 2021), MedMCQA (Pal et al., 2022), NPR, PubMedQA (Jin et al., 2019), Sentence Compression (Filippova & Altun, 2013), Stack Exchange, TriviaQA (Han et al., 2019), WikiHow (Koupaee & Wang, 2018), Yahoo! Answers (Zhang et al., 2016), as well as the available training splits of MTEB datasets (StackOverflowDupQuestions, FEVER (Thorne et al., 2018), MS MARCO (Bajaj et al., 2018), STS12, and STSBenchmark (Cer et al., 2017)). |
| Dataset Splits | Yes | Following standard practices, we fine-tune NeoBERT on the development set of GLUE with a classical hyperparameter search... We fine-tune on the training splits of every GLUE dataset for 10 epochs, with evaluation on the validation splits every n steps... |
| Hardware Specification | Yes | NeoBERT was trained on 8 H100 GPUs for 1,050,000 steps, for a total of 6,000 GPU hours. ... Inference is performed for 100 steps on a single A100 GPU |
| Software Dependencies | No | The paper mentions several software components and libraries, such as the AdamW optimizer, DeepSpeed, the ZeRO optimizer, the xFormers library, and FlashAttention. However, it does not provide specific version numbers for any of these components. |
| Experiment Setup | Yes | In the first stage, we train the model for 1M steps (2T tokens) using sequences truncated to a maximum length of 1,024 tokens... In the second stage, we extend the training for an additional 50k steps (100B tokens), increasing the maximum sequence length to 4,096 tokens. ... we use the AdamW optimizer (Loshchilov & Hutter, 2019) with the same hyperparameters as LLaMA 2: β1 = 0.9, β2 = 0.95, and ϵ = 10⁻⁸. ... a linear warmup for 2,000 steps to reach a peak learning rate of 6 × 10⁻⁴, followed by a cosine decay to 10% of the peak learning rate over 90% of the training steps. ... We use a weight decay of 0.1 and apply gradient clipping with a maximum norm of 1.0. ... We use a local batch size of 32, 8 gradient accumulation steps, and a maximum sequence length of 1,024, for a total batch size of 2M tokens. ... We perform a classical parameter search with learning rates in {5e-6, 6e-6, 1e-5, 2e-5, 3e-5}, batch sizes in {4, 8, 16, 32}, and weight decay in {1e-2, 1e-5}. ... Table 6: Optimal hyperparameters for GLUE tasks. |
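The learning-rate schedule quoted in the Experiment Setup row (linear warmup for 2,000 steps to a peak of 6 × 10⁻⁴, then cosine decay to 10% of the peak over 90% of the training steps) can be sketched as a plain function. This is an illustrative reading of the quoted description, not the authors' released code; the function name and the interpretation that the final 10% of training stays at the floor rate are assumptions.

```python
import math

def neobert_lr(step, peak_lr=6e-4, warmup_steps=2_000, total_steps=1_000_000):
    """Illustrative schedule: linear warmup, then cosine decay to 10% of peak.

    Assumption: the cosine decay spans the first 90% of total steps, and the
    learning rate is held at the 10% floor for the remainder of training.
    """
    floor = 0.1 * peak_lr
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr over the first warmup_steps steps.
        return peak_lr * step / warmup_steps
    decay_end = int(0.9 * total_steps)
    if step >= decay_end:
        return floor
    # Cosine interpolation from peak_lr down to the floor.
    progress = (step - warmup_steps) / (decay_end - warmup_steps)
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))
```

In practice, a schedule like this would typically be wired into a trainer via something like PyTorch's `LambdaLR`, with the function returning a multiplier of the base rate instead of an absolute value.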
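The GLUE hyperparameter search described in the same row is an exhaustive grid over learning rate, batch size, and weight decay, i.e. 5 × 4 × 2 = 40 configurations per task. A minimal sketch of that search space (the surrounding fine-tuning loop is omitted, and the function name is hypothetical):

```python
from itertools import product

# Grid values quoted from the paper's GLUE hyperparameter search.
LEARNING_RATES = [5e-6, 6e-6, 1e-5, 2e-5, 3e-5]
BATCH_SIZES = [4, 8, 16, 32]
WEIGHT_DECAYS = [1e-2, 1e-5]

def glue_search_space():
    """Yield every (learning_rate, batch_size, weight_decay) combination."""
    yield from product(LEARNING_RATES, BATCH_SIZES, WEIGHT_DECAYS)

# 40 fine-tuning runs per GLUE task; the best validation score per task
# determines the optimal hyperparameters reported in Table 6.
assert len(list(glue_search_space())) == 40
```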