BitNet: 1-bit Pre-training for Large Language Models

Authors: Hongyu Wang, Shuming Ma, Lingxiao Ma, Lei Wang, Wenhui Wang, Li Dong, Shaohan Huang, Huaijie Wang, Jilong Xue, Ruiping Wang, Yi Wu, Furu Wei

JMLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | Experimental results show that BitNet b1 achieves performance competitive with state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. We evaluate BitNet b1 and BitNet b1.58 on a range of language modeling benchmarks.
Researcher Affiliation | Collaboration | Hongyu Wang and Ruiping Wang are with the Key Laboratory of AI Safety of the Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, and the University of Chinese Academy of Sciences, Beijing. Shuming Ma, Lingxiao Ma, Wenhui Wang, Li Dong, Shaohan Huang, Jilong Xue, and Furu Wei are with Microsoft Research. Lei Wang is with the University of Chinese Academy of Sciences. Huaijie Wang and Yi Wu are with the Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing.
Pseudocode | Yes | Furthermore, we also present the pseudocode of BitNet b1.58 and BitNet b1 in Appendices D and E, respectively. (Appendices D and E contain Python code snippets implementing BitLinear.)
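As context for the BitLinear pseudocode referenced above, the following is a minimal pure-Python sketch of the forward pass that BitNet b1.58 describes: absmean quantization of weights to the ternary set {-1, 0, 1} plus per-token 8-bit absmax quantization of activations, with the integer product rescaled afterwards. Function names and the `eps` guard are illustrative, not taken from the paper's appendix, and a real implementation would operate on tensors with a straight-through estimator for training.

```python
# Hedged sketch of a BitNet b1.58-style BitLinear forward pass.
# Names (quantize_weights_ternary, bitlinear_forward, eps) are illustrative.

def quantize_weights_ternary(W, eps=1e-5):
    """Absmean quantization: scale by the mean |w|, round, clip to {-1, 0, 1}."""
    beta = sum(abs(w) for row in W for w in row) / (len(W) * len(W[0])) + eps
    Wq = [[max(-1, min(1, round(w / beta))) for w in row] for row in W]
    return Wq, beta

def quantize_activations_int8(x, eps=1e-5):
    """Per-token absmax quantization to the signed 8-bit range [-127, 127]."""
    gamma = max(abs(v) for v in x) + eps
    xq = [max(-127, min(127, round(v * 127.0 / gamma))) for v in x]
    return xq, gamma

def bitlinear_forward(x, W):
    """Integer matmul of quantized inputs, rescaled by beta * gamma / 127."""
    Wq, beta = quantize_weights_ternary(W)
    xq, gamma = quantize_activations_int8(x)
    y_int = [sum(xi * wi for xi, wi in zip(xq, row)) for row in Wq]
    return [y * beta * gamma / 127.0 for y in y_int]
```

Because the quantized weights are ternary, the inner matmul reduces to additions and subtractions, which is the source of the efficiency gains the paper reports.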
Open Source Code | No | The paper refers to external tools such as FasterTransformer and lm-evaluation-harness, but it neither states explicitly that the code for the described methodology is open-sourced nor links to a code repository. Appendices D and E contain illustrative PyTorch snippets, but these do not constitute concrete access to the full source code for the methodology.
Open Datasets | Yes | The models are trained on an English-language corpus consisting of the Pile dataset, Common Crawl snapshots, RealNews, and CC-Stories. We pre-trained the models on the RedPajama dataset (Computer, 2023) for 100 billion tokens. We test both the 0-shot and 4-shot results on four downstream tasks: Hellaswag (Zellers et al., 2019), Winogrande (Sakaguchi et al., 2020), Winograd (Levesque et al., 2012), and StoryCloze (Mostafazadeh et al., 2016).
Dataset Splits | No | The paper mentions various datasets used for training and evaluation (e.g., the Pile, RedPajama, Hellaswag, Winogrande), but it does not provide the training/validation/test splits (percentages, sample counts, or partitioning procedure) needed to reproduce the experiments. While 'validation perplexity' is mentioned, the split details are absent.
Hardware Specification | Yes | We compare the throughput of BitNet b1.58 and LLaMA LLM with 70B parameters on two 80GB A100 cards, using pipeline parallelism (Huang et al., 2019) so that LLaMA LLM 70B can run on the devices.
Software Dependencies | No | The paper mentions using nn.Linear in PyTorch and refers to the FasterTransformer and lm-evaluation-harness codebases, but it does not specify version numbers for PyTorch or any other software dependencies needed to replicate the experiments.
Experiment Setup | Yes | Table 10 presents the configuration and hyper-parameters for BitNet b1 at each model size in the scaling experiments. Dropout and gradient clipping are disabled during pre-training. For the 13B and 30B models, we set weight decay to 0.05 for training stability. ... We report the configurations and hyper-parameters for BitNet b1.58 in Tables 12 and 13, respectively. For all experiments, the sequence length is set to 2,048 tokens and the batch size to 512, resulting in up to 1M tokens per batch. The Adam β is set to (0.9, 0.95).
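The hyper-parameters quoted in this row can be collected into a single illustrative configuration. The dictionary below is an assumption-laden sketch, not the paper's actual config file: it records only the values stated in the text (values such as the learning rate, which the excerpt elides, are omitted), and the key names are invented for readability.

```python
# Illustrative BitNet pre-training configuration assembled from the
# hyper-parameters quoted above; key names are hypothetical.
config = {
    "sequence_length": 2048,        # tokens per sequence
    "batch_size": 512,              # sequences per batch
    "adam_betas": (0.9, 0.95),
    "dropout": 0.0,                 # disabled during pre-training
    "gradient_clipping": None,      # disabled during pre-training
    "weight_decay_13b_30b": 0.05,   # applied at 13B/30B for training stability
}

# 512 sequences x 2,048 tokens = 1,048,576 tokens, i.e. "up to 1M tokens" per batch.
tokens_per_batch = config["sequence_length"] * config["batch_size"]
```

The token count check makes the batch-size claim concrete: 512 × 2,048 is just over 2^20, matching the "up to 1M tokens" figure in the text.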