BitNet: 1-bit Pre-training for Large Language Models

Authors: Hongyu Wang, Shuming Ma, Lingxiao Ma, Lei Wang, Wenhui Wang, Li Dong, Shaohan Huang, Huaijie Wang, Jilong Xue, Ruiping Wang, Yi Wu, Furu Wei

JMLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | Experimental results show that BitNet b1 achieves performance competitive with state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. We evaluate BitNet b1 and BitNet b1.58 on a range of language modeling benchmarks.
Researcher Affiliation | Collaboration | Hongyu Wang and Ruiping Wang are with the Key Laboratory of AI Safety of the Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, and the University of Chinese Academy of Sciences, Beijing. Shuming Ma, Lingxiao Ma, Wenhui Wang, Li Dong, Shaohan Huang, Jilong Xue, and Furu Wei are with Microsoft Research. Lei Wang is with the University of Chinese Academy of Sciences. Huaijie Wang and Yi Wu are with the Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing.
Pseudocode | Yes | Furthermore, we also present the pseudocode of BitNet b1.58 and BitNet b1 in Appendices D and E, respectively. (Appendices D and E contain Python code snippets implementing BitLinear.)
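As context for the BitLinear pseudocode referenced above, the following is a minimal pure-Python sketch of the forward pass that BitNet b1.58 describes: absmean quantization of weights to the ternary set {-1, 0, 1} plus per-token 8-bit absmax quantization of activations, with the integer product rescaled afterwards. Function names and the `eps` guard are illustrative, not taken from the paper's appendix, and a real implementation would operate on tensors with a straight-through estimator for training.

```python
# Hedged sketch of a BitNet b1.58-style BitLinear forward pass.
# Names (quantize_weights_ternary, bitlinear_forward, eps) are illustrative.

def quantize_weights_ternary(W, eps=1e-5):
    """Absmean quantization: scale by the mean |w|, round, clip to {-1, 0, 1}."""
    beta = sum(abs(w) for row in W for w in row) / (len(W) * len(W[0])) + eps
    Wq = [[max(-1, min(1, round(w / beta))) for w in row] for row in W]
    return Wq, beta

def quantize_activations_int8(x, eps=1e-5):
    """Per-token absmax quantization to the signed 8-bit range [-127, 127]."""
    gamma = max(abs(v) for v in x) + eps
    xq = [max(-127, min(127, round(v * 127.0 / gamma))) for v in x]
    return xq, gamma

def bitlinear_forward(x, W):
    """Integer matmul of quantized inputs, rescaled by beta * gamma / 127."""
    Wq, beta = quantize_weights_ternary(W)
    xq, gamma = quantize_activations_int8(x)
    y_int = [sum(xi * wi for xi, wi in zip(xq, row)) for row in Wq]
    return [y * beta * gamma / 127.0 for y in y_int]
```

Because the quantized weights are ternary, the inner matmul reduces to additions and subtractions, which is the source of the efficiency gains the paper reports.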
Open Source Code | No | The paper refers to external tools such as FasterTransformer and lm-evaluation-harness, but it neither states explicitly that the code for the described methodology is open-sourced nor links to a code repository. Appendices D and E contain illustrative PyTorch snippets, but these do not constitute concrete access to the full source code for the methodology.
Open Datasets | Yes | The models are trained on an English-language corpus consisting of the Pile dataset, Common Crawl snapshots, RealNews, and CC-Stories. We pre-trained the models on the RedPajama dataset (Computer, 2023) for 100 billion tokens. We test both the 0-shot and 4-shot results on four downstream tasks: Hellaswag (Zellers et al., 2019), Winogrande (Sakaguchi et al., 2020), Winograd (Levesque et al., 2012), and StoryCloze (Mostafazadeh et al., 2016).
Dataset Splits | No | The paper mentions various datasets used for training and evaluation (e.g., the Pile, RedPajama, Hellaswag, Winogrande), but it does not provide the training/validation/test splits (percentages, sample counts, or partitioning procedure) needed to reproduce the experiments. While 'validation perplexity' is mentioned, the split details are absent.
Hardware Specification | Yes | We compare the throughput of BitNet b1.58 and LLaMA LLM with 70B parameters on two 80GB A100 cards, using pipeline parallelism (Huang et al., 2019) so that LLaMA LLM 70B can run on the devices.
Software Dependencies | No | The paper mentions using nn.Linear in PyTorch and refers to the FasterTransformer and lm-evaluation-harness codebases, but it does not specify version numbers for PyTorch or any other software dependencies needed to replicate the experiments.
Experiment Setup | Yes | Table 10 presents the configuration and hyper-parameters for BitNet b1 at each model size in the scaling experiments. Dropout and gradient clipping are disabled during pre-training. For the 13B and 30B models, we set weight decay to 0.05 for training stability. ... We report the configurations and hyper-parameters for BitNet b1.58 in Tables 12 and 13, respectively. For all experiments, the sequence length is set to 2,048 tokens and the batch size to 512, resulting in up to 1M tokens per batch. The Adam β is set to (0.9, 0.95).
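The hyper-parameters quoted in this row can be collected into a single illustrative configuration. The dictionary below is an assumption-laden sketch, not the paper's actual config file: it records only the values stated in the text (values such as the learning rate, which the excerpt elides, are omitted), and the key names are invented for readability.

```python
# Illustrative BitNet pre-training configuration assembled from the
# hyper-parameters quoted above; key names are hypothetical.
config = {
    "sequence_length": 2048,        # tokens per sequence
    "batch_size": 512,              # sequences per batch
    "adam_betas": (0.9, 0.95),
    "dropout": 0.0,                 # disabled during pre-training
    "gradient_clipping": None,      # disabled during pre-training
    "weight_decay_13b_30b": 0.05,   # applied at 13B/30B for training stability
}

# 512 sequences x 2,048 tokens = 1,048,576 tokens, i.e. "up to 1M tokens" per batch.
tokens_per_batch = config["sequence_length"] * config["batch_size"]
```

The token count check makes the batch-size claim concrete: 512 × 2,048 is just over 2^20, matching the "up to 1M tokens" figure in the text.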