STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs

Authors: Peijie Dong, Lujun Li, Yuedong Zhong, DaYou Du, Ruibo Fan, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Yike Guo, Xiaowen Chu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on the LLaMA, OPT, and Mistral families. STBLLM achieves a perplexity of 11.07 at 0.55 bits per weight, outperforming BiLLM by 3 ...
Researcher Affiliation | Academia | 1 HKUST(GZ), 2 HKUST, 3 SYSU, 4 HIT(SZ)
Pseudocode | Yes | Algorithm 1: Framework of STBLLM; details of each function are shown in Algorithm 2 (STBLLM).
Open Source Code | Yes | Code is released at https://github.com/pprp/STBLLM.
Open Datasets | Yes | We measure the perplexity for language generation tasks on Wikitext2 (Merity et al., 2016), C4 (Raffel et al., 2020), and PTB (Marcus et al., 1993), and accuracy for zero-shot tasks including Winogrande (Sakaguchi et al., 2021), OBQA (Mihaylov et al., 2018), HellaSwag (Zellers et al., 2019), BoolQ (Clark et al., 2019), ARC (Clark et al., 2018), and RTE (Chakrabarty et al., 2021).
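For reference, perplexity on these language-modeling benchmarks is conventionally computed as the exponential of the average per-token negative log-likelihood. A minimal sketch of that definition (not the authors' evaluation script; the function name is illustrative):

```python
import math

def perplexity(segment_nlls, segment_token_counts):
    """Perplexity = exp(total negative log-likelihood / total tokens).

    segment_nlls: total NLL (in nats) of each evaluated text segment.
    segment_token_counts: number of scored tokens in each segment.
    """
    total_nll = sum(segment_nlls)
    total_tokens = sum(segment_token_counts)
    return math.exp(total_nll / total_tokens)
```

A model that assigns every token a probability of 1/4 over two tokens (total NLL of 2·ln 4 nats) yields a perplexity of 4.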
Dataset Splits | Yes | For perplexity evaluation in Tables 2 and 3, we employ the C4 dataset as the calibration dataset and report the perplexity on Wikitext2. We conduct experiments on LLaMA-1/2/3 (Touvron et al., 2023a;b), OPT (Zhang et al., 2022a), and Mistral (Jiang et al., 2023). ... We extend our experiments to 7 zero-shot datasets on LLaMA-1-13B, LLaMA-2-13B, and LLaMA-1-30B, each tested with Full Precision, BiLLM(6:8), BiLLM(4:8), STBLLM(6:8), and STBLLM(4:8) methods.
Hardware Specification | Yes | Most LLMs except 65B can be evaluated on a single NVIDIA A800 GPU. For the LLaMA-1-65B model, we employ four NVIDIA A800 GPUs for evaluation. It takes 1.8 hours for the post-training process of 7B models on an RTX 4090 GPU and 2.8 hours for 13B models on an A6000 GPU.
Software Dependencies | No | Our STBLLM utilizes the PyTorch (Paszke et al., 2019) and Huggingface (Wolf et al., 2019) libraries.
Experiment Setup | Yes | For a fair comparison, we set the same block size of 128. ... We compare the results of STBLLM with BiLLM under the same N:M settings. For more information on average bits under N:M settings, please refer to Table 1. ... We evaluate the perplexity of LLaMA-1-7B and LLaMA-2-7B with group sizes of 64, 128, 256, and 512. Generally, performance improves as the group size increases, but so do computational and storage demands; we choose a group size of 128 to balance performance and resource consumption.
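To make the N:M notation above concrete: an N:M setting (e.g., 4:8) retains N non-zero weights out of every M consecutive weights. Below is a minimal NumPy sketch of N:M structured binarization, assuming a simple magnitude-based importance rule; the paper's actual salience metric and residual binarization scheme differ, and `nm_structured_binarize` is a hypothetical helper, not the released implementation:

```python
import numpy as np

def nm_structured_binarize(w, n=4, m=8):
    """Illustrative N:M structured binarization: within every group of m
    consecutive weights, keep the n largest-magnitude entries, replacing
    each with sign(weight) * per-group scale, and prune the rest to zero.
    """
    w = np.asarray(w, dtype=np.float64)
    assert w.size % m == 0, "weight count must be divisible by m"
    groups = w.reshape(-1, m)
    out = np.zeros_like(groups)
    for i, g in enumerate(groups):
        keep = np.argsort(np.abs(g))[-n:]        # indices of the n largest magnitudes
        scale = np.abs(g[keep]).mean()           # per-group scaling factor
        out[i, keep] = np.sign(g[keep]) * scale  # binarized survivors: +/- scale
    return out.reshape(w.shape)
```

Each surviving weight then needs only a sign bit plus its group's shared scale and sparsity mask, which is how structured binarization pushes the average storage cost below 1 bit per weight.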