Sorbet: A Neuromorphic Hardware-Compatible Transformer-Based Spiking Language Model

Authors: Kaiwen Tang, Zhanglu Yan, Weng-Fai Wong

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We validate Sorbet through extensive testing on the GLUE benchmark and a series of ablation studies, demonstrating its potential as an energy-efficient solution for language model inference. Our tests on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al.) demonstrate that Sorbet maintains stable performance while achieving energy savings of 27.16× compared to BERT and 3.16× compared to SpikeLM. To evaluate the contribution of our proposed components, we conducted a series of ablation experiments.
Researcher Affiliation Academia School of Computing, National University of Singapore, Singapore, Singapore. Correspondence to: Zhanglu Yan <EMAIL>.
Pseudocode Yes The complete BSPN algorithm is outlined in Algorithm 1, where y ⊙ X is denoted as [y_1 X_{1,:}, ..., y_d X_{d,:}]. Algorithm 1: Bit Shifting Power Norm (BSPN). The complete PTsoftmax computation is detailed in Algorithm 2. Algorithm 2: Power-of-two Softmax (PTsoftmax). This entire procedure is summarized in Algorithm 3. Algorithm 3: Multi-step distillation. We provide the detailed spike generation method we adopted in Algorithm 4, where w^l_{ij} is the weight of layer l from neuron i to neuron j, b^l_i is the bias of neuron i in layer l, s^l_i is the input spike train of neuron i, and T is the time window size. Algorithm 4: Average IF model.
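The quote names PTsoftmax (Algorithm 2) as a power-of-two replacement for the standard softmax. Below is a minimal NumPy sketch of that idea only, not the paper's exact Algorithm 2: it assumes the exponential is replaced by base-2 powers (a left shift for integer inputs) and that the denominator is rounded up to the nearest power of two so the final division also reduces to a right shift. The rounding rule is an assumption for illustration.

```python
import numpy as np

def pt_softmax(x):
    """Sketch of a power-of-two softmax.

    Replaces exp(x) with 2**x and rounds the normalizer up to a power
    of two, so on integer hardware both steps become bit shifts.
    """
    x = np.asarray(x, dtype=np.float64)
    pow2 = 2.0 ** (x - x.max())                   # 2^(x - max): stable, shift-friendly
    denom = 2.0 ** np.ceil(np.log2(pow2.sum()))   # round normalizer up to 2^k
    return pow2 / denom                           # division by 2^k == right shift by k
```

The outputs are nonnegative, preserve the ordering of the inputs, and sum to at most 1; the price of the power-of-two normalizer is that the distribution is no longer exactly normalized.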
Open Source Code Yes Our code is publicly available at https://github.com/KaiwenTang/Sorbet
Open Datasets Yes Our tests on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al.) demonstrate that Sorbet maintains stable performance while achieving energy savings of 27.16× compared to BERT and 3.16× compared to SpikeLM. We evaluate our Sorbet on 7 distinct datasets in the GLUE benchmark as follows: MNLI: The MNLI (Multi-Genre Natural Language Inference Corpus)... QQP: The QQP (Quora Question Pairs)... QNLI: The QNLI (Question-answering Natural Language Inference)... SST-2: The SST-2 (Stanford Sentiment Treebank)... STS-B: The STS-B (Semantic Textual Similarity Benchmark)... RTE: The RTE (Recognizing Textual Entailment datasets)... MRPC: The MRPC (Microsoft Research Paraphrase Corpus)...
Dataset Splits Yes We evaluate our Sorbet on 7 distinct datasets in the GLUE benchmark as follows:
- MNLI: The MNLI (Multi-Genre Natural Language Inference Corpus) is involved in natural language inference tasks. It consists of a collection of sentence pairs annotated for textual entailment through crowdsourcing.
- QQP: The QQP (Quora Question Pairs) pertains to tasks involving similarity and paraphrase identification, focusing on pairs of questions from the community Q&A website, Quora. The primary objective of this task is to ascertain whether a pair of questions are semantically equivalent.
- QNLI: The QNLI (Question-answering Natural Language Inference) is a task in natural language inference. QNLI is derived from another dataset, the Stanford Question Answering Dataset (SQuAD 1.0), which is a question-paragraph pair question-answering dataset where the paragraphs are sourced from Wikipedia.
- SST-2: The SST-2 (Stanford Sentiment Treebank) is a single-sentence classification task that involves sentences from movie reviews and their sentiment annotations by humans. This task requires classifying the sentiment of a given sentence into positive and negative sentiment.
- STS-B: The STS-B (Semantic Textual Similarity Benchmark) comprises a collection of sentence pairs extracted from sources such as news headlines, video titles, image captions, and natural language inference data. It is a regression task.
- RTE: The RTE (Recognizing Textual Entailment) datasets are from natural language inference tasks. It consolidates datasets from a series of annual textual entailment challenges, with data samples constructed from news sources and Wikipedia.
- MRPC: The MRPC (Microsoft Research Paraphrase Corpus) is involved in similarity and paraphrase tasks. It consists of sentence pairs automatically extracted from online news sources, with human annotations to determine if the sentences are semantically equivalent.
The categories are not balanced, with 68% of the samples being positive instances.
Hardware Specification Yes All experiments were conducted on three NVIDIA A100 GPUs, each with 80 GB of memory.
Software Dependencies No Although we do not have access to physical neuromorphic chips, to demonstrate the neuromorphic hardware compatibility of our proposed model, we have implemented and validated the PTsoftmax and BSPN layers using the Lava framework, targeting Intel's Loihi architecture. While these methods demonstrate substantial energy-efficiency gains, they cannot fully capture real hardware effects.
Experiment Setup Yes Our experiments use BERT-base as the initial teacher model for distillation. The number of timesteps used for all results in this section is 16. First, to enhance the energy efficiency of our model and enable the encoding of all activations into spike trains, we quantize all weights to 1 bit and activations to 4 bits. Inspired by (Liu et al., 2022), we adopt a hybrid training strategy that combines standard knowledge distillation with the distillation of intermediate activations. The overall loss function is L = L_logits + L_reps, where L_logits employs the Kullback-Leibler (KL) divergence to facilitate learning from the teacher model to the student model, while L_reps is used to accelerate convergence and improve transfer and generalization capabilities (Aguilar et al., 2020). Concretely,

    L_logits = KL(p, q),    L_reps = Σ_i ‖r_i^s − r_i^t‖²    (16)

where p denotes the output distribution of the teacher model, and q represents the output of the student model. r_i^s and r_i^t are the corresponding transformer block output activations from the student and teacher models, respectively.
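The hybrid loss L = L_logits + L_reps from Eq. (16) can be sketched directly. The following NumPy version is an illustration only: `softmax` is a hypothetical helper, and the unweighted sum of the two terms is taken from the quote, which does not specify any balancing coefficients or a distillation temperature.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, student_reps, teacher_reps):
    """Sketch of Eq. (16): L = KL(p, q) + sum_i ||r_i^s - r_i^t||^2.

    p is the teacher's output distribution, q the student's; the reps
    are paired transformer-block output activations.
    """
    p = softmax(np.asarray(teacher_logits, dtype=np.float64))
    q = softmax(np.asarray(student_logits, dtype=np.float64))
    l_logits = np.sum(p * (np.log(p) - np.log(q)))         # KL(p, q)
    l_reps = sum(np.sum((np.asarray(rs) - np.asarray(rt)) ** 2)
                 for rs, rt in zip(student_reps, teacher_reps))
    return l_logits + l_reps
```

When student and teacher agree exactly, both terms vanish and the loss is zero; any mismatch in either the output distribution or the intermediate activations contributes positively.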