FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid?

Authors: Shikhar Tuli, Bhishma Dedhia, Shreshth Tuli, Niraj K. Jha

JAIR 2023

Reproducibility assessment (variable, result, and supporting LLM-extracted excerpt):
Research Type: Experimental
"A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models. FlexiBERT-Mini, one of our proposed models, has 3% fewer parameters than BERT-Mini and achieves an 8.9% higher GLUE score."
Researcher Affiliation: Academia
"Shikhar Tuli (EMAIL), Bhishma Dedhia (EMAIL), Dept. of Electrical & Computer Engineering, Princeton University, Princeton, NJ 08544, USA; Shreshth Tuli (EMAIL), Department of Computing, Imperial College London, London, SW7 2AZ, UK; Niraj K. Jha (EMAIL), Dept. of Electrical & Computer Engineering, Princeton University, Princeton, NJ 08544, USA"
Pseudocode: Yes
"Algorithm 1 summarizes the BOSHNAS workflow. Starting from an initial pre-trained set δ in the first level of the hierarchy G1, we run until convergence the following steps in a multi-worker compute cluster."
Open Source Code: Yes
"All the code for the FlexiBERT pipeline is available at https://github.com/jha-lab/txf_design-space. The code for running BOSHNAS on any tabular dataset of deep learning architectures is available at https://github.com/jha-lab/boshnas."
Open Datasets: Yes
"We pre-train our models with a combination of publicly available text corpora, viz. BookCorpus (BookC) (Zhu et al., 2015), Wikipedia English (Wiki), OpenWebText (OWT) (Gokaslan & Cohen, 2019), and CC-News (CCN) (Mackenzie et al., 2020). We borrow most training hyperparameters from RoBERTa. ... For pre-training, we add the C4 dataset (Raffel et al., 2020) and train for 3,000,000 steps before fine-tuning."
Dataset Splits: Yes
"Table 7 shows the best hyperparameters for fine-tuning of each GLUE (SuperGLUE) task selected using this auto-tuning technique. ... We report MNLI on the matched set. ... The FlexiBERT-Mini model only optimizes performance on the first eight tasks for a fair comparison with NAS-BERT."
Hardware Specification: Yes
"All models were trained on NVIDIA A100 GPUs and 2.6 GHz AMD EPYC Rome processors."
Software Dependencies: No
"We borrow most training hyperparameters from RoBERTa. We set the batch size to 256, learning rate warmed up over the first 10,000 steps to its peak value at 1×10⁻⁵ that then decays linearly, weight decay to 0.01, Adam scheduler's parameters β1 = 0.9, β2 = 0.98 (shown to improve stability; Liu et al., 2019), ϵ = 1×10⁻⁶, and run pre-training for 1,000,000 steps."
Experiment Setup: Yes
"We set the batch size to 256, learning rate warmed up over the first 10,000 steps to its peak value at 1×10⁻⁵ that then decays linearly, weight decay to 0.01, Adam scheduler's parameters β1 = 0.9, β2 = 0.98 (shown to improve stability; Liu et al., 2019), ϵ = 1×10⁻⁶, and run pre-training for 1,000,000 steps. Once we find the best models, we pre-train and fine-tune the selected models with a larger compute budget. For pre-training, we add the C4 dataset (Raffel et al., 2020) and train for 3,000,000 steps before fine-tuning. We also fine-tune on each GLUE task for 10 epochs instead of 5 (further details given below). ... While running BOSHNAS, we fine-tune our models on the nine GLUE tasks over five epochs and a batch size of 64, where we implement early stopping. We also run automatic hyperparameter tuning for the fine-tuning process using the Tree-structured Parzen Estimator algorithm (Akiba et al., 2019). The learning rate is randomly selected logarithmically in the [2×10⁻⁵, 5×10⁻⁴] range, and the batch size in {32, 64, 128} uniformly. Table 7 (Table 8) shows the best hyperparameters for fine-tuning of each GLUE (SuperGLUE) task selected using this auto-tuning technique."
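The BOSHNAS excerpt above only summarizes the search loop (seed with an initial evaluated set, then query architectures until convergence on a multi-worker cluster). As a rough illustration of that shape of surrogate-guided search, here is a toy, single-worker sketch; the integer design space, the nearest-neighbor "acquisition" heuristic, and the budget are stand-in assumptions, not the paper's actual heteroscedastic surrogate or second-order optimization.

```python
def boshnas_sketch(design_space, evaluate, budget=20):
    """Toy sketch of a BOSHNAS-style search loop (illustrative only).

    design_space: sorted list of integer architecture encodings (toy).
    evaluate: arch -> performance score (higher is better).
    """
    # Step 1: seed the search with a small, evenly spaced initial set
    # (the paper instead starts from a pre-trained set in level G1).
    step = max(1, len(design_space) // 3)
    seen = {a: evaluate(a) for a in design_space[::step]}
    # Step 2: query architectures until the evaluation budget is spent
    # (the paper runs these queries on a multi-worker compute cluster).
    while len(seen) < min(budget, len(design_space)):
        best = max(seen, key=seen.get)
        candidates = [a for a in design_space if a not in seen]
        # Acquisition stand-in: exploit the neighborhood of the current best.
        nxt = min(candidates, key=lambda a: abs(a - best))
        seen[nxt] = evaluate(nxt)
    return max(seen, key=seen.get), seen
```

On a toy 1-D objective this hill-climbs toward the optimum, which is the behavior the real surrogate-plus-acquisition machinery approximates far more sample-efficiently in the actual FlexiBERT design space.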
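The quoted pre-training schedule (learning rate warmed up over the first 10,000 steps to a peak of 1×10⁻⁵, then decaying linearly over 1,000,000 total steps) can be written as a step-to-learning-rate function. Decaying to exactly zero at the final step is an assumption; the excerpt does not state the decay endpoint.

```python
def lr_at_step(step, peak_lr=1e-5, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then linear decay.

    Assumption: decay reaches 0.0 at total_steps (endpoint not given
    in the quoted setup).
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from peak_lr down to 0.
    frac = (total_steps - step) / (total_steps - warmup_steps)
    return max(0.0, peak_lr * frac)
```

For example, `lr_at_step(5_000)` is half the peak rate and `lr_at_step(10_000)` is the peak itself.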
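The quoted fine-tuning search space (learning rate selected logarithmically in [2×10⁻⁵, 5×10⁻⁴], batch size uniformly from {32, 64, 128}) can be sketched as a sampler. The paper tunes with the Tree-structured Parzen Estimator (Akiba et al., 2019); the plain random draw below only illustrates the search space, not the TPE tuner itself.

```python
import math
import random

def sample_finetune_config(rng):
    """Draw one fine-tuning configuration from the quoted search space:
    log-uniform learning rate in [2e-5, 5e-4], batch size in {32, 64, 128}.
    """
    lo, hi = math.log(2e-5), math.log(5e-4)
    lr = math.exp(rng.uniform(lo, hi))  # log-uniform draw
    batch_size = rng.choice([32, 64, 128])
    return {"lr": lr, "batch_size": batch_size}
```

A log-uniform draw spends equal probability mass per order of magnitude, which is the usual reason learning rates are "randomly selected logarithmically" rather than uniformly.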