FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid?
Authors: Shikhar Tuli, Bhishma Dedhia, Shreshth Tuli, Niraj K. Jha
JAIR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models. FlexiBERT-Mini, one of our proposed models, has 3% fewer parameters than BERT-Mini and achieves an 8.9% higher GLUE score. |
| Researcher Affiliation | Academia | Shikhar Tuli (EMAIL), Dept. of Electrical & Computer Engineering, Princeton University, Princeton, NJ 08544 USA; Bhishma Dedhia (EMAIL), Dept. of Electrical & Computer Engineering, Princeton University, Princeton, NJ 08544 USA; Shreshth Tuli (EMAIL), Department of Computing, Imperial College London, London, SW7 2AZ UK; Niraj K. Jha (EMAIL), Dept. of Electrical & Computer Engineering, Princeton University, Princeton, NJ 08544 USA |
| Pseudocode | Yes | Algorithm 1 summarizes the BOSHNAS workflow. Starting from an initial pre-trained set δ in the first level of the hierarchy G1, we run until convergence the following steps in a multi-worker compute cluster. |
| Open Source Code | Yes | All the code for the FlexiBERT pipeline is available at https://github.com/jha-lab/txf_design-space. The code for running BOSHNAS on any tabular dataset of deep learning architectures is available at https://github.com/jha-lab/boshnas. |
| Open Datasets | Yes | We pre-train our models with a combination of publicly available text corpora, viz. BookCorpus (BookC) (Zhu et al., 2015), Wikipedia English (Wiki), OpenWebText (OWT) (Gokaslan & Cohen, 2019), and CC-News (CCN) (Mackenzie et al., 2020). We borrow most training hyperparameters from RoBERTa. ... For pre-training, we add the C4 dataset (Raffel et al., 2020) and train for 3,000,000 steps before fine-tuning. |
| Dataset Splits | Yes | Table 7 shows the best hyperparameters for fine-tuning of each GLUE (SuperGLUE) task selected using this auto-tuning technique. ... We report MNLI on the matched set. ... The FlexiBERT-Mini model only optimizes performance on the first eight tasks for a fair comparison with NAS-BERT. |
| Hardware Specification | Yes | All models were trained on NVIDIA A100 GPUs and 2.6 GHz AMD EPYC Rome processors. |
| Software Dependencies | No | We borrow most training hyperparameters from RoBERTa. We set the batch size to 256, learning rate warmed up over the first 10,000 steps to its peak value at 1×10⁻⁵ that then decays linearly, weight decay to 0.01, Adam scheduler's parameters β1 = 0.9, β2 = 0.98 (shown to improve stability; Liu et al., 2019), ϵ = 1×10⁻⁶, and run pre-training for 1,000,000 steps. |
| Experiment Setup | Yes | We set the batch size to 256, learning rate warmed up over the first 10,000 steps to its peak value at 1×10⁻⁵ that then decays linearly, weight decay to 0.01, Adam scheduler's parameters β1 = 0.9, β2 = 0.98 (shown to improve stability; Liu et al., 2019), ϵ = 1×10⁻⁶, and run pre-training for 1,000,000 steps. Once we find the best models, we pre-train and fine-tune the selected models with a larger compute budget. For pre-training, we add the C4 dataset (Raffel et al., 2020) and train for 3,000,000 steps before fine-tuning. We also fine-tune on each GLUE task for 10 epochs instead of 5 (further details given below). ... While running BOSHNAS, we fine-tune our models on the nine GLUE tasks over five epochs and a batch size of 64, where we implement early stopping. We also run automatic hyperparameter tuning for the fine-tuning process using the Tree-structured Parzen Estimator algorithm (Akiba et al., 2019). The learning rate is randomly selected logarithmically in the [2×10⁻⁵, 5×10⁻⁴] range, and the batch size in {32, 64, 128} uniformly. Table 7 (Table 8) shows the best hyperparameters for fine-tuning of each GLUE (SuperGLUE) task selected using this auto-tuning technique. |
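As a rough aid to reproduction, the quoted pre-training schedule and fine-tuning search space can be sketched in plain Python. This is a minimal sketch, not the authors' code: the function names are ours, the assumption that the decayed learning rate reaches zero exactly at step 1,000,000 is not stated in the quote, and the paper's auto-tuning uses the Tree-structured Parzen Estimator (via Optuna), which this uniform sampler does not implement — it only reproduces the quoted search space.

```python
import math
import random

def lr_at_step(step, peak_lr=1e-5, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr over the first warmup_steps, then linear decay.

    Decaying to exactly 0 at total_steps is an assumption; the quote only
    says the learning rate 'then decays linearly'.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

def sample_finetune_config(rng=random):
    """Draw one point from the quoted fine-tuning search space.

    Learning rate: log-uniform in [2e-5, 5e-4]; batch size: uniform over
    {32, 64, 128}. A TPE sampler would bias draws toward promising regions;
    this plain uniform draw does not.
    """
    lr = math.exp(rng.uniform(math.log(2e-5), math.log(5e-4)))
    batch_size = rng.choice([32, 64, 128])
    return lr, batch_size
```

For example, `lr_at_step(10_000)` returns the 1×10⁻⁵ peak, and halfway through decay the rate has fallen proportionally.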