FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid?

Authors: Shikhar Tuli, Bhishma Dedhia, Shreshth Tuli, Niraj K. Jha

JAIR 2023

Reproducibility assessment (variable, result, and supporting LLM-extracted excerpt):
Research Type: Experimental
"A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models. FlexiBERT-Mini, one of our proposed models, has 3% fewer parameters than BERT-Mini and achieves an 8.9% higher GLUE score."
Researcher Affiliation: Academia
"Shikhar Tuli (EMAIL), Bhishma Dedhia (EMAIL), Dept. of Electrical & Computer Engineering, Princeton University, Princeton, NJ 08544, USA; Shreshth Tuli (EMAIL), Department of Computing, Imperial College London, London, SW7 2AZ, UK; Niraj K. Jha (EMAIL), Dept. of Electrical & Computer Engineering, Princeton University, Princeton, NJ 08544, USA"
Pseudocode: Yes
"Algorithm 1 summarizes the BOSHNAS workflow. Starting from an initial pre-trained set δ in the first level of the hierarchy G1, we run until convergence the following steps in a multi-worker compute cluster."
Open Source Code: Yes
"All the code for the FlexiBERT pipeline is available at https://github.com/jha-lab/txf_design-space. The code for running BOSHNAS on any tabular dataset of deep learning architectures is available at https://github.com/jha-lab/boshnas."
Open Datasets: Yes
"We pre-train our models with a combination of publicly available text corpora, viz. BookCorpus (BookC) (Zhu et al., 2015), Wikipedia English (Wiki), OpenWebText (OWT) (Gokaslan & Cohen, 2019), and CC-News (CCN) (Mackenzie et al., 2020). We borrow most training hyperparameters from RoBERTa. ... For pre-training, we add the C4 dataset (Raffel et al., 2020) and train for 3,000,000 steps before fine-tuning."
Dataset Splits: Yes
"Table 7 shows the best hyperparameters for fine-tuning of each GLUE (SuperGLUE) task selected using this auto-tuning technique. ... We report MNLI on the matched set. ... The FlexiBERT-Mini model only optimizes performance on the first eight tasks for a fair comparison with NAS-BERT."
Hardware Specification: Yes
"All models were trained on NVIDIA A100 GPUs and 2.6 GHz AMD EPYC Rome processors."
Software Dependencies: No
"We borrow most training hyperparameters from RoBERTa. We set the batch size to 256, learning rate warmed up over the first 10,000 steps to its peak value at 1×10⁻⁵ that then decays linearly, weight decay to 0.01, Adam scheduler's parameters β1 = 0.9, β2 = 0.98 (shown to improve stability; Liu et al., 2019), ϵ = 1×10⁻⁶, and run pre-training for 1,000,000 steps."
Experiment Setup: Yes
"We set the batch size to 256, learning rate warmed up over the first 10,000 steps to its peak value at 1×10⁻⁵ that then decays linearly, weight decay to 0.01, Adam scheduler's parameters β1 = 0.9, β2 = 0.98 (shown to improve stability; Liu et al., 2019), ϵ = 1×10⁻⁶, and run pre-training for 1,000,000 steps. Once we find the best models, we pre-train and fine-tune the selected models with a larger compute budget. For pre-training, we add the C4 dataset (Raffel et al., 2020) and train for 3,000,000 steps before fine-tuning. We also fine-tune on each GLUE task for 10 epochs instead of 5 (further details given below). ... While running BOSHNAS, we fine-tune our models on the nine GLUE tasks over five epochs and a batch size of 64, where we implement early stopping. We also run automatic hyperparameter tuning for the fine-tuning process using the Tree-structured Parzen Estimator algorithm (Akiba et al., 2019). The learning rate is randomly selected logarithmically in the [2×10⁻⁵, 5×10⁻⁴] range, and the batch size in {32, 64, 128} uniformly. Table 7 (Table 8) shows the best hyperparameters for fine-tuning of each GLUE (SuperGLUE) task selected using this auto-tuning technique."
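The BOSHNAS excerpt above only summarizes the search loop (seed with an initial evaluated set, then query architectures until convergence on a multi-worker cluster). As a rough illustration of that shape of surrogate-guided search, here is a toy, single-worker sketch; the integer design space, the nearest-neighbor "acquisition" heuristic, and the budget are stand-in assumptions, not the paper's actual heteroscedastic surrogate or second-order optimization.

```python
def boshnas_sketch(design_space, evaluate, budget=20):
    """Toy sketch of a BOSHNAS-style search loop (illustrative only).

    design_space: sorted list of integer architecture encodings (toy).
    evaluate: arch -> performance score (higher is better).
    """
    # Step 1: seed the search with a small, evenly spaced initial set
    # (the paper instead starts from a pre-trained set in level G1).
    step = max(1, len(design_space) // 3)
    seen = {a: evaluate(a) for a in design_space[::step]}
    # Step 2: query architectures until the evaluation budget is spent
    # (the paper runs these queries on a multi-worker compute cluster).
    while len(seen) < min(budget, len(design_space)):
        best = max(seen, key=seen.get)
        candidates = [a for a in design_space if a not in seen]
        # Acquisition stand-in: exploit the neighborhood of the current best.
        nxt = min(candidates, key=lambda a: abs(a - best))
        seen[nxt] = evaluate(nxt)
    return max(seen, key=seen.get), seen
```

On a toy 1-D objective this hill-climbs toward the optimum, which is the behavior the real surrogate-plus-acquisition machinery approximates far more sample-efficiently in the actual FlexiBERT design space.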
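The quoted pre-training schedule (learning rate warmed up over the first 10,000 steps to a peak of 1×10⁻⁵, then decaying linearly over 1,000,000 total steps) can be written as a step-to-learning-rate function. Decaying to exactly zero at the final step is an assumption; the excerpt does not state the decay endpoint.

```python
def lr_at_step(step, peak_lr=1e-5, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then linear decay.

    Assumption: decay reaches 0.0 at total_steps (endpoint not given
    in the quoted setup).
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from peak_lr down to 0.
    frac = (total_steps - step) / (total_steps - warmup_steps)
    return max(0.0, peak_lr * frac)
```

For example, `lr_at_step(5_000)` is half the peak rate and `lr_at_step(10_000)` is the peak itself.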
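The quoted fine-tuning search space (learning rate selected logarithmically in [2×10⁻⁵, 5×10⁻⁴], batch size uniformly from {32, 64, 128}) can be sketched as a sampler. The paper tunes with the Tree-structured Parzen Estimator (Akiba et al., 2019); the plain random draw below only illustrates the search space, not the TPE tuner itself.

```python
import math
import random

def sample_finetune_config(rng):
    """Draw one fine-tuning configuration from the quoted search space:
    log-uniform learning rate in [2e-5, 5e-4], batch size in {32, 64, 128}.
    """
    lo, hi = math.log(2e-5), math.log(5e-4)
    lr = math.exp(rng.uniform(lo, hi))  # log-uniform draw
    batch_size = rng.choice([32, 64, 128])
    return {"lr": lr, "batch_size": batch_size}
```

A log-uniform draw spends equal probability mass per order of magnitude, which is the usual reason learning rates are "randomly selected logarithmically" rather than uniformly.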