White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
Authors: Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai, Yuexiang Zhai, Benjamin D. Haeffele, Yi Ma
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments to study the empirical performance of our proposed white-box transformer crate on real-world datasets and tasks. ... First, we verify convincingly that the white-box transformer architecture, crate, is practically effective and can achieve strong performance on many large-scale real-world datasets and tasks. These include supervised and self-supervised learning tasks on both vision and natural language data: ViT, MAE, DINO, BERT, and GPT. |
| Researcher Affiliation | Academia | Yaodong Yu (EMAIL), Sam Buchanan (EMAIL), Druv Pai (EMAIL), Tianzhe Chu (EMAIL), Ziyang Wu (EMAIL), Shengbang Tong (EMAIL), Hao Bai (EMAIL), Yuexiang Zhai (EMAIL), Benjamin D. Haeffele (EMAIL), Yi Ma (EMAIL, EMAIL). Affiliations: University of California, Berkeley; Toyota Technological Institute at Chicago; University of Illinois, Urbana-Champaign; Johns Hopkins University; University of Hong Kong |
| Pseudocode | Yes | Appendix D gives PyTorch-like pseudocode for our implementation of crate. D PyTorch Code for CRATE; D.1 PyTorch-Like Pseudocode for MSSA and ISTA Blocks; D.2 PyTorch-Like Pseudocode for CRATE Encoder; D.3 PyTorch-Like Pseudocode for CRATE Decoder; D.4 PyTorch-Like Pseudocode for CRATE Image Classifier |
| Open Source Code | Yes | Code is available at: https://ma-lab-berkeley.github.io/CRATE. |
| Open Datasets | Yes | We pre-train on ImageNet-1K (Deng et al., 2009), using the Lion optimizer (Chen et al., 2023c). For fine-tuning (possibly on a different dataset with a different number of classes), we re-initialize fhead using parameters with the appropriate value of C, and train on the cross-entropy loss in (68), updating both f and fhead. Again, we train using the Lion optimizer (Chen et al., 2023c), this time on a variety of commonly used datasets: CIFAR10/CIFAR100 (Krizhevsky et al., 2009), Oxford Flowers (Nilsback and Zisserman, 2008), and Oxford-IIIT-Pets (Parkhi et al., 2012). Masked autoencoding (He et al., 2022) masks out a large percentage of randomly selected input image tokens in the input X = [x1, . . . , xN] ∈ ℝ^(D×N) and then attempts to reconstruct the whole image, measuring success by the resulting autoencoding reconstruction loss and performance on downstream tasks. We use the same pretraining dataset as BERT (Devlin et al., 2019), including BooksCorpus (Zhu et al., 2015) and English Wikipedia. We pre-train crate-GPT models on OpenWebText (Gokaslan and Cohen, 2019) using the Adam optimizer (Kingma and Ba, 2015). We apply the MaskCut (Wang et al., 2023b) pipeline on COCO val2017 (Lin et al., 2014), which consists of 5,000 RGB images, and assess our model's performance on both object detection and instance segmentation tasks. |
| Dataset Splits | Yes | We pre-train on ImageNet-1K (Deng et al., 2009), using the Lion optimizer (Chen et al., 2023c). For fine-tuning (possibly on a different dataset with a different number of classes), we re-initialize fhead using parameters with the appropriate value of C, and train on the cross-entropy loss in (68), updating both f and fhead. Again, we train using the Lion optimizer (Chen et al., 2023c), this time on a variety of commonly used datasets: CIFAR10/CIFAR100 (Krizhevsky et al., 2009), Oxford Flowers (Nilsback and Zisserman, 2008), and Oxford-IIIT-Pets (Parkhi et al., 2012). We use the same pretraining dataset as BERT (Devlin et al., 2019), including BooksCorpus (Zhu et al., 2015) and English Wikipedia. For all the tasks in GLUE, we use a learning rate of 2 * 10^-5 without any hyperparameter sweep. ... We fine-tune for 8 epochs on MNLI, for 5 epochs on WNLI and MRPC (because these two datasets are tiny), and for 3 epochs on all other tasks. We use the test split as the validation set for all tasks. |
| Hardware Specification | Yes | One training epoch of crate-Base takes around 240 seconds using 16 A100 40GB GPUs. |
| Software Dependencies | No | We apply the Lion optimizer (Chen et al., 2023c) for pre-training both crate and ViT models. For each fine-tuning task, we employ the AdamW optimizer (Loshchilov and Hutter, 2019). We apply the AdamW optimizer (Loshchilov and Hutter, 2019) for pre-training both crate-MAE models on ImageNet-1K. We apply the optimization solver in scikit-learn, i.e., linear_model.LogisticRegression, to learn a logistic regression model with ℓ2 regularization, and use cross-validation to select the ℓ2 regularization parameter for each model-dataset pair. |
| Experiment Setup | Yes | We configure the learning rate as 2.4 * 10^-4, weight decay as 0.5, and batch size as 2,048. We incorporate a warm-up strategy with a linear increase over 5 epochs, followed by training the models for a total of 150 epochs with cosine decay. For data augmentation, we only apply the standard techniques, random cropping and random horizontal flipping, on the ImageNet-1K dataset. We apply label smoothing with smoothing parameter 0.1. For each fine-tuning task, we use the AdamW optimizer (Loshchilov and Hutter, 2019). We configure the learning rate as 5 * 10^-5, weight decay as 0.01, and batch size as 512. We use a batch size of 8,096 and train for 30,000 steps with the Adam optimizer (Kingma and Ba, 2015). For the Adam optimizer, we use (β1, β2) = (0.9, 0.98), and a weight decay of 0.01. For the learning rate scheduler, we apply linear warm-up and linear decay, reaching the peak learning rate η = 10^-3 at 1,800 steps. |
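The pre-training schedule quoted in the Experiment Setup row (linear warm-up over 5 epochs, then cosine decay over a 150-epoch run, peak learning rate 2.4 * 10^-4) can be sketched as a small helper. This is a reconstruction from the quoted hyperparameters only; the function name and the assumption that the cosine decays to zero are ours, not the paper's:

```python
import math

def lr_at_epoch(epoch, peak_lr=2.4e-4, warmup_epochs=5, total_epochs=150):
    """Linear warm-up to peak_lr, then cosine decay to zero (assumed floor)."""
    if epoch < warmup_epochs:
        # Linear ramp: reaches peak_lr at the end of the warm-up phase.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With these defaults the rate ramps up to 2.4e-4 by epoch 4 and is near zero by epoch 149; in practice the same shape is usually obtained via a framework scheduler rather than a hand-rolled function.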
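Label smoothing with parameter 0.1, as quoted in the setup, replaces each one-hot target with a mixture of the one-hot vector and the uniform distribution over the C classes. A minimal sketch (the helper name is hypothetical):

```python
def smooth_labels(one_hot, eps=0.1):
    """Mix a one-hot target with the uniform distribution over C classes."""
    C = len(one_hot)
    return [(1.0 - eps) * y + eps / C for y in one_hot]
```

For example, `smooth_labels([1, 0, 0, 0])` gives `[0.925, 0.025, 0.025, 0.025]`, which still sums to 1 but keeps the loss from driving logits to infinity on the true class.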
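The ISTA blocks listed in Appendix D implement iterations of the Iterative Shrinkage-Thresholding Algorithm for sparse coding. As background only, here is the textbook ISTA step for min_z 0.5‖x − Dz‖² + λ‖z‖₁ in NumPy; this generic update is an illustration of the algorithm family, not the paper's exact block (CRATE's version operates on token representations and uses a non-negative/ReLU variant):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_step(z, D, x, step, lam):
    """One ISTA iteration for min_z 0.5 * ||x - D z||^2 + lam * ||z||_1."""
    grad = D.T @ (D @ z - x)  # gradient of the smooth least-squares term
    return soft_threshold(z - step * grad, step * lam)
```

With D = I, step 1, and λ = 1, a single step from z = 0 shrinks each coordinate of x by λ and zeroes the small entries, which is the sparsification effect the block is meant to provide.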