A Primal-Dual Framework for Transformers and Neural Networks
Authors: Tan Minh Nguyen, Tam Minh Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard Baraniuk, Stanley Osher
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification. |
| Researcher Affiliation | Academia | Tan M. Nguyen* (Department of Mathematics, University of California, Los Angeles, EMAIL); Tam Nguyen* (Department of ECE, Rice University, EMAIL); Nhat Ho (Department of Statistics & Data Sciences, University of Texas at Austin, EMAIL); Andrea L. Bertozzi (Department of Mathematics, University of California, Los Angeles, EMAIL); Richard G. Baraniuk** (Department of ECE, Rice University, EMAIL); Stanley J. Osher** (Department of Mathematics, University of California, Los Angeles, EMAIL) |
| Pseudocode | No | No clearly labeled pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Implementation available at https://github.com/thuml/Flowformer. |
| Open Datasets | Yes | We empirically demonstrate the advantages of our Attention-BN, Attention-SH, and their combination (Attention-BN+SH) over the baseline softmax attention on the UEA time-series classification benchmark (Bagnall et al., 2018), the Long Range Arena benchmark (Tay et al., 2021), and the image classification task on the ImageNet dataset (Deng et al., 2009; Russakovsky et al., 2015). |
| Dataset Splits | Yes | The ImageNet dataset (Deng et al., 2009; Russakovsky et al., 2015) consists of 1.28M training images and 50K validation images. |
| Hardware Specification | Yes | All of our experiments are conducted on a server with 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper refers to third-party implementations and their respective GitHub repositories but does not explicitly list the specific versions of programming languages or software libraries used in their own experimental setup. |
| Experiment Setup | Yes | In our experiments, we consider the constant β in Attention-BN/BN+SH and the different downsampling scales in Attention-SH/SH+BN as hyperparameters to fine-tune. All of our experiments are conducted on a server with 4 NVIDIA A100 GPUs. In all models, the number of heads is 8, whereas the model dimension and number of transformer layers are varied. For Attention-SH/SH+BN, we downsample keys and values by a factor of 2 after every two successive heads. |
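The key/value downsampling schedule described in the setup row (8 heads, halving the key/value length after every two successive heads) can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation: it assumes a single shared projection per head input (the paper's models use learned per-head projections), and it downsamples by simple strided subsampling, which is one plausible reading of "downsample by a factor of 2".

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(q, k, v):
    # Standard scaled dot-product attention for one head.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def multihead_downsampled(q, k, v, num_heads=8):
    # Hypothetical sketch of the Attention-SH schedule: keys and values
    # are subsampled by an extra factor of 2 after every two heads,
    # so heads 0-1 see the full sequence, heads 2-3 see half, etc.
    outputs, stride = [], 1
    for h in range(num_heads):
        if h > 0 and h % 2 == 0:
            stride *= 2
        outputs.append(attention_head(q, k[::stride], v[::stride]))
    return np.concatenate(outputs, axis=-1)
```

With 8 heads the last pair attends over a sequence shortened by a factor of 8, which is where the efficiency gain over plain softmax attention would come from.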