Optical Transformers
Authors: Maxwell Anderson, Shi-Yuan Ma, Tianyu Wang, Logan Wright, Peter McMahon
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate through a combination of simulations and experiments on prototype optical hardware the feasibility and potential energy benefits of running Transformer models on future optical accelerators that perform matrix-vector multiplication. We use simulations, with noise models validated by small-scale optical experiments, to show that optical accelerators for matrix-vector multiplication should be able to accurately run a typical Transformer architecture model for language processing. We demonstrated linear Transformer operations (the bulk of a Transformer's computation) running with sufficient accuracy on real optical hardware and in a matching simulation, despite errors and noise on hardware supporting fewer than 8 effective bits of precision. |
| Researcher Affiliation | Collaboration | Maxwell G. Anderson EMAIL Department of Applied and Engineering Physics, Cornell University; Shi-Yuan Ma Department of Applied and Engineering Physics, Cornell University; Tianyu Wang Department of Applied and Engineering Physics, Cornell University; Logan G. Wright Department of Applied and Engineering Physics, Cornell University, and NTT Physics & Informatics Laboratories, NTT Research; Peter L. McMahon EMAIL Department of Applied and Engineering Physics, Cornell University, and Kavli Institute at Cornell for Nanoscale Science, Cornell University |
| Pseudocode | No | The paper describes methods and equations for optical computing and Transformer models, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, figures, or sections formatted as structured algorithmic steps. |
| Open Source Code | No | The paper does not contain an explicit statement offering access to source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | For language modelling, we used the raw Wikitext-103 dataset (Merity et al., 2017). The optical Transformer models were pretrained on the Wikitext-103 (Merity et al., 2017) dataset and used the same tokenizer as GPT2 (Radford et al., 2019). |
| Dataset Splits | Yes | We evaluated the perplexity over the entire validation set, and ran the model with context length 1024 (the same as in training) and a 1024-token stride length. |
| Hardware Specification | Yes | Our setup is an SLM-based matrix-vector/vector-vector multiplier. The components we used are: Organic light-emitting diode (OLED) display (Google Pixel 2016); Reflective liquid-crystal modulator (1920-500-1100-HDMI, Meadowlark Optics); Half-wave plate (PH10ME-532, Thorlabs); Polarizing beam splitter (CCM1-PBS251, Thorlabs); Zoom lens for imaging onto SLM (Resolv4K, Navitar); Zoom lens and objective lens for imaging onto detector (1-81102, Navitar and XLFLUOR4x/340, Olympus); Band-pass filter (FF01-525/15-25, Semrock); Camera for detection (Prime 95B Scientific CMOS Camera, Teledyne Photometrics). For comparison, we also chart various digital systems (Reuther et al., 2020) in different performance regimes, and a hypothetical next-generation GPU that can use 10 fJ/MAC. For the larger models, MT-NLG-530B and FUTURE-4q, the optics-based approach would have 140× and 8500× energy advantages over the current state-of-the-art GPU (NVIDIA A100), respectively. |
| Software Dependencies | No | The paper mentions optimizers like AdamW (Loshchilov & Hutter, 2019) and RMSProp (Tieleman et al., 2012) and refers to the GPT2 tokenizer, but it does not specify concrete software dependencies with version numbers (e.g., library names like PyTorch or TensorFlow with their respective versions). |
| Experiment Setup | Yes | We created optical Transformer models with a GPT2-like (Radford et al., 2019) architecture that replaces the GELU (Hendrycks & Gimpel, 2016) activation with ReLU6... The models we simulated have 12 layers (consisting of multi-head attention and feed-forward blocks), operate on a context length of 1024 tokens, use 12 attention heads, and have embedding dimension d varying from 192 to 1536. The full details of the training technique, architecture, and hyperparameters are in Appendix A. Appendix A includes tables such as Table 2: Model configurations for optical Transformers, Table 3: Pretraining hyperparameters for optical Transformer models, Table 4: Quantization aware training hyperparameters for optical Transformer models, and Table 5: Hyperparameters for optical Transformer Quantization. |
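The Dataset Splits row states that perplexity was evaluated over the entire Wikitext-103 validation set with a context length of 1024 and a 1024-token stride. A minimal sketch of that strided evaluation protocol is below, using NumPy and a stand-in `logprob_fn` in place of the actual model (the function name and the toy uniform model are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def evaluate_perplexity(logprob_fn, tokens, context_len=1024, stride=1024):
    """Strided perplexity: slide a window of `context_len` tokens over the
    sequence, advancing by `stride` tokens, and average the negative
    log-likelihood of every predicted token. With stride == context_len
    (as in the paper's quoted setup), each token is predicted exactly once.

    `logprob_fn(window)` stands in for the model: for each position in the
    window it returns the log-probability assigned to the next token.
    """
    total_nll, total_count = 0.0, 0
    for start in range(0, len(tokens) - 1, stride):
        window = tokens[start:start + context_len + 1]
        if len(window) < 2:  # no next-token prediction possible
            break
        nlls = -logprob_fn(window)      # one NLL per predicted token
        total_nll += nlls.sum()
        total_count += len(nlls)
    return float(np.exp(total_nll / total_count))

# Toy "model": a uniform distribution over a 50,257-token vocabulary (the
# GPT2 vocabulary size), for which perplexity equals the vocabulary size.
VOCAB = 50257
uniform_logprobs = lambda window: np.full(len(window) - 1, -np.log(VOCAB))

tokens = np.random.randint(0, VOCAB, size=5000)
ppl = evaluate_perplexity(uniform_logprobs, tokens)
print(round(ppl))  # 50257 for the uniform model
```

In a real run, `logprob_fn` would query the trained optical Transformer; the uniform model is used here only so the expected perplexity is known in closed form.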