Optical Transformers

Authors: Maxwell Anderson, Shi-Yuan Ma, Tianyu Wang, Logan Wright, Peter McMahon

TMLR 2024

Each row below gives a reproducibility variable, the assessed result, and the LLM response supporting that assessment.
Research Type: Experimental. In this paper, we investigate, through a combination of simulations and experiments on prototype optical hardware, the feasibility and potential energy benefits of running Transformer models on future optical accelerators that perform matrix-vector multiplication. We use simulations, with noise models validated by small-scale optical experiments, to show that optical accelerators for matrix-vector multiplication should be able to accurately run a typical Transformer architecture model for language processing. We demonstrated linear Transformer operations (the bulk of a Transformer's computation) running with sufficient accuracy on real optical hardware and in a matching simulation, despite errors and noise on hardware supporting fewer than 8 effective bits of precision.
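The claim above, that linear Transformer operations can run accurately despite quantization below 8 effective bits plus analog noise, can be illustrated with a minimal simulation. The quantization scheme, noise model, and all parameter values below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def noisy_optical_matvec(W, x, bits=6, noise_std=0.02, rng=None):
    """Simulate a matrix-vector multiply on low-precision, noisy
    hardware: quantize weights and inputs to `bits` bits, compute
    the product, then add Gaussian readout noise.

    This is a sketch under assumed models (uniform quantization,
    additive Gaussian noise), not the authors' experimental setup.
    """
    rng = rng or np.random.default_rng(0)
    levels = 2 ** bits - 1

    def quantize(a):
        scale = np.max(np.abs(a)) or 1.0
        return np.round(a / scale * levels) / levels * scale

    y = quantize(W) @ quantize(x)
    noise = rng.normal(0.0, noise_std * np.max(np.abs(y)), size=y.shape)
    return y + noise

# Compare the noisy low-precision result against an exact matvec.
rng = np.random.default_rng(42)
W = rng.normal(size=(64, 64))
x = rng.normal(size=64)
exact = W @ x
approx = noisy_optical_matvec(W, x, bits=6, noise_std=0.02, rng=rng)
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
```

Even at 6 bits with noticeable readout noise, the relative error of the product stays small, which is the kind of behavior the paper's validated noise simulations rely on.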
Researcher Affiliation: Collaboration. Maxwell G. Anderson (EMAIL), Department of Applied and Engineering Physics, Cornell University; Shi-Yuan Ma, Department of Applied and Engineering Physics, Cornell University; Tianyu Wang, Department of Applied and Engineering Physics, Cornell University; Logan G. Wright, Department of Applied and Engineering Physics, Cornell University and NTT Physics & Informatics Laboratories, NTT Research; Peter L. McMahon (EMAIL), Department of Applied and Engineering Physics, Cornell University and Kavli Institute at Cornell for Nanoscale Science, Cornell University
Pseudocode: No. The paper describes methods and equations for optical computing and Transformer models, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, figures, or sections formatted as structured algorithmic steps.
Open Source Code: No. The paper does not contain an explicit statement offering access to source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets: Yes. For language modelling, we used the raw Wikitext-103 dataset (Merity et al., 2017). The optical Transformer models were pretrained on the Wikitext-103 (Merity et al., 2017) dataset and used the same tokenizer as GPT2 (Radford et al., 2019).
Dataset Splits: Yes. We evaluated the perplexity over the entire validation set, and ran the model with context length 1024 (the same as in training) and a 1024-token stride length.
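The evaluation protocol quoted above (fixed context length, fixed stride; with a 1024-token stride the windows are non-overlapping) can be sketched as a generic strided perplexity loop. The `logprob_fn` stand-in below is an assumption for illustration, not the authors' model interface:

```python
import math

def strided_perplexity(token_ids, logprob_fn, context_len=1024, stride=1024):
    """Evaluate perplexity over a token sequence using windows of
    `context_len` tokens advanced by `stride` tokens, as in the
    paper's Wikitext-103 validation protocol.

    `logprob_fn(window)` stands in for the model: it returns the
    total log-probability assigned to the tokens in `window`.
    """
    total_logprob, total_tokens = 0.0, 0
    for start in range(0, len(token_ids), stride):
        window = token_ids[start:start + context_len]
        total_logprob += logprob_fn(window)
        total_tokens += len(window)
    return math.exp(-total_logprob / total_tokens)

# Sanity check with a uniform model over a 50257-token vocabulary
# (GPT2's vocabulary size): its perplexity should equal the
# vocabulary size regardless of the input.
vocab = 50257
uniform = lambda window: len(window) * math.log(1.0 / vocab)
ppl = strided_perplexity(list(range(4096)), uniform,
                         context_len=1024, stride=1024)
```

With stride equal to context length, every token is scored exactly once; a smaller stride would score tokens with more context at higher compute cost.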
Hardware Specification: Yes. Our setup is an SLM-based matrix-vector/vector-vector multiplier. The components we used are: organic light-emitting diode (OLED) display (Google Pixel 2016); reflective liquid-crystal modulator (1920-500-1100-HDMI, Meadowlark Optics); half-wave plate (PH10ME-532, Thorlabs); polarizing beam splitter (CCM1-PBS251, Thorlabs); zoom lens for imaging onto the SLM (Resolv4K, Navitar); zoom lens and objective lens for imaging onto the detector (1-81102, Navitar and XLFLUOR4x/340, Olympus); band-pass filter (FF01-525/15-25, Semrock); camera for detection (Prime 95B Scientific CMOS Camera, Teledyne Photometrics). For comparison, we also chart various digital systems (Reuther et al., 2020) in different performance regimes, and a hypothetical next-generation GPU that can use 10 fJ/MAC. For the larger models, MT-NLG-530B and FUTURE-4q, the optics-based approach would have 140× and 8,500× energy advantages over the current state-of-the-art GPU (NVIDIA A100), respectively.
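The energy-advantage figures quoted above are ratios of energy-per-MAC estimates. A back-of-envelope version of such a comparison can be sketched as follows; the 0.1 fJ/MAC optical figure and the MAC count are placeholder assumptions for illustration, not numbers from the paper (only the 10 fJ/MAC hypothetical-GPU figure appears in the text):

```python
def energy_advantage(macs, e_per_mac_gpu, e_per_mac_optical):
    """Ratio of total energy to run `macs` multiply-accumulates on a
    digital accelerator vs. an optical one, given per-MAC energies
    in joules. A simple model that ignores data movement and
    conversion overheads."""
    return (macs * e_per_mac_gpu) / (macs * e_per_mac_optical)

# Hypothetical next-generation GPU at 10 fJ/MAC vs. an assumed
# optical accelerator at 0.1 fJ/MAC, for an illustrative 1e12-MAC
# forward pass:
adv = energy_advantage(macs=1e12,
                       e_per_mac_gpu=10e-15,
                       e_per_mac_optical=0.1e-15)
```

Because the MAC count cancels in this simple model, the advantage is just the per-MAC energy ratio; the paper's large-model advantages grow with scale because optical per-MAC energy amortizes better at large vector sizes.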
Software Dependencies: No. The paper mentions optimizers like AdamW (Loshchilov & Hutter, 2019) and RMSProp (Tieleman et al., 2012) and refers to the GPT2 tokenizer, but it does not specify concrete software dependencies with version numbers (e.g., library names like PyTorch or TensorFlow with their respective versions).
Experiment Setup: Yes. We created optical Transformer models with a GPT2-like (Radford et al., 2019) architecture that replaces the GELU (Hendrycks & Gimpel, 2016) activation with ReLU6... The models we simulated have 12 layers (consisting of multi-head attention and feed-forward blocks), operate on a context length of 1024 tokens, use 12 attention heads, and have embedding dimension d varying from 192 to 1536. The full details of the training technique, architecture, and hyperparameters are in Appendix A, which includes Table 2 (model configurations for optical Transformers), Table 3 (pretraining hyperparameters), Table 4 (quantization-aware training hyperparameters), and Table 5 (quantization hyperparameters).
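The one architectural change named above, swapping GELU for the bounded ReLU6 activation, can be sketched in a minimal feed-forward block. The weight shapes (the usual 4× hidden expansion) and the tiny dimensions below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def relu6(x):
    """ReLU6: clips activations to [0, 6], bounding the dynamic
    range the analog hardware must represent."""
    return np.clip(x, 0.0, 6.0)

def feed_forward(x, W1, b1, W2, b2):
    """A GPT2-style feed-forward block with GELU replaced by ReLU6,
    as in the optical Transformer models described above."""
    return relu6(x @ W1 + b1) @ W2 + b2

# Tiny example with embedding dimension d = 8 (the simulated models
# use d from 192 to 1536).
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))                      # 4 token embeddings
W1, b1 = rng.normal(size=(d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(size=(4 * d, d)), np.zeros(d)
out = feed_forward(x, W1, b1, W2, b2)
```

A bounded activation is a natural choice for analog hardware: unlike GELU, ReLU6 guarantees the hidden activations fit a fixed representable range, which simplifies quantization-aware training.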