Bytes Are All You Need: Transformers Operating Directly On File Bytes
Authors: Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model, ByteFormer, improves ImageNet Top-1 classification accuracy by 5% (from 72.2% to 77.33%) relative to DeiT models of similar size. Compared to Perceiver IO, our model requires absolutely no modality-specific processing at inference time, and uses an order of magnitude fewer parameters at equivalent accuracy on ImageNet. We demonstrate that the same ByteFormer architecture can perform audio classification without modifications or modality-specific preprocessing. We achieve 95.42% classification accuracy on the Speech Commands V2 dataset (comparable to the state-of-the-art accuracy of 98.7%). |
| Researcher Affiliation | Collaboration | Maxwell Horton (Apple), Sachin Mehta (Apple), Ali Farhadi (Allen Institute for Artificial Intelligence), Mohammad Rastegari (Apple) |
| Pseudocode | No | The paper describes the ByteFormer architecture and its components (byte embeddings, strided 1D convolution, positional embeddings, shifted window attention, token downsampling layers) in prose within Section 3, 'Our Architecture and Implementation', and Section 3.1 'ByteFormer', without providing any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our code at https://github.com/apple/corenet/tree/main/projects/byteformer. |
| Open Datasets | Yes | We demonstrate the efficacy of ByteFormer on ImageNet (Deng et al., 2009) classification... We achieve 95.42% classification accuracy on the Speech Commands V2 dataset (Warden, 2018). |
| Dataset Splits | Yes | For example, for TIFF experiments on ImageNet, we precompute 224×224 crops of the validation images and save them in the TIFF format. Similarly, for audio classification, we re-encode the audio clips in Speech Commands V2 into the desired format before validation. |
| Hardware Specification | Yes | For ImageNet, we use a batch size of 48 on a single machine equipped with 8 NVIDIA A100 GPUs... We train our models on 4 NVIDIA A100 GPU machines... Im/s: Throughput (images/sec) on an A100 80GB NVIDIA GPU. |
| Software Dependencies | No | The paper mentions software packages like PIL (Clark, 2015), scipy (Virtanen et al., 2020), CVNets (Mehta et al., 2022), and pydub (Robert et al., 2018), but it does not specify version numbers for these dependencies, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | For ImageNet, we use a batch size of 48 on a single machine equipped with 8 NVIDIA A100 GPUs. At training time, we use random resized cropping, random horizontal flipping, RandAugment (Cubuk et al., 2019), and Random Erase (Zhong et al., 2017)... We train with AdamW (Loshchilov & Hutter, 2017) with weight decay 0.05, and a cosine learning rate schedule that anneals the learning rate from 0.001 to 0.00002, with 7500 warmup iterations. |
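The Pseudocode row notes that the byte-embedding and strided-convolution front end is described only in prose. A minimal NumPy sketch of what that front end does (the embedding width, kernel size, stride, and mask-token id below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

VOCAB = 257   # 256 byte values plus one mask token (id 256 is an assumption)
DIM = 192     # embedding width; illustrative, not necessarily the paper's
KERNEL = 8    # conv kernel size; illustrative
STRIDE = 8    # conv stride; illustrative

rng = np.random.default_rng(0)
byte_emb = rng.normal(size=(VOCAB, DIM))      # learned byte embeddings
conv_w = rng.normal(size=(KERNEL, DIM, DIM))  # strided 1D conv weights

def embed_and_downsample(byte_seq):
    """Map raw file bytes to a shorter token sequence.

    Each byte indexes a learned embedding; a strided 1D convolution then
    shrinks the sequence length by roughly a factor of STRIDE before the
    tokens enter the transformer backbone.
    """
    x = byte_emb[np.asarray(byte_seq)]              # (L, DIM)
    n_out = (len(byte_seq) - KERNEL) // STRIDE + 1  # output token count
    out = np.empty((n_out, DIM))
    for i in range(n_out):
        window = x[i * STRIDE : i * STRIDE + KERNEL]  # (KERNEL, DIM)
        out[i] = np.einsum("kd,kde->e", window, conv_w)
    return out
```

For a 256-byte input with these settings, the output is a (32, 192) token array; the remaining components (positional embeddings, shifted-window attention, downsampling layers) are applied downstream and are omitted here.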
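The Dataset Splits row quotes the paper's validation protocol of precomputing 224×224 crops and re-encoding them as TIFF. A sketch of that step with Pillow (the function name and center-crop choice are assumptions for illustration; the released code may differ):

```python
from io import BytesIO
from PIL import Image

def to_tiff_crop(img, size=224):
    """Center-crop an image to size x size and re-encode it as TIFF bytes.

    ByteFormer consumes the resulting file bytes directly, so validation
    images must be fixed to one size and one encoding ahead of time.
    """
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    crop = img.crop((left, top, left + size, top + size))
    buf = BytesIO()
    crop.save(buf, format="TIFF")
    return buf.getvalue()
```

Re-encoding happens before validation rather than inside the data loader, so the model sees exactly the byte stream a TIFF file on disk would contain.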
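The Experiment Setup row gives the cosine schedule's endpoints (0.001 to 0.00002) and warmup length (7500 iterations) but not the warmup shape. A sketch assuming linear warmup from zero, which is a common default but an assumption here:

```python
import math

def lr_at(step, total_steps, max_lr=0.001, min_lr=0.00002, warmup=7500):
    """Cosine learning-rate schedule with linear warmup.

    Ramps linearly to max_lr over `warmup` iterations (warmup shape is an
    assumption; the paper only states the endpoints and warmup length),
    then cosine-anneals down to min_lr at `total_steps`.
    """
    if step < warmup:
        return max_lr * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)  # progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

The schedule peaks at 0.001 exactly when warmup ends and reaches 0.00002 at the final iteration.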