Bytes Are All You Need: Transformers Operating Directly On File Bytes
Authors: Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model, ByteFormer, improves ImageNet Top-1 classification accuracy by 5% (from 72.2% to 77.33%) relative to DeiT models of similar size. Compared to Perceiver IO, our model requires absolutely no modality-specific processing at inference time, and uses an order of magnitude fewer parameters at equivalent accuracy on ImageNet. We demonstrate that the same ByteFormer architecture can perform audio classification without modifications or modality-specific preprocessing. We achieve 95.42% classification accuracy on the Speech Commands V2 dataset (comparable to the state-of-the-art accuracy of 98.7%). |
| Researcher Affiliation | Collaboration | Maxwell Horton (Apple), Sachin Mehta (Apple), Ali Farhadi (Allen Institute for Artificial Intelligence), Mohammad Rastegari (Apple) |
| Pseudocode | No | The paper describes the ByteFormer architecture and its components (byte embeddings, strided 1D convolution, positional embeddings, shifted window attention, token downsampling layers) in prose within Section 3, 'Our Architecture and Implementation', and Section 3.1 'ByteFormer', without providing any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our code at https://github.com/apple/corenet/tree/main/projects/byteformer. |
| Open Datasets | Yes | We demonstrate the efficacy of ByteFormer on ImageNet (Deng et al., 2009) classification... We achieve 95.42% classification accuracy on the Speech Commands V2 dataset (Warden, 2018). |
| Dataset Splits | Yes | For example, for TIFF experiments on ImageNet, we precompute 224×224 crops of the validation images and save them in the TIFF format. Similarly, for audio classification, we re-encode the audio clips in Speech Commands V2 into the desired format before validation. |
| Hardware Specification | Yes | For ImageNet, we use a batch size of 48 on a single machine equipped with 8 NVIDIA A100 GPUs... We train our models on 4 NVIDIA A100 GPU machines... Im/s: Throughput (images/sec) on an A100 80GB NVIDIA GPU. |
| Software Dependencies | No | The paper mentions software packages like PIL (Clark, 2015), scipy (Virtanen et al., 2020), CVNets (Mehta et al., 2022), and pydub (Robert et al., 2018), but it does not specify version numbers for these dependencies, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | For ImageNet, we use a batch size of 48 on a single machine equipped with 8 NVIDIA A100 GPUs. At training time, we use random resized cropping, random horizontal flipping, RandAugment (Cubuk et al., 2019), and Random Erase (Zhong et al., 2017)... We train with AdamW (Loshchilov & Hutter, 2017) with weight decay 0.05, and a cosine learning rate schedule that anneals the learning rate from 0.001 to 0.00002, with 7500 warmup iterations. |
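The Pseudocode row notes that the byte-embedding and strided-convolution front end is described only in prose. A minimal NumPy sketch of what that front end does (the embedding width, kernel size, stride, and mask-token id below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

VOCAB = 257   # 256 byte values plus one mask token (id 256 is an assumption)
DIM = 192     # embedding width; illustrative, not necessarily the paper's
KERNEL = 8    # conv kernel size; illustrative
STRIDE = 8    # conv stride; illustrative

rng = np.random.default_rng(0)
byte_emb = rng.normal(size=(VOCAB, DIM))      # learned byte embeddings
conv_w = rng.normal(size=(KERNEL, DIM, DIM))  # strided 1D conv weights

def embed_and_downsample(byte_seq):
    """Map raw file bytes to a shorter token sequence.

    Each byte indexes a learned embedding; a strided 1D convolution then
    shrinks the sequence length by roughly a factor of STRIDE before the
    tokens enter the transformer backbone.
    """
    x = byte_emb[np.asarray(byte_seq)]              # (L, DIM)
    n_out = (len(byte_seq) - KERNEL) // STRIDE + 1  # output token count
    out = np.empty((n_out, DIM))
    for i in range(n_out):
        window = x[i * STRIDE : i * STRIDE + KERNEL]  # (KERNEL, DIM)
        out[i] = np.einsum("kd,kde->e", window, conv_w)
    return out
```

For a 256-byte input with these settings, the output is a (32, 192) token array; the remaining components (positional embeddings, shifted-window attention, downsampling layers) are applied downstream and are omitted here.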
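The Dataset Splits row quotes the paper's validation protocol of precomputing 224×224 crops and re-encoding them as TIFF. A sketch of that step with Pillow (the function name and center-crop choice are assumptions for illustration; the released code may differ):

```python
from io import BytesIO
from PIL import Image

def to_tiff_crop(img, size=224):
    """Center-crop an image to size x size and re-encode it as TIFF bytes.

    ByteFormer consumes the resulting file bytes directly, so validation
    images must be fixed to one size and one encoding ahead of time.
    """
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    crop = img.crop((left, top, left + size, top + size))
    buf = BytesIO()
    crop.save(buf, format="TIFF")
    return buf.getvalue()
```

Re-encoding happens before validation rather than inside the data loader, so the model sees exactly the byte stream a TIFF file on disk would contain.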
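The Experiment Setup row gives the cosine schedule's endpoints (0.001 to 0.00002) and warmup length (7500 iterations) but not the warmup shape. A sketch assuming linear warmup from zero, which is a common default but an assumption here:

```python
import math

def lr_at(step, total_steps, max_lr=0.001, min_lr=0.00002, warmup=7500):
    """Cosine learning-rate schedule with linear warmup.

    Ramps linearly to max_lr over `warmup` iterations (warmup shape is an
    assumption; the paper only states the endpoints and warmup length),
    then cosine-anneals down to min_lr at `total_steps`.
    """
    if step < warmup:
        return max_lr * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)  # progress in [0, 1]
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

The schedule peaks at 0.001 exactly when warmup ends and reaches 0.00002 at the final iteration.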