AttentionSmithy: A Modular Framework for Rapid Transformer Development
Authors: Caleb Cranney, Jesse G Meyer
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate AttentionSmithy by replicating the original Attention Is All You Need transformer under resource constraints, demonstrating robust performance on a machine translation task. Leveraging the package's integrated NAS capability, we identified an optimized model configuration that outperformed our baseline, demonstrating the framework's effectiveness for automated architecture search and model improvement. We further illustrate AttentionSmithy's adaptability through gene-specific modeling, where a variant of a BERT-style architecture achieves over 95% accuracy on downstream cell type classification tasks using ranked transcriptomic data. These case studies underscore AttentionSmithy's core advantage: enabling specialized experimentation across diverse application domains, from natural language processing to genomic analysis, by obviating the need for labor-intensive, low-level framework manipulation. |
| Researcher Affiliation | Academia | Caleb Cranney (EMAIL), Cedars-Sinai Medical Center, Department of Computational Biomedicine; Jesse G. Meyer (EMAIL), Cedars-Sinai Medical Center, Department of Computational Biomedicine |
| Pseudocode | Yes | A.2 Short Code Examples: The AttentionSmithy package provides modular components for constructing custom transformer architectures. It supports encoders, decoders (not shown in this sample), and full encoder-decoder stacks, with swappable attention mechanisms and experimental numeric embedding strategies. A.2.1 Assembling a Transformer Encoder, Listing 1 (Basic Encoder Assembly): `from attention_smithy.components import Encoder, EncoderLayer, MultiheadAttention, FeedForwardNetwork; from attention_smithy.attention import StandardAttentionMethod` |
| Open Source Code | Yes | The source code for AttentionSmithy is publicly available on GitHub (https://github.com/xomicsdatascience/AttentionSmithy). The code implementing machine translation is also available at https://github.com/xomicsdatascience/machine-translation and utilizes the WMT14 German-English dataset (Bojar et al., 2014) accessed through the Hugging Face datasets library. The code implementing Geneformer (Theodoris et al., 2023) is available at https://github.com/xomicsdatascience/geneformer, utilizing preprocessed data from the original Geneformer implementation's Hugging Face repository. All code repositories are released under the MIT license. |
| Open Datasets | Yes | We implemented the transformer architecture and training setup described in Vaswani et al. (2023), using the WMT 2014 English-German dataset, which consists of 4.51M sentence pairs (approximately 9.03M sentences total). Our primary training run was limited to a maximum context window of 100 tokens, mainly to reduce training time given the use of a single A100 GPU. While this constraint was partly driven by computational efficiency, it was also informed by the nature of the dataset: only 50,860 sentence pairs (approximately 1.1%) included at least one sentence longer than 100 tokens and were excluded from training. To confirm that this truncation had minimal effect on overall performance, we conducted an additional run using a 500-token context window (see Appendix A.3). |
| Dataset Splits | No | We implemented the transformer architecture and training setup described in Vaswani et al. (2023), using the WMT 2014 English-German dataset, which consists of 4.51M sentence pairs (approximately 9.03M sentences total). We used their published human_dcm_hcm_nf dataset for this task, which contains 579,159 cells representing 21 distinct cell types from cardiac tissue from 29 individuals. (Explanation: The paper mentions the datasets and total sizes but does not provide specific details on how these datasets were split into training, validation, or test sets, nor does it cite a source for predefined splits.) |
| Hardware Specification | Yes | Our primary training run was limited to a maximum context window of 100 tokens, mainly to reduce training time given the use of a single A100 GPU. |
| Software Dependencies | No | AttentionSmithy is implemented in Python using PyTorch (Ansel et al., 2024). To enhance usability and standardization, AttentionSmithy is designed to be compatible with PyTorch Lightning (Falcon, 2019), allowing researchers to easily incorporate training loops, distributed training, and other advanced features while maintaining clean, research-focused code. (Explanation: While PyTorch is mentioned with a reference indicating PyTorch 2, specific version numbers for other key software components like the Ax package or PyTorch Lightning are not provided.) |
| Experiment Setup | Yes | Models were trained for five epochs during the search to reduce time complexity. To demonstrate how domain experts can apply NAS to specialized applications using Attention Smithy, we extended the NAS workflow to the Geneformer task. The search space included three positional encoding strategies (sinusoidal, learned, and rotary), dropout rate, activation function, and attention mechanism. Additionally, nonstandard attention types included task-specific hyperparameters: Longformer (context window length), Linformer (projected key dimension), and Perceiver (number of latent encoder layers and latent space length). Each trial was run for six thousand steps with a batch size of 32. |
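The Listing 1 excerpt quoted in the Pseudocode row shows only import statements. Below is a minimal, self-contained sketch of the modular assembly pattern those imports suggest: the class names mirror the AttentionSmithy components, but the constructor signatures and toy internals are illustrative assumptions, not the package's actual API.

```python
# Sketch of dependency-injected encoder assembly: the attention method is the
# swappable piece, as in the "swappable attention mechanisms" the paper describes.
# All numerics here are toy stand-ins for real tensor operations.

class StandardAttentionMethod:
    """Toy stand-in: 'attends' by averaging all token vectors."""
    def __call__(self, tokens):
        dim = len(tokens[0])
        mean = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
        return [list(mean) for _ in tokens]

class MultiheadAttention:
    """Delegates to an injected attention method (the swappable component)."""
    def __init__(self, attention_method):
        self.attention_method = attention_method
    def __call__(self, tokens):
        return self.attention_method(tokens)

class FeedForwardNetwork:
    """Toy position-wise feed-forward: doubles each feature."""
    def __call__(self, tokens):
        return [[2 * x for x in tok] for tok in tokens]

class EncoderLayer:
    """One attention + feed-forward block."""
    def __init__(self, attention, feed_forward):
        self.attention = attention
        self.feed_forward = feed_forward
    def __call__(self, tokens):
        return self.feed_forward(self.attention(tokens))

class Encoder:
    """Stacks identical layers, as in a standard transformer encoder."""
    def __init__(self, layers):
        self.layers = layers
    def __call__(self, tokens):
        for layer in self.layers:
            tokens = layer(tokens)
        return tokens

# Assemble: swapping StandardAttentionMethod for another attention variant
# changes behavior without touching the Encoder or EncoderLayer code.
attention = MultiheadAttention(StandardAttentionMethod())
encoder = Encoder([EncoderLayer(attention, FeedForwardNetwork()) for _ in range(2)])
output = encoder([[1.0, 2.0], [3.0, 4.0]])
```

The design point is composition over inheritance: each layer receives its attention mechanism at construction time, which is what lets a NAS loop swap mechanisms per trial.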
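The Experiment Setup row describes a conditional NAS search space, where extra hyperparameters attach only to nonstandard attention types. A plain random-sampling sketch of that structure follows; the parameter ranges are illustrative assumptions, and the paper's actual search uses the Ax package rather than this loop.

```python
import random

# Global parameters searched for every trial (values are illustrative).
SEARCH_SPACE = {
    "positional_encoding": ["sinusoidal", "learned", "rotary"],
    "dropout": [0.0, 0.1, 0.2, 0.3],
    "activation": ["relu", "gelu"],
    "attention": ["standard", "longformer", "linformer", "perceiver"],
}

# Nonstandard attention types carry task-specific hyperparameters,
# mirroring the conditional structure described in the paper.
CONDITIONAL_PARAMS = {
    "longformer": {"context_window": [64, 128, 256]},
    "linformer": {"projected_key_dim": [32, 64, 128]},
    "perceiver": {"latent_layers": [1, 2, 4], "latent_length": [32, 64]},
}

def sample_trial(rng):
    """Draw one trial configuration, attaching conditional parameters
    only when the sampled attention type requires them."""
    config = {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
    for name, options in CONDITIONAL_PARAMS.get(config["attention"], {}).items():
        config[name] = rng.choice(options)
    return config

rng = random.Random(0)  # seeded for reproducible sampling
trials = [sample_trial(rng) for _ in range(5)]
```

Each sampled configuration would then be trained for a fixed budget (the paper uses 6,000 steps at batch size 32 per trial) before the search compares results.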