Transformers trained on proteins can learn to attend to Euclidean distance
Authors: Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte Deane
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To test our theory of how Transformers learn to measure distance, we designed a Transformer encoder which is truncated such that the output is the unnormalized attention matrix for a single head. A diagram of this model is shown in Figure 1. We computed the loss as the ℓ1 difference between the output matrix and the matrix A_{i,j} = exp(−(x_i − x_j)² / 200²). This corresponds to the pre-normalized softmax of the negative square of the relative distance between points, as predicted by our theory. The data consisted of 10,000 structures, each with five 3-dimensional points with coordinates randomly selected between 0 and 200. For all protein experiments we used the GO PDB dataset from DeepFRI (Gligorijević et al., 2021), which comprises 36K protein chains. As shown in Figure 4a, adding coordinates substantially improved the model, leading to a final training perplexity of 6.5 with coordinates vs. 11.9 without. We also investigated the difference in sequence recovery rates between the two models. The total sequence recovery rate was 23% for the non-coords model compared to 38% for the coords model. Figure 4b shows a breakdown of the recovery rates per amino acid type. Finally, we tested whether the pretrained protein model embeddings could improve accuracy on a downstream task. We trained models to predict protein molecular function Gene Ontology labels (Ashburner et al., 2000). We also tested whether the imbued structural information could provide benefit in zero-shot protein property prediction on the ProteinGym benchmark (Notin et al., 2023). We trained a contact prediction head on top of our model (Table 3) and found that the accuracy across all distances was nearly 100%. |
| Researcher Affiliation | Collaboration | Isaac Ellmen, Department of Statistics, University of Oxford; Constantin Schneider, now at Xyme; Matthew I. J. Raybould, Department of Statistics, University of Oxford; Charlotte M. Deane, Department of Statistics, University of Oxford |
| Pseudocode | No | The paper describes methods and models through textual descriptions and diagrams (e.g., Figure 1: "Overview of the simulated experiment model.") but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code to replicate the experiments is available on GitHub at https://github.com/oxpig/attending-to-distance. This includes code for models, data download/processing, model training, and evaluation. |
| Open Datasets | Yes | For all protein experiments we used the GO PDB dataset from DeepFRI (Gligorijević et al., 2021), which comprises 36K protein chains. We also tested whether the imbued structural information could provide benefit in zero-shot protein property prediction on the ProteinGym benchmark (Notin et al., 2023). |
| Dataset Splits | Yes | We clustered the data by 50% sequence identity using MMseqs2 (Steinegger & Söding, 2017) and randomly held out 1% of the clusters to use as a validation set. We train two models based on the pretrained models from the previous section... We train two versions (one with coordinates and one without) on the AlphaFold-DB SwissProt subset of 500k structures (Varadi et al., 2024). We retained 0.5% of the dataset for validation on the pretraining perplexity. |
| Hardware Specification | Yes | All models were trained on a single NVIDIA Quadro RTX 6000 with 24GB of memory. |
| Software Dependencies | No | The paper does not provide specific version numbers for any ancillary software dependencies (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | The models were trained with a batch size of 16 using the Adam optimizer with a peak learning rate of 4×10⁻⁴, which is reached after 4,000 warmup steps and then quadratically decayed. We trained each model for 100 epochs with a fixed batch size of 24, resulting in approximately 150K updates. We used the Adam optimizer with 4,000 warmup steps to a peak learning rate of 2.3×10⁻⁴, followed by inverse square decay. Each model (coords/no coords) is finetuned for 20 epochs with a constant learning rate of 3×10⁻⁵. We trained each model for 40 epochs with a fixed batch size of 16. We used the Adam optimizer with 4,000 warmup steps to a peak learning rate of 3.6×10⁻⁴, followed by inverse square decay. |
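The simulated-experiment target quoted in the Research Type row (the matrix A_{i,j} = exp(−(x_i − x_j)² / 200²) over five random 3-D points) is easy to sketch. The snippet below is a minimal NumPy illustration, not the authors' code; the function name `target_attention` and the seed are our own choices.

```python
import numpy as np

def target_attention(x, scale=200.0):
    """Unnormalized attention target: A[i, j] = exp(-||x_i - x_j||^2 / scale^2).

    x has shape (n_points, 3); the result is a symmetric (n, n) matrix with
    ones on the diagonal (zero distance) decaying toward zero with distance.
    """
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return np.exp(-d2 / scale**2)

# One synthetic "structure" as described in the table: five 3-D points
# with coordinates drawn uniformly from [0, 200].
rng = np.random.default_rng(0)
x = rng.uniform(0, 200, size=(5, 3))
A = target_attention(x)
```

The ℓ1 training loss would then be `np.abs(model_output - A).sum()` (or its mean), comparing the truncated encoder's pre-softmax attention matrix against `A`.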
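The Experiment Setup row describes the same schedule family throughout: linear warmup over 4,000 steps to a peak learning rate, followed by a power-law decay. A generic sketch, under the assumption that "inverse square decay" follows the common inverse-square-root Transformer convention (`decay_power=0.5`; a literal reading would be `decay_power=2`):

```python
def lr_at_step(step, peak_lr=2.3e-4, warmup=4000, decay_power=0.5):
    """Linear warmup to peak_lr over `warmup` steps, then power-law decay.

    decay_power=0.5 is the standard inverse-square-root schedule; the
    exponent is left as a parameter because the table's wording is ambiguous.
    """
    if step < warmup:
        return peak_lr * step / warmup          # linear ramp from 0 to peak
    return peak_lr * (warmup / step) ** decay_power  # decay after the peak
```

With `peak_lr=2.3e-4` this reproduces the second configuration quoted in the table; the other runs only change the peak value.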
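The Dataset Splits row holds out 1% of sequence-identity *clusters*, not 1% of chains, so no validation chain shares a 50%-identity cluster with any training chain. A hypothetical sketch of that split, assuming a precomputed `chain_to_cluster` mapping (e.g. parsed from MMseqs2 cluster output); all names here are illustrative:

```python
import random

def split_by_cluster(chain_to_cluster, holdout_frac=0.01, seed=0):
    """Hold out `holdout_frac` of clusters for validation.

    chain_to_cluster maps chain id -> cluster id. Splitting at the cluster
    level prevents near-duplicate sequences from leaking across the split.
    """
    clusters = sorted(set(chain_to_cluster.values()))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_val = max(1, round(holdout_frac * len(clusters)))
    val_clusters = set(clusters[:n_val])
    train = [c for c, cl in chain_to_cluster.items() if cl not in val_clusters]
    val = [c for c, cl in chain_to_cluster.items() if cl in val_clusters]
    return train, val
```

A random per-chain split would overstate generalization here, since two chains at >50% identity are effectively the same data point.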