AtomSurf: Surface Representation for Learning on Protein Structures
Authors: Vincent Mallet, Yangyang Miao, Souhaib Attaiki, Bruno Correia, Maks Ovsjanikov
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then perform a direct and fair comparison of the resulting method against alternative approaches within the Atom3D benchmark, highlighting the limitations of pure surface-based learning. Finally, we propose an integrated approach, which allows learned feature sharing between graphs and surface representations on the level of nodes and vertices across all layers. We demonstrate that the resulting architecture achieves state-of-the-art results on all tasks in the Atom3D benchmark, as well as more broadly on binding site identification and binding pocket classification. We start by validating our surface encoder on the RNA segmentation benchmark (Poulenard et al., 2019) for surface methods. We assess the impact of the proposed enhancements to DiffusionNet by showing the learning curves of the enhanced models on the RNA segmentation task (see Figure 2 and Appendix F.1 for a similar analysis on PSR). Moreover, we compare their performance to other recent surface encoders, DGCNN (Wang et al., 2019) and DeltaConv (Wiersma et al., 2022), and report results in Table 1. |
| Researcher Affiliation | Academia | 1 LIX, École Polytechnique, IPP Paris, Paris, France 2 Mines Paris, PSL Research University, CBIO, Paris, France 3 Institut Curie, PSL Research University, Paris, France 4 INSERM, U900, Paris, France 5 Institute of Bioengineering, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland |
| Pseudocode | No | The paper describes methods and architectures in prose and through diagrams (Figure 1), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code can be found online: github.com/Vincentx15/atomsurf |
| Open Datasets | Yes | We then perform a direct and fair comparison of the resulting method against alternative approaches within the Atom3D benchmark, highlighting the limitations of pure surface-based learning. We start by validating our surface encoder on the RNA segmentation benchmark (Poulenard et al., 2019) for surface methods. We start by evaluating our proposed approach on the task of ligand-binding preference prediction for protein binding sites, introduced in (Gainza et al., 2020). Pegoraro et al. (2024) introduced a dataset containing 235 antibody antigen complexes. In addition to this relatively small dataset, we validate our approach on the recently proposed, large-scale PINDER dataset (Kovtun et al., 2024). |
| Dataset Splits | Yes | The dataset comprises 87k, 31k, and 15k training, validation, and test examples, split based on a 30% sequence identity. This task includes 2864, 937, and 347 examples in each data split, and the splitting is performed based on a 30% sequence identity. The PSR data train, validation, and test splits hold 25.4k, 2.8k, and 16k systems respectively. Splits correspond to a time split, with more recent CASP competition belonging to the test split. |
| Hardware Specification | Yes | This work was performed using HPC resources from GENCI IDRIS (Grant 2023-AD010613356) and CITAS at EPFL. Training takes between a few hours and up to four days (for PIP, which is our largest dataset) on a standard setting: 4 CPU workers and a single GPU such as an NVIDIA V100. |
| Software Dependencies | No | All models rely on a PyTorch Geometric (Fey & Lenssen, 2019) implementation. We employ our modified version of DiffusionNet (by adapting the original implementation provided by the authors1) for each surface encoder and utilize GCN for the graph networks (using the implementation provided by PyTorch Geometric2). (No specific version numbers are provided for PyTorch Geometric, DiffusionNet, or GCN.) |
| Experiment Setup | Yes | Our networks were trained in accordance with the parameter counts of other methods, strictly adhering to their optimization protocols, including the number of epochs, learning rate, and batch size. For the surface methods, we consistently used 3 blocks, with 94, 90, and 96 channels for the PIP, MSP, and PSR tasks, respectively. For the bipartite methods, on the PIP task, we utilized 4 blocks with a width of 118; for MSP, 3 blocks with a width of 148; and for PSR, 4 blocks with a width of 160. On the binding site classification task, our architecture featured 6 blocks, each with a width of 128. |
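The per-task hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The dictionary names below are illustrative and do not come from the AtomSurf codebase; only the block counts and widths are taken from the paper.

```python
# Architecture hyperparameters as reported in the paper's experiment setup.
# Key and variable names here are illustrative, not from the official
# github.com/Vincentx15/atomsurf repository.

# Surface-only encoder: 3 blocks on every Atom3D task, varying width.
SURFACE_CONFIGS = {
    "PIP": {"blocks": 3, "width": 94},
    "MSP": {"blocks": 3, "width": 90},
    "PSR": {"blocks": 3, "width": 96},
}

# Bipartite (surface + graph) encoder: blocks and widths per task.
BIPARTITE_CONFIGS = {
    "PIP": {"blocks": 4, "width": 118},
    "MSP": {"blocks": 3, "width": 148},
    "PSR": {"blocks": 4, "width": 160},
}

# Binding site classification: 6 blocks, each of width 128.
BINDING_SITE_CONFIG = {"blocks": 6, "width": 128}
```

This kind of per-task table makes it easy to check that a reimplementation matches the reported parameter budgets before training.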