Beyond Sequence: Impact of Geometric Context for RNA Property Prediction
Authors: Junjie Xu, Artem Moskalev, Tommaso Mansi, Mangal Prakash, Rui Liao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study presents the first systematic evaluation of incorporating explicit 2D and 3D geometric information into RNA property prediction, considering not only performance but also real-world challenges such as limited data availability, partial labeling, sequencing noise, and computational efficiency. To this end, we introduce a newly curated set of RNA datasets with enhanced 2D and 3D structural annotations, providing a resource for model evaluation on RNA data. Our findings reveal that models with explicit geometry encoding generally outperform sequence-based models, with an average prediction RMSE reduction of around 12% across diverse RNA tasks, and excel in low-data and partial-labeling regimes, underscoring the value of explicitly incorporating geometric context. |
| Researcher Affiliation | Collaboration | Junjie Xu (1,2), Artem Moskalev (1), Tommaso Mansi (1), Mangal Prakash (1), Rui Liao (1) — (1) Johnson & Johnson Innovative Medicine, (2) The Pennsylvania State University |
| Pseudocode | No | The paper describes model architectures and training procedures but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | All datasets and code will be released upon acceptance. |
| Open Datasets | Yes | The datasets vary in size based on the number of sequences and sequence lengths: the small dataset Tc-Riboswitches (Groher et al., 2018), the medium datasets Open Vaccine COVID-19 (Wayment-Steele et al., 2022b) and Ribonanza-2k (He et al., 2024), and the large dataset Fungal (Wint et al., 2022). |
| Dataset Splits | Yes | We ran all models for 5 random data splits (train:val:test split of 70:15:15) and we report average performance with a standard deviation across splits. |
| Hardware Specification | Yes | All models were trained on an NVIDIA A100 GPU. |
| Software Dependencies | No | Most baseline implementations were sourced from PyTorch Geometric (Fey & Lenssen, 2019). The Transformer1D model was adapted to Transformer1D2D as detailed in the paper. For EGNN, we utilized the authors' implementation (Satorras et al., 2021), and for SchNet, the implementation from (Joshi et al., 2023) was used. However, specific version numbers for these software dependencies are not provided. |
| Experiment Setup | Yes | To ensure hyperparameter parity for each baseline, hyperparameters were optimized using Optuna (Akiba et al., 2019), restricting the search to models with fewer than 10 million parameters that fit within the GPU memory constraint of 80GB. All model hyperparameters, training, and evaluation details are reported in Appendix H. We ran all models for 5 random data splits (train:val:test split of 70:15:15) and we report average performance with a standard deviation across splits. |
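The evaluation protocol described above (five random 70:15:15 train/val/test splits, metrics averaged with a standard deviation across runs) can be sketched as follows. This is a minimal illustration of the splitting logic only; the function name and seed choice are assumptions, not taken from the paper's released code.

```python
import random

def split_indices(n, seed, ratios=(0.70, 0.15, 0.15)):
    """Shuffle dataset indices with a fixed seed and cut into
    train/val/test according to the given ratios (70:15:15 here)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

# One independent split per random seed; per-split test metrics would
# then be averaged, reporting mean and standard deviation across the 5 runs.
splits = [split_indices(1000, seed) for seed in range(5)]
```

Each seed produces a disjoint partition of all indices, so no example leaks between train, validation, and test within a run.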