Beyond Sequence: Impact of Geometric Context for RNA Property Prediction

Authors: Junjie Xu, Artem Moskalev, Tommaso Mansi, Mangal Prakash, Rui Liao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This study presents the first systematic evaluation of incorporating explicit 2D and 3D geometric information into RNA property prediction, considering not only performance but also real-world challenges such as limited data availability, partial labeling, sequencing noise, and computational efficiency. To this end, we introduce a newly curated set of RNA datasets with enhanced 2D and 3D structural annotations, providing a resource for model evaluation on RNA data. Our findings reveal that models with explicit geometry encoding generally outperform sequence-based models, with an average prediction RMSE reduction of around 12% across diverse RNA tasks, and excel in low-data and partial-labeling regimes, underscoring the value of explicitly incorporating geometric context.
Researcher Affiliation | Collaboration | Junjie Xu (1,2), Artem Moskalev (1), Tommaso Mansi (1), Mangal Prakash (1), Rui Liao (1). (1) Johnson & Johnson Innovative Medicine, (2) The Pennsylvania State University.
Pseudocode | No | The paper describes model architectures and training procedures but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | All datasets and code will be released upon acceptance.
Open Datasets | Yes | The datasets vary in size based on the number of sequences and sequence lengths: the small dataset Tc-Riboswitches (Groher et al., 2018), the medium datasets OpenVaccine COVID-19 (Wayment-Steele et al., 2022b) and Ribonanza-2k (He et al., 2024), and the large dataset Fungal (Wint et al., 2022).
Dataset Splits | Yes | We ran all models for 5 random data splits (train:val:test split of 70:15:15) and we report average performance with a standard deviation across splits.
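The splitting protocol quoted above (5 random 70:15:15 partitions, with mean and standard deviation reported across them) can be sketched as follows. This is a minimal illustration, not code from the paper; `random_splits` is a hypothetical helper name.

```python
import random

def random_splits(n_items, n_splits=5, ratios=(0.70, 0.15, 0.15), seed=0):
    """Generate `n_splits` random train/val/test index partitions.

    Each partition shuffles all indices and cuts them at the 70% and 85%
    marks, mirroring the paper's 70:15:15 protocol (hypothetical sketch).
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        idx = list(range(n_items))
        rng.shuffle(idx)
        n_train = int(ratios[0] * n_items)
        n_val = int(ratios[1] * n_items)
        train = idx[:n_train]
        val = idx[n_train:n_train + n_val]
        test = idx[n_train + n_val:]
        splits.append((train, val, test))
    return splits
```

Each model would then be trained once per partition, and the test metric averaged across the five runs to obtain the reported mean and standard deviation.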
Hardware Specification | Yes | All models were trained on an NVIDIA A100 GPU.
Software Dependencies | No | Most baseline implementations were sourced from PyTorch Geometric (Fey & Lenssen, 2019). The Transformer1D model was adapted to Transformer1D2D as detailed in the paper. For EGNN, we utilized the authors' implementation (Satorras et al., 2021), and for SchNet, the implementation from Joshi et al. (2023) was used. However, specific version numbers for these software dependencies are not provided.
Experiment Setup | Yes | To ensure hyperparameter parity for each baseline, hyperparameters were optimized using Optuna (Akiba et al., 2019), restricting the search to models with fewer than 10 million parameters that fit within the GPU memory constraint of 80GB. All model hyperparameters, training, and evaluation details are reported in Appendix H. We ran all models for 5 random data splits (train:val:test split of 70:15:15), and we report average performance with a standard deviation across splits.
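The 10-million-parameter cap on the hyperparameter search can be illustrated with a small sketch. `count_params` and `within_budget` are hypothetical helpers for a simple fully-connected model family, not the authors' code; inside an Optuna objective, a configuration failing this gate would typically be discarded (e.g. by raising `optuna.TrialPruned`).

```python
def count_params(layer_sizes):
    """Crude parameter count for an MLP: weights plus biases per layer.

    `layer_sizes` lists the width of each layer, e.g. [64, 128, 64].
    (Hypothetical model family used only to illustrate the budget check.)
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total

def within_budget(layer_sizes, max_params=10_000_000):
    """Gate that rejects configurations over the 10M-parameter cap.

    In an Optuna objective one would call this on the sampled architecture
    before training, pruning the trial when it returns False.
    """
    return count_params(layer_sizes) <= max_params
```

Checking the budget before training keeps the search comparable across baselines and avoids wasting GPU time on configurations that would not fit the stated constraint.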