Beyond Sequence: Impact of Geometric Context for RNA Property Prediction

Authors: Junjie Xu, Artem Moskalev, Tommaso Mansi, Mangal Prakash, Rui Liao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This study presents the first systematic evaluation of incorporating explicit 2D and 3D geometric information into RNA property prediction, considering not only performance but also real-world challenges such as limited data availability, partial labeling, sequencing noise, and computational efficiency. To this end, we introduce a newly curated set of RNA datasets with enhanced 2D and 3D structural annotations, providing a resource for model evaluation on RNA data. Our findings reveal that models with explicit geometry encoding generally outperform sequence-based models, with an average prediction RMSE reduction of around 12% across diverse RNA tasks, and excel in low-data and partial-labeling regimes, underscoring the value of explicitly incorporating geometric context.
Researcher Affiliation | Collaboration | Junjie Xu (1,2), Artem Moskalev (1), Tommaso Mansi (1), Mangal Prakash (1), Rui Liao (1). (1) Johnson & Johnson Innovative Medicine, (2) The Pennsylvania State University.
Pseudocode | No | The paper describes model architectures and training procedures but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | All datasets and code will be released upon acceptance.
Open Datasets | Yes | The datasets vary in size based on the number of sequences and sequence lengths: the small dataset Tc-Riboswitches (Groher et al., 2018), the medium datasets OpenVaccine COVID-19 (Wayment-Steele et al., 2022b) and Ribonanza-2k (He et al., 2024), and the large dataset Fungal (Wint et al., 2022).
Dataset Splits | Yes | We ran all models for 5 random data splits (train:val:test split of 70:15:15) and we report average performance with a standard deviation across splits.
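The splitting protocol quoted above (5 random 70:15:15 partitions, with mean and standard deviation reported across them) can be sketched as follows. This is a minimal illustration, not code from the paper; `random_splits` is a hypothetical helper name.

```python
import random

def random_splits(n_items, n_splits=5, ratios=(0.70, 0.15, 0.15), seed=0):
    """Generate `n_splits` random train/val/test index partitions.

    Each partition shuffles all indices and cuts them at the 70% and 85%
    marks, mirroring the paper's 70:15:15 protocol (hypothetical sketch).
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        idx = list(range(n_items))
        rng.shuffle(idx)
        n_train = int(ratios[0] * n_items)
        n_val = int(ratios[1] * n_items)
        train = idx[:n_train]
        val = idx[n_train:n_train + n_val]
        test = idx[n_train + n_val:]
        splits.append((train, val, test))
    return splits
```

Each model would then be trained once per partition, and the test metric averaged across the five runs to obtain the reported mean and standard deviation.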
Hardware Specification | Yes | All models were trained on an NVIDIA A100 GPU.
Software Dependencies | No | Most baseline implementations were sourced from PyTorch Geometric (Fey & Lenssen, 2019). The Transformer1D model was adapted to Transformer1D2D as detailed in the paper. For EGNN, we utilized the authors' implementation (Satorras et al., 2021), and for SchNet, the implementation from Joshi et al. (2023) was used. However, specific version numbers for these software dependencies are not provided.
Experiment Setup | Yes | To ensure hyperparameter parity for each baseline, hyperparameters were optimized using Optuna (Akiba et al., 2019), restricting the search to models with fewer than 10 million parameters that fit within the GPU memory constraint of 80GB. All model hyperparameters, training, and evaluation details are reported in Appendix H. We ran all models for 5 random data splits (train:val:test split of 70:15:15), and we report average performance with a standard deviation across splits.
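The 10-million-parameter cap on the hyperparameter search can be illustrated with a small sketch. `count_params` and `within_budget` are hypothetical helpers for a simple fully-connected model family, not the authors' code; inside an Optuna objective, a configuration failing this gate would typically be discarded (e.g. by raising `optuna.TrialPruned`).

```python
def count_params(layer_sizes):
    """Crude parameter count for an MLP: weights plus biases per layer.

    `layer_sizes` lists the width of each layer, e.g. [64, 128, 64].
    (Hypothetical model family used only to illustrate the budget check.)
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total

def within_budget(layer_sizes, max_params=10_000_000):
    """Gate that rejects configurations over the 10M-parameter cap.

    In an Optuna objective one would call this on the sampled architecture
    before training, pruning the trial when it returns False.
    """
    return count_params(layer_sizes) <= max_params
```

Checking the budget before training keeps the search comparable across baselines and avoids wasting GPU time on configurations that would not fit the stated constraint.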