DEPfold: RNA Secondary Structure Prediction as Dependency Parsing.

Authors: Ke Wang, Shay B. Cohen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DEPfold on both within-family and cross-family RNA datasets, demonstrating significant performance improvements over existing methods. DEPfold shows strong performance in cross-family generalization when trained on data augmented by traditional energy-based models, outperforming existing methods on the bpRNA-new dataset.
Researcher Affiliation | Academia | Ke Wang, Shay B. Cohen, School of Informatics, The University of Edinburgh. EMAIL
Pseudocode | Yes | The pseudocode can be found in Appendix A. Appendix A: Pseudocode for RNA Secondary Structure to Dependency Structure. To clearly explain the algorithmic logic for converting RNA secondary structures into dependency structures, we present the following pseudocode. Algorithm 1 is the main program, which uses the GetPair function defined in Algorithm 2 to generate binary tree structures from stem and pseudoknot sequences. During its processing, the GetPair function relies on the IsConnect function defined in Algorithm 3 for decision-making. When handling unpaired structures, the algorithm employs the GetPairs function defined in Algorithm 4.
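The full conversion is specified by Algorithms 1-4 in the paper's appendix. As a rough illustrative sketch only (not the paper's actual algorithm), one can read base pairs off a dot-bracket string and form dependency arcs by treating the 5' base of each pair as the head of its 3' partner; the function names below are hypothetical:

```python
def pairs_from_dotbracket(db):
    """Parse a dot-bracket string into a sorted list of (open, close) base-pair indices."""
    stack, pairs = [], []
    for i, c in enumerate(db):
        if c == '(':
            stack.append(i)
        elif c == ')':
            pairs.append((stack.pop(), i))
    return sorted(pairs)

def pairs_to_arcs(pairs, n):
    """Toy dependency conversion: head[j] = i for each pair (i, j);
    unpaired bases attach to an artificial root, marked as -1."""
    head = [-1] * n
    for i, j in pairs:
        head[j] = i
    return head
```

For example, `pairs_to_arcs(pairs_from_dotbracket("((..))"), 6)` yields `[-1, -1, -1, -1, 1, 0]`: each closing base depends on its opening partner, while unpaired bases hang off the root. The paper's algorithms additionally handle stems, pseudoknots, and unpaired structures as distinct cases.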
Open Source Code | Yes | Our code is available at https://github.com/Vicky-0256/DEPfold.git.
Open Datasets | Yes | Dataset: We evaluate DEPfold on four widely-used RNA structure prediction benchmark datasets: RNAStrAlign (Tan et al., 2017) contains 37,149 structures from 8 RNA families. ... ArchiveII (Sloma & Mathews, 2016), comprising 3,975 structures from 10 RNA families, serves as a standard benchmark for classical RNA folding methods. ... bpRNA-1m (Singh et al., 2019) includes 102,318 structures from 2,588 RNA families. ... bpRNA-new (Kalvari et al., 2017), derived from Rfam 14.2, contains sequences from 1,500 novel RNA families and is used to assess cross-family generalization.
Dataset Splits | Yes | Table 1: Summary of datasets used in our experiments.

Dataset      Subset  #Seq.   Len. Range
RNAStrAlign  Train   28,969  30-1581
             Val      3,629  36-1693
             Test     2,810  57-1672
bpRNA-1m     TR0     10,814  33-498
             VL0      1,300  33-497
             TS0      1,305  22-499
Hardware Specification | Yes | All experiments were conducted on four NVIDIA A100-40GB GPUs, enabling efficient training and scalability.
Software Dependencies | No | The code used in DEPfold primarily draws from parts of the SuPar (Zhang et al., 2020a;b) GitHub repository (https://github.com/yzhangcs/parser.git). We implemented DEPfold using PyTorch. The architecture uses RoBERTa-base as the encoder within a biaffine framework. Specifically, the model uses the first four layers of RoBERTa-base, applying mean pooling to generate a 768-dimensional representation. ... For optimization, we used the AdamW optimizer...
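The described layer pooling (mean over the first four RoBERTa-base layers, yielding 768-dimensional token representations) might be sketched as below; `pool_first_layers` is a hypothetical helper, and the dummy tensors stand in for the hidden states a Hugging Face-style encoder returns when asked for all layers:

```python
import torch

def pool_first_layers(hidden_states, k=4):
    """Mean-pool the first k layers' hidden states.

    hidden_states: sequence of (batch, seq_len, 768) tensors, one per encoder layer.
    Returns a single (batch, seq_len, 768) tensor averaged over the first k layers.
    """
    return torch.stack(list(hidden_states[:k]), dim=0).mean(dim=0)

# Dummy stand-ins for per-layer RoBERTa-base hidden states.
states = [torch.randn(2, 16, 768) for _ in range(12)]
pooled = pool_first_layers(states)   # shape: (2, 16, 768)
```

The pooled representation would then feed the biaffine scorer; the exact wiring in DEPfold follows the SuPar codebase linked above.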
Experiment Setup | Yes | To mitigate overfitting, we applied a dropout rate of 0.1 to the encoder outputs and a dropout rate of 0.33 to the MLP layers. For optimization, we used the AdamW optimizer with a dual learning rate strategy: the encoder parameters were assigned a learning rate of 5×10⁻⁵, while the non-encoder parameters were set to 1×10⁻³. ... During training, we used a batch size of 32 to maximize GPU use. The training process was capped at 100 epochs, incorporating an early stopping mechanism based on the F1 score on the validation set.
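The dual learning rate strategy maps directly onto AdamW parameter groups in PyTorch. The modules below are toy stand-ins for the real encoder and MLP layers, but the optimizer construction itself matches the reported settings:

```python
import torch
from torch import nn

# Toy stand-ins for the RoBERTa encoder and the biaffine MLP layers.
encoder = nn.Linear(768, 768)
mlp = nn.Sequential(nn.Linear(768, 500), nn.Dropout(0.33), nn.Linear(500, 1))

# One parameter group per learning rate, as described in the setup:
# 5e-5 for encoder parameters, 1e-3 for everything else.
optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 5e-5},
    {"params": mlp.parameters(), "lr": 1e-3},
])
```

Separating the groups lets the pretrained encoder be fine-tuned gently while the randomly initialized task layers train at a higher rate, a common recipe when adapting pretrained language models.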