DEPfold: RNA Secondary Structure Prediction as Dependency Parsing.
Authors: Ke Wang, Shay B. Cohen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DEPfold on both within-family and cross-family RNA datasets, demonstrating significant performance improvements over existing methods. DEPfold shows strong performance in cross-family generalization when trained on data augmented by traditional energy-based models, outperforming existing methods on the bpRNA-new dataset. |
| Researcher Affiliation | Academia | Ke Wang, Shay B. Cohen, School of Informatics, The University of Edinburgh |
| Pseudocode | Yes | The pseudocode can be found in Appendix A, Algorithm 2. Appendix A (Pseudocode for RNA Secondary Structure to Dependency Structure): To clearly explain the algorithmic logic for converting RNA secondary structures into dependency structures, we present the following pseudocode. Algorithm 1 is the main program, which uses the get_pair function defined in Algorithm 2 to generate binary tree structures from stem and pseudoknot sequences. During its processing, the get_pair function utilizes the is_connect function defined in Algorithm 3 for decision-making. When handling unpaired structures, the algorithm employs the get_pairs function defined in Algorithm 4. |
| Open Source Code | Yes | Our code is available at https://github.com/Vicky-0256/DEPfold.git. |
| Open Datasets | Yes | We evaluate DEPfold on four widely-used RNA structure prediction benchmark datasets: RNAStrAlign (Tan et al., 2017) contains 37,149 structures from 8 RNA families. ... ArchiveII (Sloma & Mathews, 2016), comprising 3,975 structures from 10 RNA families, serves as a standard benchmark for classical RNA folding methods. ... bpRNA-1m (Singh et al., 2019) includes 102,318 structures from 2,588 RNA families. ... bpRNA-new (Kalvari et al., 2017), derived from Rfam 14.2, contains sequences from 1,500 novel RNA families and is used to assess cross-family generalization. |
| Dataset Splits | Yes | Table 1: Summary of datasets used in our experiments. RNAStrAlign: Train 28,969 seqs (len. 30–1581), Val 3,629 (36–1693), Test 2,810 (57–1672); bpRNA-1m: TR0 10,814 (33–498), VL0 1,300 (33–497), TS0 1,305 (22–499). |
| Hardware Specification | Yes | All experiments were conducted on four NVIDIA A100-40GB GPUs, enabling efficient training and scalability. |
| Software Dependencies | No | The code used in DEPfold primarily draws from parts of the SuPar (Zhang et al., 2020a;b) GitHub repository (https://github.com/yzhangcs/parser.git). We implemented DEPfold using PyTorch. The architecture uses RoBERTa-base as the encoder within a biaffine framework. Specifically, the model uses the first four layers of RoBERTa-base, applying mean pooling to generate a 768-dimensional representation. ... For optimization, we used the AdamW optimizer... |
| Experiment Setup | Yes | To mitigate overfitting, we applied a dropout rate of 0.1 to the encoder outputs and a dropout rate of 0.33 to the MLP layers. For optimization, we used the AdamW optimizer with a dual learning rate strategy: the encoder parameters were assigned a learning rate of 5×10⁻⁵, while the non-encoder parameters were set to 1×10⁻³. ... During training, we used a batch size of 32 to maximize GPU use. The training process was capped at 100 epochs, incorporating an early stopping mechanism based on the F1 score on the validation set. |
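
The dual learning rate strategy quoted in the Experiment Setup row can be sketched in PyTorch. This is a minimal sketch only: `ToyParser` and its layer sizes are hypothetical stand-ins (the paper's actual model uses the first four layers of RoBERTa-base with mean pooling inside a biaffine framework), but the AdamW parameter grouping mirrors the quoted hyperparameters (encoder at 5×10⁻⁵, non-encoder at 1×10⁻³, dropout 0.1 / 0.33).

```python
import torch
from torch import nn

class ToyParser(nn.Module):
    """Hypothetical placeholder model with an `encoder` submodule and a head.

    The real DEPfold encoder is the first four layers of RoBERTa-base
    producing a 768-dimensional representation; a single Linear layer
    stands in for it here so the sketch is self-contained.
    """
    def __init__(self, d_model: int = 768, d_mlp: int = 500):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_model)   # placeholder encoder
        self.enc_dropout = nn.Dropout(0.1)           # dropout on encoder outputs
        self.arc_mlp = nn.Linear(d_model, d_mlp)     # non-encoder head
        self.mlp_dropout = nn.Dropout(0.33)          # dropout on MLP layers

model = ToyParser()

# Split parameters into encoder vs. everything else.
encoder_params = list(model.encoder.parameters())
encoder_ids = {id(p) for p in encoder_params}
other_params = [p for p in model.parameters() if id(p) not in encoder_ids]

# Dual learning rates via AdamW parameter groups, as in the quoted setup.
optimizer = torch.optim.AdamW([
    {"params": encoder_params, "lr": 5e-5},
    {"params": other_params, "lr": 1e-3},
])
```

Passing a list of dicts to `torch.optim.AdamW` creates one parameter group per dict, so a single `optimizer.step()` applies both learning rates without maintaining two optimizers.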