Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Blind Biological Sequence Denoising with Self-Supervised Set Learning

Authors: Nathan Hoyen Ng, Ji Won Park, Jae Hyeon Lee, Ryan Lewis Kelly, Stephen Ra, Kyunghyun Cho

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments on simulated long-read DNA data, SSSL methods denoise small reads of ≤ 6 subreads with 17% fewer errors and large reads of > 6 subreads with 8% fewer errors compared to the best baseline. On a real dataset of antibody sequences, SSSL improves over baselines on two self-supervised metrics, with a significant improvement on difficult small reads that comprise over 60% of the test set.
Researcher Affiliation | Collaboration | Nathan Ng (1,2,3), Ji Won Park (4), Jae Hyeon Lee (4), Ryan Lewis Kelly (4), Stephen Ra (4), Kyunghyun Cho (4,5,6). Affiliations: 1 University of Toronto; 2 Vector Institute; 3 MIT; 4 Prescient Design, Genentech; 5 New York University; 6 CIFAR Fellow.
Pseudocode | No | The paper describes the model framework and training objective in text and diagrams (Figure 1 and Figure 2), but does not provide a dedicated pseudocode block or algorithm listing.
Open Source Code | No | The paper states that specific model and training hyperparameters are provided in Appendix B and references third-party tools such as PBSIM2, MAFFT, MUSCLE, and T-Coffee. However, there is no explicit statement about making the authors' own implementation code or models publicly available, nor is there a link to a code repository.
Open Datasets | No | The paper states: 'We generate a set of 10,000 source sequences S using a procedure that mimics V-J recombination... Finally, using the PBSIM2 (Ono et al., 2020) long read simulator with an error profile mimicking the R9.5 Oxford Nanopore Flow Cell, we generate a read from s(i) with m(i) subreads.' and 'To investigate our model's ability to denoise real data, we use a proprietary experimental scFv antibody library sequenced with ONT.' These indicate simulated data and a proprietary dataset, not publicly available ones. The paper further notes: 'Although additional experiments on public datasets of similar size or heavy chain data would bolster our results, to our knowledge no real long read sequencing datasets exist at the scale of our scFv antibody dataset.'
Dataset Splits | Yes | 'We split these reads into a training, validation, and test set with a 90%/5%/5% split, respectively. ... As before, we split our data randomly into a training, validation, and test set by randomly sampling 90%/5%/5% of the data respectively.'
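The 90%/5%/5% random split described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `split_reads` and the fixed seed are assumptions for reproducibility of the example.

```python
import random

def split_reads(reads, fractions=(0.90, 0.05, 0.05), seed=0):
    """Randomly partition reads into train/validation/test sets
    according to the given fractions (here 90%/5%/5%)."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(reads)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# For the paper's 10,000 simulated reads this yields 9,000/500/500.
train, val, test = split_reads(range(10_000))
print(len(train), len(val), len(test))  # 9000 500 500
```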
Hardware Specification | No | The paper describes the model architecture, training process, and hyperparameter settings but does not explicitly mention any specific hardware used for running the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions using specific algorithms and models such as a transformer encoder and decoder, the Adam optimizer (Kingma & Ba, 2014), Batch Norm (Ioffe & Szegedy, 2015), and various baseline tools such as MAFFT (Katoh & Standley, 2013), MUSCLE (Edgar, 2004), T-Coffee (Di Tommaso et al., 2011), and PBSIM2 (Ono et al., 2020). However, it does not provide version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used for their implementation.
Experiment Setup | Yes | 'We preprocess our data by tokenizing sequences using a codon vocabulary of all 1-, 2-, and 3-mers. We learn a token and position embedding with dimension 64. Our sequence encoder and decoder are 4-layer transformers (Vaswani et al., 2017) with 8 attention heads and a hidden dimension of size 64. Our set transformer also uses a hidden dimension of size 64 with 8 attention heads. On top of the base encoder we apply an additional 3-layer projection head, each layer with dimension 64 and Batch Norm (Ioffe & Szegedy, 2015) layers between the linear layers. Decoding is performed via beam search with beam size 32. All models are trained with the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 and a batch size of 8 reads, although the total number of subreads present varies from batch to batch. We apply loss weighting values η = 10 and λ = 0.0001, and apply independent Gaussian noise to embeddings with a standard deviation of 0.01. Models and hyperparameters are selected based on validation LOO edit distance (Section 4.1).'
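The codon vocabulary of all 1-, 2-, and 3-mers over the DNA alphabet contains 4 + 16 + 64 = 84 tokens. The sketch below builds that vocabulary and applies a greedy left-to-right 3-mer segmentation; the paper does not specify its segmentation rule, so the `tokenize` scheme here is an illustrative assumption, not the authors' implementation.

```python
from itertools import product

BASES = "ACGT"

def build_codon_vocab():
    """All 1-, 2-, and 3-mers over A/C/G/T: 4 + 16 + 64 = 84 tokens."""
    vocab = []
    for k in (1, 2, 3):
        vocab.extend("".join(p) for p in product(BASES, repeat=k))
    return {tok: i for i, tok in enumerate(vocab)}

def tokenize(seq, vocab):
    """Greedy left-to-right 3-mer chunking; the final chunk may be a
    1- or 2-mer. Assumed segmentation, for illustration only."""
    return [vocab[seq[i:i + 3]] for i in range(0, len(seq), 3)]

vocab = build_codon_vocab()
print(len(vocab))  # 84
print(tokenize("ACGTACG", vocab))  # three tokens: ACG, TAC, G
```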