Improving peptide-spectrum matching by fragmentation prediction using Hidden Markov Models

2018
Author(s):
Ufuk Kirik
Jan C. Refsgaard
Lars J. Jensen

Abstract

Tandem mass spectrometry has become the method of choice for high-throughput, quantitative analysis in proteomics. However, since the link between the peptides and the proteins they originate from is typically broken, identification of the analyzed peptides relies on matching the fragmentation spectra (MS2) to theoretical spectra of candidate peptides, often filtered by precursor ion mass. To this end, peptide-spectrum matching algorithms score the concordance between the experimental and theoretical spectra of candidate peptides by evaluating the number (or proportion) of theoretically possible fragment ions observed in the experimental spectra, without discriminating between them. However, the assumption that each theoretical fragment is equally likely to be observed is inaccurate. On the contrary, MS2 spectra often have a few dominant fragments.

We propose a novel prediction algorithm based on a hidden Markov model, which allows the training process to be carried out very efficiently. Using millions of MS/MS spectra generated in our lab, we found overall good reproducibility across different fragmentation spectra of the same precursor peptide and charge state. This result implies that there is indeed a pattern to fragmentation that justifies using machine learning methods. Furthermore, the overall agreement between spectra of the same peptide at the same charge state serves as an upper limit on how well prediction algorithms can be expected to perform.

We have investigated the performance of a third-order HMM, trained on several million MS2 spectra, in various ways. Compared to a mock model, in which the fragment ions and their intensities are shuffled, we see a clear difference in prediction accuracy using our model. This result indicates that our model picks up meaningful patterns, i.e. we can indeed learn the fragmentation process. Secondly, looking at the variability of prediction performance while varying the train/test data split in a K-fold cross-validation scheme, we observed an overall robust model that performs well independently of the specific peptides present in the training data.

Last but not least, we propose that the real value of this model is as a pre-processing step in the peptide identification process: discerning which fragment ions are unlikely to be intense for a given candidate peptide, rather than using the actual predicted intensities. As such, probabilistic measures of concordance between experimental and theoretical spectra would benefit from better statistics.
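The pre-processing idea described in the abstract can be illustrated with a deliberately simplified sketch (this is not the authors' model): a third-order Markov-style predictor that estimates, from the three residues preceding a backbone cleavage site, how likely that site is to yield an intense fragment. The peptide sequences, intensity labels, and add-one smoothing below are all toy assumptions for illustration.

```python
from collections import defaultdict

# Toy sketch: learn, by counting, how often a 3-residue context preceding a
# cleavage site produces an intense fragment ion. A real model would use far
# richer context (charge state, ion series, position), as in the paper.

def train(examples, order=3):
    """examples: list of (peptide, set_of_intense_cleavage_positions)."""
    counts = defaultdict(lambda: [1, 2])  # context -> [intense, total], add-one smoothed
    for peptide, intense in examples:
        for pos in range(order, len(peptide)):   # cleavage before residue `pos`
            ctx = peptide[pos - order:pos]       # 3-residue context
            counts[ctx][1] += 1
            if pos in intense:
                counts[ctx][0] += 1
    return counts

def p_intense(counts, peptide, pos, order=3):
    """Estimated probability that the fragment from this cleavage is intense."""
    intense, total = counts.get(peptide[pos - order:pos], (1, 2))
    return intense / total

# Hypothetical training peptides with annotated intense cleavage positions
toy = [("ACDKEF", {3}), ("ACDKEG", {3}), ("ACDREF", {4})]
model = train(toy)
print(round(p_intense(model, "ACDKEH", 3), 2))  # → 0.6
```

Candidate fragments scoring below a threshold could then be down-weighted before peptide-spectrum matching, as the abstract suggests.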

2018
Vol 35 (13)
pp. 2208–2215
Author(s):
Ioannis A Tamposis
Konstantinos D Tsirigos
Margarita C Theodoropoulou
Panagiota I Kontou
Pantelis G Bagos

Abstract

Motivation: Hidden Markov Models (HMMs) are probabilistic models widely used in computational sequence analysis. HMMs are fundamentally unsupervised models; however, in the most important applications they are trained in a supervised manner: training examples accompanied by labels corresponding to different classes are given as input, and the set of parameters that maximizes the joint probability of sequences and labels is estimated. A main problem with this approach is that, in the majority of cases, labels are hard to find, and thus the amount of training data is limited. On the other hand, there are plenty of unclassified (unlabeled) sequences deposited in public databases that could potentially contribute to the training procedure. This approach is called semi-supervised learning and could be very helpful in many applications.

Results: We propose here a method for semi-supervised learning of HMMs that can incorporate labeled, unlabeled and partially labeled data in a straightforward manner. The algorithm is based on a variant of the Expectation-Maximization (EM) algorithm, where the missing labels of the unlabeled or partially labeled data are treated as the missing data. We apply the algorithm to several biological problems, namely the prediction of transmembrane protein topology for alpha-helical and beta-barrel membrane proteins and the prediction of archaeal signal peptides. The results are very promising, since the algorithms presented here can significantly improve the prediction performance of even top-scoring classifiers.

Supplementary information: Supplementary data are available at Bioinformatics online.
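The core mechanism in this abstract, treating missing labels as the missing data in EM, can be sketched in a minimal form (assumed details, not the paper's algorithm): labeled sequences contribute hard state counts, while unlabeled sequences contribute expected counts from forward-backward. For brevity this toy keeps the transitions and initial distribution uniform and re-estimates only the emissions.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Posterior state marginals gamma[t, i] for a discrete-emission HMM."""
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S)); beta = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def semi_supervised_em(labeled, unlabeled, S, V, n_iter=20):
    """labeled: list of (obs, states); unlabeled: list of obs (labels missing)."""
    pi = np.full(S, 1.0 / S)
    A = np.full((S, S), 1.0 / S)
    B = np.full((S, V), 1.0 / V)
    for _ in range(n_iter):
        em_counts = np.ones((S, V))            # add-one smoothed emission counts
        for obs, states in labeled:            # hard counts: labels observed
            for o, s in zip(obs, states):
                em_counts[s, o] += 1
        for obs in unlabeled:                  # E-step: labels are missing data
            gamma = forward_backward(obs, pi, A, B)
            for t, o in enumerate(obs):
                em_counts[:, o] += gamma[t]
        B = em_counts / em_counts.sum(axis=1, keepdims=True)   # M-step
    return B

labeled = [([0, 0, 1, 1], [0, 0, 1, 1])]       # one labeled toy sequence
unlabeled = [[0, 0, 1, 1], [0, 1, 1, 1]]       # unlabeled sequences help too
B = semi_supervised_em(labeled, unlabeled, S=2, V=2)
print(B.round(2))
```

Partially labeled sequences would fit the same scheme by clamping the known positions of gamma and letting forward-backward fill in the rest.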


2021
Author(s):
Brian Yao
Chloe Hsu
Gal Goldner
Yael Michaeli
Yuval Ebenstein
...  

Nanopore sequencing platforms combined with supervised machine learning (ML) have been effective at detecting base modifications in DNA such as 5mC and 6mA. These ML-based nanopore callers have typically been trained on data that span all modifications on all possible DNA k-mer backgrounds—a complete training dataset. However, as nanopore technology is pushed to more and more epigenetic modifications, such complete training data will not be feasible to obtain. Nanopore calling has historically been performed with Hidden Markov Models (HMMs) that cannot make successful calls for k-mer contexts not seen during training because of their independent emission distributions. However, deep neural networks (DNNs), which share parameters across contexts, are increasingly being used as callers, often outperforming their HMM cousins. It stands to reason that a DNN approach should be able to better generalize to unseen k-mer contexts. Indeed, herein we demonstrate that a common DNN approach (DeepSignal) outperforms a common HMM approach (Nanopolish) in the incomplete data setting. Furthermore, we propose a novel hybrid HMM-DNN approach, Amortized-HMM, that outperforms both the pure HMM and DNN approaches on 5mC calling when the training data are incomplete. Such an approach is expected to be useful for calling 5hmC and combinations of cytosine modifications, where complete training data are not likely to be available.
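The generalization gap described above comes from the HMM's independent per-k-mer emission distributions. A minimal sketch of the hybrid idea (illustrative only; the features, values, and linear model are assumptions, not the Amortized-HMM implementation) is to fit a parameter-sharing regressor on the k-mers seen in training and use it to supply emission parameters for unseen k-mer states:

```python
import numpy as np

BASES = "ACGT"

def featurize(kmer):
    """One-hot encode each position of the k-mer (features shared across contexts)."""
    x = np.zeros(len(kmer) * 4)
    for i, b in enumerate(kmer):
        x[i * 4 + BASES.index(b)] = 1.0
    return x

# Toy "measured" emission means for k-mers present in the training data
seen = {"AAA": 80.0, "AAC": 82.0, "ACA": 90.0, "CAA": 85.0, "CCC": 100.0}
X = np.array([featurize(k) for k in seen])
y = np.array(list(seen.values()))
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit shared parameters once

def emission_mean(kmer):
    """Predicted emission mean, usable even for k-mers absent from training."""
    return featurize(kmer) @ w

print(round(emission_mean("ACC"), 1))   # unseen k-mer still gets a value
```

A DNN in place of the linear model plays the same role; the HMM machinery (transitions, decoding) is unchanged, which is what makes the hybrid cheap to adopt.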


2000
Vol 12 (6)
pp. 1371–1398
Author(s):
Herbert Jaeger

A widely used class of models for stochastic systems is hidden Markov models. Systems that can be modeled by hidden Markov models are a proper subclass of linearly dependent processes, a class of stochastic systems known from mathematical investigations carried out over the past four decades. This article provides a novel, simple characterization of linearly dependent processes, called observable operator models. The mathematical properties of observable operator models lead to a constructive learning algorithm for the identification of linearly dependent processes. The core of the algorithm has a time complexity of O(N + nm³), where N is the size of the training data, n is the number of distinguishable outcomes of observations, and m is the model state-space dimension.
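The representation behind this abstract can be shown concretely (an illustrative sketch, not Jaeger's implementation): an observable operator model assigns one matrix per observable symbol, and a sequence's probability is a product of those operators applied to an initial state vector. Here the operators are derived from a small two-state HMM, since HMM-representable processes are a subclass of linearly dependent processes; the matrix values are arbitrary toy numbers.

```python
import numpy as np

M = np.array([[0.9, 0.1],     # HMM transition matrix (row-stochastic)
              [0.2, 0.8]])
O = np.array([[0.7, 0.3],     # emission probabilities P(symbol | state)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])     # initial state distribution

# One observable operator per symbol: tau_a = M^T diag(O[:, a])
tau = [M.T @ np.diag(O[:, a]) for a in range(2)]

def seq_prob(seq):
    """P(a_1 ... a_n) = 1^T tau_{a_n} ... tau_{a_1} pi."""
    v = pi.copy()
    for a in seq:
        v = tau[a] @ v
    return v.sum()

# Sanity check: the probabilities of all length-2 sequences sum to 1,
# matching direct marginalization over the hidden states.
total = sum(seq_prob([a, b]) for a in range(2) for b in range(2))
print(round(total, 6))   # → 1.0
```

The learning algorithm in the article goes the other way, estimating the operators directly from sequence statistics, which is what yields the O(N + nm³) core.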

