Improving peptide-spectrum matching by fragmentation prediction using Hidden Markov Models

2018
Author(s):
Ufuk Kirik
Jan C. Refsgaard
Lars J. Jensen

Abstract

Tandem mass spectrometry has become the method of choice for high-throughput, quantitative analysis in proteomics. However, since the link between the peptides and the proteins they originate from is typically broken, identification of the analyzed peptides relies on matching the fragmentation spectra (MS2) to theoretical spectra of candidate peptides, often filtered by precursor ion mass. To this end, peptide-spectrum matching algorithms score the concordance between the experimental and theoretical spectra of candidate peptides by evaluating the number (or proportion) of theoretically possible fragment ions observed in the experimental spectra, without discriminating between them. However, the assumption that each theoretical fragment is equally likely to be observed is inaccurate. On the contrary, MS2 spectra often have a few dominant fragments.

We propose a novel prediction algorithm based on a hidden Markov model, which allows the training process to be carried out very efficiently. Using millions of MS/MS spectra generated in our lab, we found overall good reproducibility across different fragmentation spectra of the same precursor peptide and charge state. This result implies that there is indeed a pattern to fragmentation that justifies using machine learning methods. Furthermore, the overall agreement between spectra of the same peptide at the same charge state serves as an upper limit on how well prediction algorithms can be expected to perform.

We have investigated the performance of a third-order HMM, trained on several million MS2 spectra, in various ways. Compared to a mock model, in which the fragment ions and their intensities are shuffled, we see a clear difference in prediction accuracy using our model. This result indicates that our model picks up meaningful patterns, i.e. we can indeed learn the fragmentation process. Secondly, looking at the variability of prediction performance while varying the train/test data split in a K-fold cross-validation scheme, we observed an overall robust model that performs well independently of the specific peptides present in the training data.

Last but not least, we propose that the real value of this model is as a pre-processing step in the peptide identification process: discerning which fragment ions are unlikely to be intense for a given candidate peptide, rather than using the actual predicted intensities. As such, probabilistic measures of concordance between experimental and theoretical spectra would benefit from better statistics.
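The pre-processing idea described in the abstract can be illustrated with a deliberately simplified sketch (this is not the authors' model): a third-order Markov-style predictor that estimates, from the three residues preceding a backbone cleavage site, how likely that site is to yield an intense fragment. The peptide sequences, intensity labels, and add-one smoothing below are all toy assumptions for illustration.

```python
from collections import defaultdict

# Toy sketch: learn, by counting, how often a 3-residue context preceding a
# cleavage site produces an intense fragment ion. A real model would use far
# richer context (charge state, ion series, position), as in the paper.

def train(examples, order=3):
    """examples: list of (peptide, set_of_intense_cleavage_positions)."""
    counts = defaultdict(lambda: [1, 2])  # context -> [intense, total], add-one smoothed
    for peptide, intense in examples:
        for pos in range(order, len(peptide)):   # cleavage before residue `pos`
            ctx = peptide[pos - order:pos]       # 3-residue context
            counts[ctx][1] += 1
            if pos in intense:
                counts[ctx][0] += 1
    return counts

def p_intense(counts, peptide, pos, order=3):
    """Estimated probability that the fragment from this cleavage is intense."""
    intense, total = counts.get(peptide[pos - order:pos], (1, 2))
    return intense / total

# Hypothetical training peptides with annotated intense cleavage positions
toy = [("ACDKEF", {3}), ("ACDKEG", {3}), ("ACDREF", {4})]
model = train(toy)
print(round(p_intense(model, "ACDKEH", 3), 2))  # → 0.6
```

Candidate fragments scoring below a threshold could then be down-weighted before peptide-spectrum matching, as the abstract suggests.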

2018
Vol 35 (13)
pp. 2208–2215
Author(s):
Ioannis A Tamposis
Konstantinos D Tsirigos
Margarita C Theodoropoulou
Panagiota I Kontou
Pantelis G Bagos

Abstract

Motivation: Hidden Markov Models (HMMs) are probabilistic models widely used in computational sequence analysis. HMMs are fundamentally unsupervised models; however, in the most important applications they are trained in a supervised manner: training examples accompanied by labels corresponding to different classes are given as input, and the set of parameters that maximizes the joint probability of sequences and labels is estimated. A main problem with this approach is that, in the majority of cases, labels are hard to find, and thus the amount of training data is limited. On the other hand, there are plenty of unclassified (unlabeled) sequences deposited in public databases that could potentially contribute to the training procedure. This approach is called semi-supervised learning and could be very helpful in many applications.

Results: We propose here a method for semi-supervised learning of HMMs that can incorporate labeled, unlabeled and partially labeled data in a straightforward manner. The algorithm is based on a variant of the Expectation-Maximization (EM) algorithm, where the missing labels of the unlabeled or partially labeled data are treated as the missing data. We apply the algorithm to several biological problems, namely the prediction of transmembrane protein topology for alpha-helical and beta-barrel membrane proteins and the prediction of archaeal signal peptides. The results are very promising, since the algorithms presented here can significantly improve the prediction performance of even top-scoring classifiers.

Supplementary information: Supplementary data are available at Bioinformatics online.
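The core mechanism in this abstract, treating missing labels as the missing data in EM, can be sketched in a minimal form (assumed details, not the paper's algorithm): labeled sequences contribute hard state counts, while unlabeled sequences contribute expected counts from forward-backward. For brevity this toy keeps the transitions and initial distribution uniform and re-estimates only the emissions.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Posterior state marginals gamma[t, i] for a discrete-emission HMM."""
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S)); beta = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def semi_supervised_em(labeled, unlabeled, S, V, n_iter=20):
    """labeled: list of (obs, states); unlabeled: list of obs (labels missing)."""
    pi = np.full(S, 1.0 / S)
    A = np.full((S, S), 1.0 / S)
    B = np.full((S, V), 1.0 / V)
    for _ in range(n_iter):
        em_counts = np.ones((S, V))            # add-one smoothed emission counts
        for obs, states in labeled:            # hard counts: labels observed
            for o, s in zip(obs, states):
                em_counts[s, o] += 1
        for obs in unlabeled:                  # E-step: labels are missing data
            gamma = forward_backward(obs, pi, A, B)
            for t, o in enumerate(obs):
                em_counts[:, o] += gamma[t]
        B = em_counts / em_counts.sum(axis=1, keepdims=True)   # M-step
    return B

labeled = [([0, 0, 1, 1], [0, 0, 1, 1])]       # one labeled toy sequence
unlabeled = [[0, 0, 1, 1], [0, 1, 1, 1]]       # unlabeled sequences help too
B = semi_supervised_em(labeled, unlabeled, S=2, V=2)
print(B.round(2))
```

Partially labeled sequences would fit the same scheme by clamping the known positions of gamma and letting forward-backward fill in the rest.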


2021
Author(s):
Brian Yao
Chloe Hsu
Gal Goldner
Yael Michaeli
Yuval Ebenstein
...  

Nanopore sequencing platforms combined with supervised machine learning (ML) have been effective at detecting base modifications in DNA such as 5mC and 6mA. These ML-based nanopore callers have typically been trained on data that span all modifications on all possible DNA k-mer backgrounds—a complete training dataset. However, as nanopore technology is pushed to more and more epigenetic modifications, such complete training data will not be feasible to obtain. Nanopore calling has historically been performed with Hidden Markov Models (HMMs) that cannot make successful calls for k-mer contexts not seen during training because of their independent emission distributions. However, deep neural networks (DNNs), which share parameters across contexts, are increasingly being used as callers, often outperforming their HMM cousins. It stands to reason that a DNN approach should be able to better generalize to unseen k-mer contexts. Indeed, herein we demonstrate that a common DNN approach (DeepSignal) outperforms a common HMM approach (Nanopolish) in the incomplete data setting. Furthermore, we propose a novel hybrid HMM-DNN approach, Amortized-HMM, that outperforms both the pure HMM and DNN approaches on 5mC calling when the training data are incomplete. Such an approach is expected to be useful for calling 5hmC and combinations of cytosine modifications, where complete training data are not likely to be available.
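The generalization gap described above comes from the HMM's independent per-k-mer emission distributions. A minimal sketch of the hybrid idea (illustrative only; the features, values, and linear model are assumptions, not the Amortized-HMM implementation) is to fit a parameter-sharing regressor on the k-mers seen in training and use it to supply emission parameters for unseen k-mer states:

```python
import numpy as np

BASES = "ACGT"

def featurize(kmer):
    """One-hot encode each position of the k-mer (features shared across contexts)."""
    x = np.zeros(len(kmer) * 4)
    for i, b in enumerate(kmer):
        x[i * 4 + BASES.index(b)] = 1.0
    return x

# Toy "measured" emission means for k-mers present in the training data
seen = {"AAA": 80.0, "AAC": 82.0, "ACA": 90.0, "CAA": 85.0, "CCC": 100.0}
X = np.array([featurize(k) for k in seen])
y = np.array(list(seen.values()))
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit shared parameters once

def emission_mean(kmer):
    """Predicted emission mean, usable even for k-mers absent from training."""
    return featurize(kmer) @ w

print(round(emission_mean("ACC"), 1))   # unseen k-mer still gets a value
```

A DNN in place of the linear model plays the same role; the HMM machinery (transitions, decoding) is unchanged, which is what makes the hybrid cheap to adopt.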


2000
Vol 12 (6)
pp. 1371–1398
Author(s):
Herbert Jaeger

A widely used class of models for stochastic systems is hidden Markov models. Systems that can be modeled by hidden Markov models are a proper subclass of linearly dependent processes, a class of stochastic systems known from mathematical investigations carried out over the past four decades. This article provides a novel, simple characterization of linearly dependent processes, called observable operator models. The mathematical properties of observable operator models lead to a constructive learning algorithm for the identification of linearly dependent processes. The core of the algorithm has a time complexity of O(N + nm³), where N is the size of the training data, n is the number of distinguishable outcomes of observations, and m is the model state-space dimension.
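The representation behind this abstract can be shown concretely (an illustrative sketch, not Jaeger's implementation): an observable operator model assigns one matrix per observable symbol, and a sequence's probability is a product of those operators applied to an initial state vector. Here the operators are derived from a small two-state HMM, since HMM-representable processes are a subclass of linearly dependent processes; the matrix values are arbitrary toy numbers.

```python
import numpy as np

M = np.array([[0.9, 0.1],     # HMM transition matrix (row-stochastic)
              [0.2, 0.8]])
O = np.array([[0.7, 0.3],     # emission probabilities P(symbol | state)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])     # initial state distribution

# One observable operator per symbol: tau_a = M^T diag(O[:, a])
tau = [M.T @ np.diag(O[:, a]) for a in range(2)]

def seq_prob(seq):
    """P(a_1 ... a_n) = 1^T tau_{a_n} ... tau_{a_1} pi."""
    v = pi.copy()
    for a in seq:
        v = tau[a] @ v
    return v.sum()

# Sanity check: the probabilities of all length-2 sequences sum to 1,
# matching direct marginalization over the hidden states.
total = sum(seq_prob([a, b]) for a in range(2) for b in range(2))
print(round(total, 6))   # → 1.0
```

The learning algorithm in the article goes the other way, estimating the operators directly from sequence statistics, which is what yields the O(N + nm³) core.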

