PIPI: PTM-Invariant Peptide Identification Using Coding Method

2016
Author(s):
Fengchao Yu
Ning Li
Weichuan Yu

Abstract: In computational proteomics, identifying peptides with an unlimited number of post-translational modification (PTM) types is a challenging task. The computational cost increases exponentially with the number of modifiable amino acids and linearly with the number of potential PTM types at each amino acid, so the problem becomes intractable very quickly if we enumerate all possible modification patterns. Existing tools (e.g., MS-Alignment, ProteinProspector, and MODa) avoid this enumeration by using an alignment-based approach to localize and characterize modified amino acids. However, due to the large search space and the PTM localization issue, the sensitivity of these tools is low. This paper proposes a novel method named PIPI to achieve PTM-invariant peptide identification. PIPI first codes peptide sequences into Boolean vectors and converts experimental spectra into real-valued vectors. Then, it finds the top 10 peptide-coded vectors for each spectrum-coded vector. After that, PIPI uses a dynamic programming algorithm to localize and characterize modified amino acids. Simulations and real-data experiments have shown that PIPI outperforms existing tools by identifying more peptide-spectrum matches (PSMs) and reporting fewer false positives. It also runs much faster than existing tools when the database is large.
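The coding idea can be sketched in a few lines. Everything below is an illustrative simplification, not PIPI's actual implementation: the Boolean code is taken over fixed-length 3-mer tags, the spectrum vector is assumed to live in the same tag space, and candidates are ranked by a plain dot product.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TAG_LEN = 3
# Index every possible 3-mer tag; a hypothetical stand-in for PIPI's code space.
TAG_INDEX = {''.join(t): i for i, t in enumerate(product(AMINO_ACIDS, repeat=TAG_LEN))}

def code_peptide(seq):
    """Boolean vector marking which 3-mer tags occur in the peptide sequence."""
    vec = [False] * len(TAG_INDEX)
    for i in range(len(seq) - TAG_LEN + 1):
        vec[TAG_INDEX[seq[i:i + TAG_LEN]]] = True
    return vec

def score(spectrum_vec, peptide_vec):
    """Dot product between a real-valued spectrum vector and a Boolean peptide vector."""
    return sum(s for s, p in zip(spectrum_vec, peptide_vec) if p)

def top_candidates(spectrum_vec, peptides, k=10):
    """Rank candidate peptides by coded-vector similarity and keep the top k."""
    return sorted(peptides, key=lambda p: score(spectrum_vec, code_peptide(p)),
                  reverse=True)[:k]
```

A modified residue changes only the tags that cover it, so most of a peptide's coded vector survives modification; this is what makes the candidate retrieval "PTM-invariant" before the dynamic programming step localizes the modifications.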

2016
Vol 14 (04)
pp. 1643001
Author(s):
Jin Li
Chengzhen Xu
Lei Wang
Hong Liang
Weixing Feng
...  

Prediction of RNA secondary structures is an important problem in computational biology and bioinformatics, since RNA secondary structures are fundamental for functional analysis of RNA molecules. However, known small-RNA secondary structures are scarce, and few algorithms have been specifically designed for predicting the secondary structures of small RNAs. Here we propose an algorithm named “PSRna” for predicting small-RNA secondary structures using reverse complementary folding and the characteristic hairpin loops of small RNAs. Unlike traditional algorithms, which often generate multi-branch loops and 5′-end self-folding, PSRna first estimates the maximum number of base pairs of the RNA secondary structure using a dynamic programming algorithm, constructing a path matrix at the same time. Second, backtracking paths are extracted from the path matrix with a backtracking algorithm, and each backtracking path represents a secondary structure. To improve accuracy, the predicted RNA secondary structures are filtered by their free energy, and only the structure with the minimum free energy is retained as the candidate secondary structure. Our experiments on real data show that the proposed algorithm is superior to two popular methods, RNAfold and RNAstructure, in terms of sensitivity, specificity and Matthews correlation coefficient (MCC).
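The base-pair-maximization step can be illustrated with the classical Nussinov recurrence, on which this style of dynamic program is based. The sketch below is the textbook version, not PSRna's path-matrix construction; the minimum hairpin-loop length of 3 and the simple traceback are assumptions.

```python
def can_pair(a, b):
    """Watson-Crick pairs plus the G-U wobble pair."""
    return {a, b} in ({'A', 'U'}, {'G', 'C'}, {'G', 'U'})

def nussinov(seq, min_loop=3):
    """M[i][j] = max base pairs in seq[i..j]; traceback recovers one optimal pairing."""
    n = len(seq)
    M = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = M[i][j - 1]                      # j left unpaired
            for k in range(i, j - min_loop):        # pair (k, j), loop >= min_loop
                if can_pair(seq[k], seq[j]):
                    left = M[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + M[k + 1][j - 1])
            M[i][j] = best
    # Traceback: re-check which choice produced each cell's value.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if M[i][j] == M[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if can_pair(seq[k], seq[j]):
                left = M[i][k - 1] if k > i else 0
                if left + 1 + M[k + 1][j - 1] == M[i][j]:
                    pairs.append((k, j))
                    stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return M[0][n - 1], sorted(pairs)
```

PSRna then filters the structures recovered from its path matrix by free energy; this sketch stops at the pair count and one traceback.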


2018
Vol 29 (01)
pp. 63-90
Author(s):
Safia Kedad-Sidhoum
Florence Monna
Grégory Mounié
Denis Trystram

More and more parallel computing platforms are built upon hybrid architectures combining multi-core processors (CPUs) and hardware accelerators such as General Purpose Graphics Processing Units (GPGPUs). We present in this paper a new method for efficiently scheduling parallel applications with [Formula: see text] CPUs and [Formula: see text] GPGPUs, where each task of the application can be processed either on a regular core (CPU) or on a GPGPU. We consider the problem of scheduling [Formula: see text] independent tasks with the objective of minimizing the time for completing the whole application (the makespan). This problem is NP-hard; we therefore present two families of approximation algorithms that achieve approximation ratios of [Formula: see text] or [Formula: see text] for any integer [Formula: see text] when only one GPGPU is considered, and [Formula: see text] or [Formula: see text] for [Formula: see text] GPGPUs, where [Formula: see text] is an arbitrarily small value corresponding to the target accuracy of a binary search. The proposed method is based on a dual approximation scheme that uses a dynamic programming algorithm. The associated computational cost per step of dual approximation is in [Formula: see text] for the first family and in [Formula: see text] for the second. The greater the value of the parameter [Formula: see text], the better the approximation, but the higher the computational cost. Finally, we propose a relaxed version of the algorithm that achieves a running time in [Formula: see text] with a constant approximation bound of [Formula: see text]. This last result is compared with the state-of-the-art algorithm HEFT. The proposed method is the first general-purpose algorithm for scheduling on hybrid machines with a theoretical performance guarantee that can be used in practice.
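To make the assignment dynamic program concrete, here is a minimal sketch for the special case of one CPU and one GPGPU: each task carries a CPU time and a GPU time, and the DP keeps, for every reachable CPU load, the minimal GPU load. This is only an illustration; the paper's actual algorithm wraps such a DP inside a dual approximation scheme driven by a binary search on the makespan guess.

```python
def min_makespan(tasks):
    """tasks: list of (cpu_time, gpu_time) pairs. One CPU, one GPU.
    states maps a reachable CPU load to the minimal GPU load achieving it."""
    states = {0: 0}
    for c, g in tasks:
        nxt = {}
        for cl, gl in states.items():
            # Option 1: run this task on the CPU.
            if nxt.get(cl + c, float('inf')) > gl:
                nxt[cl + c] = gl
            # Option 2: run this task on the GPU.
            if nxt.get(cl, float('inf')) > gl + g:
                nxt[cl] = gl + g
        states = nxt
    # The makespan of an assignment is the busier of the two machines.
    return min(max(cl, gl) for cl, gl in states.items())
```

With integer processing times the state space is bounded by the total CPU time, which is what makes the per-step cost of the dual approximation polynomial.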


2019
Vol 08 (04)
pp. 1950014
Author(s):
Yunlong Wang
Changliang Zou
Zhaojun Wang
Guosheng Yin

Change-point detection is an integral component of statistical modeling and estimation. For high-dimensional data, classical methods based on the Mahalanobis distance are typically inapplicable. We propose a novel test statistic that combines a modified Euclidean distance and an extreme statistic, and its null distribution is asymptotically normal. The new method naturally strikes a balance between the detection abilities for dense and sparse changes, giving it the potential to outperform existing methods. Furthermore, the number of change-points is determined by a new Schwarz information criterion together with a pre-screening procedure, and the locations of the change-points can be estimated via a dynamic programming algorithm in conjunction with the intrinsic order structure of the objective function. Under mild conditions, we show that the new method provides consistent estimation at an almost optimal rate. Simulation studies show that the proposed method identifies multiple change-points with satisfactory power and estimation accuracy, and two real-data examples are used for illustration.
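The localization step can be illustrated with a generic optimal-partitioning dynamic program. The sketch below uses a least-squares segment cost and a fixed per-change-point penalty, both stand-ins for the paper's test statistic and information criterion.

```python
def detect_changepoints(x, penalty):
    """Optimal partitioning: minimize within-segment squared error plus a
    penalty per change-point, in O(n^2) via the additive order structure."""
    n = len(x)
    # Prefix sums give O(1) segment costs.
    s = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(x):
        s[i + 1] = s[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(i, j):
        """Squared error of segment x[i:j] around its mean."""
        m = (s[j] - s[i]) / (j - i)
        return s2[j] - s2[i] - m * (s[j] - s[i])

    F = [0.0] * (n + 1)      # F[j] = best objective for x[:j]
    back = [0] * (n + 1)     # back[j] = start of the last segment
    for j in range(1, n + 1):
        best, arg = float('inf'), 0
        for i in range(j):
            v = F[i] + cost(i, j) + (penalty if i > 0 else 0.0)
            if v < best:
                best, arg = v, i
        F[j], back[j] = best, arg
    # Trace back the change-point locations.
    cps, j = [], n
    while back[j] > 0:
        cps.append(back[j])
        j = back[j]
    return sorted(cps)
```

The pre-screening step in the paper restricts the candidate split points before such a DP runs, which is what keeps the procedure feasible for long sequences.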


2015
Vol 77 (20)
Author(s):
F. N. Muhamad
R. B. Ahmad
S. Mohd. Asi
M. N. Murad

The fundamental procedure for analyzing sequence content is sequence comparison: finding which parts of two sequences are similar and which parts differ. A typical approach to this problem is to find a good alignment between the two sequences. The main goal of this project is to align DNA sequences using the Needleman-Wunsch algorithm for global alignment and the Smith-Waterman algorithm for local alignment, both based on dynamic programming. The dynamic programming algorithm is guaranteed to find the optimal alignment, exploring all possible alignments and choosing the best through scoring and traceback. The proposed algorithms aim to reduce the number of gaps in the aligned sequences, as well as the length of the sequences aligned, without compromising the quality or correctness of the results. To verify the accuracy and consistency of the Needleman-Wunsch and Smith-Waterman results, they are compared with EMBOSS (global) and EMBOSS (local) on test data of 600 strands.
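A minimal Needleman-Wunsch implementation with traceback, using illustrative scoring parameters (match +1, mismatch -1, gap -2; the project's actual scoring scheme is not given in the abstract):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = i * gap          # leading gaps in b
    for j in range(1, m + 1):
        H[0][j] = j * gap          # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
    # Traceback from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + \
                (match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return H[n][m], ''.join(reversed(out_a)), ''.join(reversed(out_b))
```

Smith-Waterman differs only in clamping each cell at zero and tracing back from the maximum cell, which yields the best local rather than global alignment.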


2014
Vol 2014
pp. 1-13
Author(s):
Xiang Li
Mohammad Reza Bonyadi
Zbigniew Michalewicz
Luigi Barone

This paper presents a hybrid evolutionary algorithm for the wheat blending problem. The unique constraints of this problem make many existing algorithms fail: either they do not generate acceptable results or they cannot complete the optimization within the required time. The proposed algorithm starts with a filtering process that applies predefined rules to reduce the search space. Then the linear-relaxed version of the problem is solved using a standard linear programming algorithm. The result is combined with a solution generated by a heuristic method to produce an initial solution. After that, a hybrid of an evolutionary algorithm, a heuristic method, and a linear programming solver is used to improve the quality of the solution. A local-search-based post-tuning method is also incorporated into the algorithm. The proposed algorithm has been tested on artificial test cases as well as real data from past years. Results show that it finds high-quality solutions in all cases and outperforms the existing method in both quality and speed.


2019
Vol 47 (13)
pp. e77-e77
Author(s):
Xinzhou Ge
Haowen Zhang
Lingjue Xie
Wei Vivian Li
Soo Bin Kwon
...  

Abstract: The availability of genome-wide epigenomic datasets enables in-depth studies of epigenetic modifications and their relationships with chromatin structures and gene expression. Various alignment tools have been developed to align nucleotide or protein sequences in order to identify structurally similar regions. However, there are currently no alignment methods specifically designed for comparing multi-track epigenomic signals and detecting common patterns that may explain functional or evolutionary similarities. We propose a new local alignment algorithm, EpiAlign, designed to compare chromatin state sequences learned from multi-track epigenomic signals and to identify locally aligned chromatin regions. EpiAlign is a dynamic programming algorithm that incorporates, in a novel way, the varying lengths and frequencies of chromatin states. We demonstrate the efficacy of EpiAlign through extensive simulations and studies on real data from the NIH Roadmap Epigenomics project. EpiAlign is able to extract recurrent chromatin state patterns along a single epigenome, and many of these patterns carry cell-type-specific characteristics. EpiAlign can also detect common chromatin state patterns across multiple epigenomes, and it will serve as a useful tool for grouping and distinguishing epigenomic samples based on genome-wide or local chromatin state patterns.
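Since EpiAlign performs local alignment over chromatin state sequences, the core recurrence resembles Smith-Waterman applied to state labels. The sketch below shows only that plain variant: EpiAlign's actual scoring additionally weights state lengths and frequencies, which is not modeled here, and the state names and score values are made up for illustration.

```python
def local_align_states(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman-style local alignment over two chromatin-state
    sequences (lists of state labels, e.g. run-length-encoded segments).
    Returns (best_score, (i, j)) where the best local alignment ends."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Clamping at zero lets an alignment restart anywhere (local).
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, pos = H[i][j], (i, j)
    return best, pos
```

Working on state labels rather than nucleotides is what lets the method compare epigenomes at the scale of chromatin segments instead of base pairs.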


Author(s):  
Rui Wang
Xin Xin
Wei Chang
Kun Ming
Biao Li
...  

In this paper, we investigate how to improve Chinese named entity recognition (NER) by jointly modeling NER and constituent parsing in the framework of neural conditional random fields (CRF). We reformulate the parsing task as height-limited constituent parsing, which significantly reduces the computational complexity while retaining the majority of phrase-level grammars. Specifically, a unified model of a neural semi-CRF and a neural tree-CRF is proposed, which simultaneously conducts word segmentation, part-of-speech (POS) tagging, NER, and parsing. The challenge lies in training and inference for the joint model, which had not been solved previously. We design a dynamic programming algorithm for both training and inference, whose complexity is O(n·4^h), where n is the sentence length and h is the height limit. In addition, we derive a pruning algorithm for the joint model, which prunes 99.9% of the search space while losing only 2% of the ground-truth data. Experimental results on the OntoNotes 4.0 dataset demonstrate that the proposed model outperforms the state-of-the-art method by 2.79 points in F1-measure.


2021
Author(s):
Le Zhang
Geng Liu
Guixue Hou
Haitao Xiang
Xi Zhang
...  

Although database search tools originally developed for shotgun proteomics have been widely used for immunopeptidomic mass spectrometry identification, they have been reported to achieve undesirably low sensitivity and/or high false-positive rates as a result of the hugely inflated search space caused by the lack of specific enzymatic digestion in the immunopeptidome. To overcome this problem, we have developed a motif-guided immunopeptidome database building tool named IntroSpect, which first learns peptide motifs from high-confidence hits in an initial search and then builds a targeted database for a refined search. Evaluated on three representative HLA class I datasets, IntroSpect improves sensitivity by an average of 80% compared to conventional searches with unspecific digestion, while maintaining very high accuracy (~96%), as confirmed by synthetic validation experiments. A distinct advantage of IntroSpect is that it does not depend on any external HLA data, so it performs equally well on both well-studied and poorly studied HLA types, unlike the previously developed method SpectMHC. We have also designed IntroSpect to maintain a global FDR that can be conveniently controlled, similar to conventional database search engines. Finally, we demonstrate the practical value of IntroSpect by discovering neoantigens directly from MS data. IntroSpect is freely available at https://github.com/BGI2016/IntroSpect.
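The motif-learning step can be sketched as building a position weight matrix from high-confidence peptides of a fixed length and scoring database candidates against it. This is a crude stand-in for IntroSpect's motif model: the peptide length of 9, the pseudocount, and the product score are all illustrative assumptions.

```python
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"

def position_weight_matrix(peptides, length=9, pseudo=0.5):
    """Per-position residue frequencies from high-confidence peptides of one
    length, with a pseudocount so unseen residues keep nonzero probability."""
    peps = [p for p in peptides if len(p) == length]
    pwm = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peps)
        total = len(peps) + pseudo * len(AA)
        pwm.append({a: (counts[a] + pseudo) / total for a in AA})
    return pwm

def motif_score(pwm, peptide):
    """Product of per-position frequencies; candidates scoring above a cutoff
    would go into the targeted database for the refined search."""
    s = 1.0
    for pos, a in enumerate(peptide):
        s *= pwm[pos][a]
    return s
```

Because the motif is learned from the dataset's own confident hits, the approach needs no external HLA binding data, which matches the advantage the abstract describes.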


2021
Author(s):
Yiling Elaine Chen
Kyla Woyshner
MeiLu McDermott
Antigoni Manousopoulou
Scott Ficarro
...  

Abstract: Advances in mass spectrometry (MS) have enabled high-throughput analysis of proteomes in biological systems. State-of-the-art MS data analysis relies on database search algorithms to quantify proteins by identifying peptide-spectrum matches (PSMs), which convert mass spectra to peptide sequences. Different database search algorithms use distinct search strategies and thus may identify unique PSMs. However, no existing approach can aggregate all user-specified database search algorithms with guaranteed control of the false discovery rate (FDR) and a guaranteed increase in the number of identified peptides. To fill this gap, we propose a statistical framework, Aggregation of Peptide Identification Results (APIR), that is universally compatible with all database search algorithms. Notably, under a target FDR threshold, APIR is guaranteed to identify at least as many, if not more, peptides as individual database search algorithms do. Evaluation of APIR on a complex protein standard shows that APIR outperforms individual database search algorithms and guarantees FDR control. Real-data studies show that APIR can identify disease-related proteins and post-translational modifications missed by some individual database search algorithms. Note that the APIR framework is easily extendable to aggregating discoveries made by multiple algorithms in other high-throughput biomedical data analyses, e.g., differential gene expression analysis on RNA sequencing data.
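For background, the basic target-decoy FDR estimate that individual search engines threshold against can be sketched as follows. This is the conventional estimate only, not APIR's aggregation statistics, and the (score, is_decoy) input format is hypothetical.

```python
def fdr_threshold(psms, alpha=0.05):
    """psms: list of (score, is_decoy) pairs. Walk down the score-ranked list
    and keep the longest prefix whose estimated FDR (#decoys / #targets)
    stays at or below alpha; return the accepted target scores."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    decoys = targets = 0
    accepted, best = [], []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        accepted.append((score, is_decoy))
        if targets and decoys / targets <= alpha:
            best = list(accepted)   # this prefix satisfies the FDR estimate
    return [s for s, d in best if not d]
```

The difficulty APIR addresses is that naively pooling such thresholded lists from several engines does not preserve the FDR guarantee; its framework aggregates the engines' outputs while keeping the target FDR controlled.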


2012
Vol 198-199
pp. 1527-1530
Author(s):
Xue Min Zhang
Xiao Wen Chen
Jia Lin Jiao
Jia Lin Jiao

Drawing on the strengths of the exhaustive dynamic programming algorithm, and on the idea of deriving a global optimal solution from local optimal solutions, this paper proposes a new structure-based join-selection algorithm. The algorithm first joins within sub-trees and then joins the sub-tree results into the whole structure. Although it does not guarantee an optimal solution, the algorithm substantially improves time complexity, reduces the search space, and improves efficiency.
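The exhaustive dynamic program the proposed heuristic builds on is the classical subset DP for join ordering: the best plan for a set of relations is derived from the best plans of its subsets (a global optimum from local optima). A minimal sketch, with a hypothetical cost function supplied by the caller:

```python
from itertools import combinations

def best_join_order(relations, cost_fn):
    """Exhaustive subset DP: best[S] = (cost, plan) of the cheapest tree
    joining the relation subset S. A plan is a relation name or a pair
    of sub-plans; cost_fn(plan_a, plan_b) prices joining two sub-plans."""
    best = {frozenset([r]): (0, r) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in combinations(relations, size):
            S = frozenset(subset)
            for left_size in range(1, size):
                for left in combinations(subset, left_size):
                    L = frozenset(left)
                    R = S - L
                    cl, pl = best[L]
                    cr, pr = best[R]
                    c = cl + cr + cost_fn(pl, pr)
                    if S not in best or c < best[S][0]:
                        best[S] = (c, (pl, pr))
    return best[frozenset(relations)]
```

The DP enumerates every split of every subset, which is exponential in the number of relations; the paper's heuristic trades that guarantee away by fixing sub-tree joins first and then joining the sub-tree results, reducing the search space at the cost of optimality.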

