Proteomics Standards Initiative Extended FASTA Format (PEFF)

DE NOVO SEQUENCING WITH LIMITED NUMBER OF POST-TRANSLATIONAL MODIFICATIONS PER PEPTIDE

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720013500078 ◽

2013 ◽

Vol 11 (04) ◽

pp. 1350007 ◽

Cited By ~ 6

Author(s):

LIN HE ◽

XI HAN ◽

BIN MA

Keyword(s):

De Novo ◽

Search Space ◽

De Novo Sequencing ◽

Peptide Sequence ◽

Post Translational Modification ◽

Post Translational Modifications ◽

Protein Databases ◽

Sequencing Algorithm ◽

Modified Peptides ◽

Proteomics Research

De novo sequencing derives the peptide sequence from a tandem mass spectrum without the assistance of protein databases. This analysis has been indispensable for the identification of novel or modified peptides in a biological sample. Currently, the speed of de novo sequencing algorithms is not heavily affected by the number of post-translational modification (PTM) types under consideration. However, the accuracy of the algorithms can be degraded due to the increased search space. Most peptides in proteomics research contain only a small number of PTMs per peptide, yet the types of PTMs can come from a large number of choices. Therefore, it is desirable to include a large number of PTM types in a de novo sequencing algorithm, yet to limit the number of PTM occurrences in each peptide to increase the accuracy. In this paper, we present an efficient de novo sequencing algorithm, DeNovoPTM, for such a purpose. The implemented software is downloadable from http://www.cs.uwaterloo.ca/~l22he/denovo_ptm.
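The effect of capping PTM occurrences per peptide on the search space can be illustrated with a small counting sketch. This is not the DeNovoPTM algorithm; it is a toy dynamic program over (remaining mass, PTMs still allowed), and the three-residue alphabet, the rounded integer masses, and the rule that any PTM may attach to any residue are simplifying assumptions made only for illustration.

```python
from functools import lru_cache

# Toy nominal residue masses in Da (rounded); real tools use accurate masses
# and the full amino acid alphabet.
RESIDUES = {"G": 57, "A": 71, "S": 87}
# Rounded PTM mass shifts; letting every PTM attach to every residue is an
# assumption of this sketch, not a biochemical claim.
PTM_SHIFTS = {"phospho": 80, "acetyl": 42}


def count_candidates(target_mass: int, max_ptms: int) -> int:
    """Count residue strings, with at most max_ptms modified residues,
    whose total (toy) mass equals target_mass."""

    @lru_cache(maxsize=None)
    def ways(mass: int, ptms_left: int) -> int:
        if mass == 0:
            return 1  # exactly filled the target mass
        if mass < 0:
            return 0  # overshot, not a valid candidate
        total = 0
        for res_mass in RESIDUES.values():
            # Next residue is unmodified.
            total += ways(mass - res_mass, ptms_left)
            # Next residue carries one PTM, if the budget allows.
            if ptms_left > 0:
                for shift in PTM_SHIFTS.values():
                    total += ways(mass - res_mass - shift, ptms_left - 1)
        return total

    return ways(target_mass, max_ptms)


if __name__ == "__main__":
    for k in range(4):
        print(k, count_candidates(400, k))
```

Running the loop shows the candidate count growing as the PTM budget k is raised, which is the search-space growth the abstract associates with reduced accuracy.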


Reinspection of a Clinical Proteomics Tumor Analysis Consortium (CPTAC) Dataset with Cloud Computing Reveals Abundant Post-Translational Modifications and Protein Sequence Variants

Cancers ◽

10.3390/cancers13205034 ◽

2021 ◽

Vol 13 (20) ◽

pp. 5034

Author(s):

Amol Prakash ◽

Lorne Taylor ◽

Manu Varkey ◽

Nate Hoxie ◽

Yassene Mohammed ◽

...

Keyword(s):

Cloud Computing ◽

Human Tumors ◽

Clinical Proteomics ◽

Sequence Variants ◽

Proteomic Profiling ◽

High Confidence ◽

Independent Evidence ◽

Post Translational Modifications ◽

Web Resource ◽

Proteomic Data

The Clinical Proteomic Tumor Analysis Consortium (CPTAC) has provided some of the most in-depth analyses of the phenotypes of human tumors ever constructed. Today, the majority of proteomic data analysis is still performed using software housed on desktop computers, which limits the number of sequence variants and post-translational modifications that can be considered. The original CPTAC studies limited the search for PTMs to only samples that were chemically enriched for those modified peptides. Similarly, the only sequence variants considered were those with strong evidence at the exon or transcript level. In this multi-institutional collaborative reanalysis, we utilized unbiased protein databases containing millions of human sequence variants in conjunction with hundreds of common post-translational modifications. Using these tools, we identified tens of thousands of high-confidence PTMs and sequence variants. We identified 4132 phosphorylated peptides in nonenriched samples, 93% of which were confirmed in the samples that were chemically enriched for phosphopeptides. In addition, our results cover 90% of the high-confidence variants reported by the original proteogenomics study, without the need for sample-specific next-generation sequencing. Finally, we report fivefold more somatic and germline variants that have independent evidence at the peptide level, including mutations in ERBB2 and BCAS1. In this reanalysis of CPTAC proteomic data with cloud computing, we present an openly available and searchable web resource of the highest-coverage proteomic profiling of human tumors described to date.


A distributed framework to investigate the entity relatedness problem in large RDF knowledge bases

10.5753/sbbd.2021.17871 ◽

2021 ◽

Author(s):

Javier Guillot Jiménez ◽

Luiz André P. Paes Leme ◽

Yenier Torres Izquierdo ◽

Angelo Batista Neves ◽

Marco A. Casanova

Keyword(s):

Knowledge Base ◽

Search Strategy ◽

Real Data ◽

Search Space ◽

Knowledge Bases ◽

Search Strategies ◽

Rdf Graph ◽

Distributed Framework ◽

Path Search ◽

Entity Relatedness

The entity relatedness problem refers to the question of exploring a knowledge base, represented as an RDF graph, to discover and understand how two entities are connected. This question can be addressed by implementing a path search strategy, which combines an entity similarity measure with an expansion limit, to reduce the path search space, and a path ranking measure, to order the relevant paths between a given pair of entities in the RDF graph. This paper first introduces DCoEPinKB, an in-memory distributed framework that addresses the entity relatedness problem. Then, it presents an evaluation of path search strategies using DCoEPinKB over real data collected from DBpedia. The results provide insights about the performance of the path search strategies.
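A path search strategy of the kind described above can be sketched on a toy graph. This is not the DCoEPinKB implementation: the graph, the Jaccard similarity over closed neighbourhoods, the breadth-first traversal, and the shortest-first ranking are all assumptions chosen only to show how an expansion limit prunes the search and how a ranking measure orders the surviving paths.

```python
from collections import deque

# Hypothetical undirected graph standing in for an RDF knowledge base.
GRAPH = {
    "A": {"B", "C", "D"},
    "B": {"A", "E"},
    "C": {"A", "E"},
    "D": {"A"},
    "E": {"B", "C", "F"},
    "F": {"E"},
}


def similarity(u: str, v: str) -> float:
    """Jaccard overlap of closed neighbourhoods (an assumed similarity measure)."""
    nu = GRAPH[u] | {u}
    nv = GRAPH[v] | {v}
    return len(nu & nv) / len(nu | nv)


def find_paths(source, target, expansion_limit=2, max_length=4):
    """Enumerate simple paths from source to target, expanding at each node only
    the expansion_limit neighbours most similar to it, then rank the paths."""
    paths = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            paths.append(path)
            continue
        if len(path) - 1 >= max_length:
            continue  # path already has max_length edges
        candidates = [n for n in GRAPH[node] if n not in path]
        # Most similar first; ties broken by name to keep results deterministic.
        candidates.sort(key=lambda n: (-similarity(node, n), n))
        for n in candidates[:expansion_limit]:
            queue.append(path + [n])
    # Assumed ranking measure: shorter paths first, then lexicographic order.
    paths.sort(key=lambda p: (len(p), p))
    return paths


if __name__ == "__main__":
    for limit in (1, 2, 3):
        print(limit, find_paths("A", "F", expansion_limit=limit))
```

On this toy graph a limit of 3 returns both connecting paths, a limit of 2 returns one, and a limit of 1 returns none, which illustrates the trade-off between a smaller search space and missed paths.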


Quantum Computing Approach for Alignment-Free Sequence Search and Classification

Multidisciplinary Computational Intelligence Techniques ◽

10.4018/978-1-4666-1830-5.ch017 ◽

2012 ◽

pp. 279-300

Author(s):

Rao M. Kotamarti ◽

Mitchell A. Thornton ◽

Margaret H. Dunham

Keyword(s):

Quantum Computing ◽

Quantum Algorithms ◽

Quantum Algorithm ◽

Search Space ◽

Classification Problem ◽

Sequence Classification ◽

Sequence Search ◽

Quantum Version ◽

Computing Approach ◽

Candidate Set

Many classes of algorithms that suffer from large complexities when implemented on conventional computers may be reformulated, resulting in greatly reduced complexity when implemented on quantum computers. The dramatic reductions in complexity for certain types of quantum algorithms, coupled with the computationally challenging problems in bioinformatics, motivate researchers to devise efficient quantum algorithms for sequence (DNA, RNA, protein) analysis. This chapter shows that the important sequence classification problem in bioinformatics is suitable for formulation as a quantum algorithm. This chapter leverages earlier research for sequence classification based on the Extensible Markov Model (EMM) and proposes a quantum computing alternative. The authors utilize sequence family profiles built using the EMM methodology, which is based on pre-counted word data for each sequence. Then a new method termed quantum seeding is proposed for generating a key based on high-frequency words. The key is applied in a quantum search based on Grover's algorithm to determine a candidate set of models, resulting in a significantly reduced search space. Given Z as a function of M models of size N, the quantum version of the seeding algorithm has a time complexity on the order of O(√Z), as opposed to O(Z) for the standard classical version, for large values of Z.
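The gap between a Grover-based search and a classical scan can be made concrete with the textbook formulas for an unstructured search with a single marked item among Z candidates. The sketch below does not model the chapter's quantum seeding method; it only evaluates the standard Grover iteration count and success probability, and the single-marked-item setting is an assumption.

```python
import math


def grover_iterations(z: int) -> int:
    """Iteration count that brings the success probability closest to 1
    when exactly one of z items is marked."""
    theta = math.asin(1 / math.sqrt(z))
    return max(0, round(math.pi / (4 * theta) - 0.5))


def success_probability(z: int, k: int) -> float:
    """Probability of measuring the marked item after k Grover iterations."""
    theta = math.asin(1 / math.sqrt(z))
    return math.sin((2 * k + 1) * theta) ** 2


if __name__ == "__main__":
    for z in (4, 1_000, 1_000_000):
        k = grover_iterations(z)
        # A classical scan needs about z/2 lookups on average, z in the worst case.
        print(f"Z={z}: Grover iterations={k}, "
              f"success={success_probability(z, k):.4f}, classical worst case={z}")
```

The iteration count grows roughly as (π/4)·√Z, so a million candidates need fewer than a thousand Grover iterations, compared with up to a million classical lookups.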


The Data Dictionary – A Controlled Vocabulary for Integrating Clinical Databases and Medical Knowledge Bases

Methods of Information in Medicine ◽

10.1055/s-0038-1635556 ◽

1989 ◽

Vol 28 (02) ◽

pp. 78-85 ◽

Cited By ~ 19

Author(s):

R. Linnarsson ◽

O. Wigertz

Keyword(s):

Data Model ◽

Medical Information ◽

Medical Knowledge ◽

Knowledge Bases ◽

Controlled Vocabulary ◽

Medical Information System ◽

Data Dictionary ◽

Clinical Databases ◽

Transfer Of Information ◽

Integrated Medical Information System

Abstract: The medical information systems of the future will probably include the entire medical record as well as a knowledge base, providing decision support for the physician during patient care. Data dictionaries will play an important role in integrating the medical knowledge bases with the clinical databases. This article presents an infological data model of such an integrated medical information system. Medical events, medical terms, and medical facts are the basic concepts that constitute the model. To allow the transfer of information and knowledge between systems, the data dictionary should be organized with regard to several common classification schemes of medical nomenclature.


Quantum Computing Approach for Alignment-Free Sequence Search and Classification

Bioinformatics ◽

10.4018/978-1-4666-3604-0.ch090 ◽

2013 ◽

pp. 1705-1726

Author(s):

Rao M. Kotamarti ◽

Mitchell A. Thornton ◽

Margaret H. Dunham

Keyword(s):

Quantum Computing ◽

Quantum Algorithms ◽

Quantum Algorithm ◽

Search Space ◽

Classification Problem ◽

Sequence Classification ◽

Sequence Search ◽

Quantum Version ◽

Computing Approach ◽

Candidate Set

Many classes of algorithms that suffer from large complexities when implemented on conventional computers may be reformulated, resulting in greatly reduced complexity when implemented on quantum computers. The dramatic reductions in complexity for certain types of quantum algorithms, coupled with the computationally challenging problems in bioinformatics, motivate researchers to devise efficient quantum algorithms for sequence (DNA, RNA, protein) analysis. This chapter shows that the important sequence classification problem in bioinformatics is suitable for formulation as a quantum algorithm. This chapter leverages earlier research for sequence classification based on the Extensible Markov Model (EMM) and proposes a quantum computing alternative. The authors utilize sequence family profiles built using the EMM methodology, which is based on pre-counted word data for each sequence. Then a new method termed quantum seeding is proposed for generating a key based on high-frequency words. The key is applied in a quantum search based on Grover's algorithm to determine a candidate set of models, resulting in a significantly reduced search space. Given Z as a function of M models of size N, the quantum version of the seeding algorithm has a time complexity on the order of O(√Z), as opposed to O(Z) for the standard classical version, for large values of Z.


CoEPinKB: A Framework to Understand the Connectivity of Entity Pairs in Knowledge Bases

10.5753/semish.2021.15811 ◽

2021 ◽

Author(s):

Javier Guillot Jiménez ◽

Luiz André P. Paes Leme ◽

Marco A. Casanova

Keyword(s):

Knowledge Base ◽

Empirical Evaluation ◽

Search Space ◽

Knowledge Bases ◽

Search Strategies ◽

Path Search ◽

Description Framework ◽

Entity Relatedness ◽

Ranking Measures ◽

Resource Description

A knowledge base, expressed using the Resource Description Framework (RDF), can be viewed as a graph whose nodes represent entities and whose edges denote relationships. The entity relatedness problem refers to the problem of discovering and understanding how two entities are related, directly or indirectly, that is, how they are connected by paths in a knowledge base. Strategies designed to solve the entity relatedness problem typically adopt an entity similarity measure to reduce the path search space and a path ranking measure to order and filter the list of paths returned. This paper presents a framework, called CoEPinKB, that supports the empirical evaluation of such strategies. The proposed framework allows combining entity similarity and path ranking measures to generate different path search strategies. The main goals of this paper are to describe the framework and present a performance evaluation of nine different path search strategies.


Reading the phosphorylation code: binding of the 14-3-3 protein to multivalent client phosphoproteins

Biochemical Journal ◽

10.1042/bcj20200084 ◽

2020 ◽

Vol 477 (7) ◽

pp. 1219-1225 ◽

Cited By ~ 4

Author(s):

Nikolai N. Sluchanko

Keyword(s):

Protein Function ◽

Intrinsically Disordered Protein ◽

Structural Basis ◽

Major Protein ◽

Post Translational Modifications ◽

Protein Protein Interaction ◽

Intrinsically Disordered ◽

Biochemical Journal ◽

Multiple Binding ◽

And Control

Many major protein–protein interaction networks are maintained by ‘hub’ proteins with multiple binding partners, where interactions are often facilitated by intrinsically disordered protein regions that undergo post-translational modifications, such as phosphorylation. Phosphorylation can directly affect protein function and control recognition by proteins that ‘read’ the phosphorylation code, re-wiring the interactome. The eukaryotic 14-3-3 proteins, which recognize multiple phosphoproteins, nicely exemplify these concepts. Although recent studies established the biochemical and structural basis for the interaction of the 14-3-3 dimers with several phosphorylated clients, understanding their assembly with partners phosphorylated at multiple sites represents a challenge. Suboptimal sequence context around the phosphorylated residue may reduce binding affinity, resulting in quantitative differences for distinct phosphorylation sites, making hierarchy and priority in their binding rather uncertain. Recently, Stevers et al. [Biochemical Journal (2017) 474: 1273–1287] undertook a remarkable attempt to untangle the mechanism of 14-3-3 dimer binding to leucine-rich repeat kinase 2 (LRRK2), which contains multiple candidate 14-3-3-binding sites and is mutated in Parkinson's disease. By using the protein-peptide binding approach, the authors systematically analyzed affinities for a set of LRRK2 phosphopeptides, alone or in combination, to a 14-3-3 protein and determined crystal structures for 14-3-3 complexes with selected phosphopeptides. This study addresses a long-standing question in 14-3-3 biology, unearthing a range of important details that are relevant for understanding binding mechanisms of other polyvalent proteins.


Dicarbonyl derived post-translational modifications: chemistry bridging biology and aging-related disease

Essays in Biochemistry ◽

10.1042/ebc20190057 ◽

2020 ◽

Vol 64 (1) ◽

pp. 97-110

Author(s):

Christian Sibbersen ◽

Mogens Johannsen

Keyword(s):

Mass Spectrometry ◽

Future Research ◽

Amino Acid Residues ◽

Post Translational Modification ◽

Dicarbonyl Compounds ◽

Protein Targets ◽

Post Translational Modifications ◽

Reactive Protein ◽

Glycation End Products ◽

Direct Mass Spectrometry

Abstract: In living systems, nucleophilic amino acid residues are prone to non-enzymatic post-translational modification by electrophiles. α-Dicarbonyl compounds are a special type of electrophile that can react irreversibly with lysine, arginine, and cysteine residues via complex mechanisms to form post-translational modifications known as advanced glycation end-products (AGEs). Glyoxal, methylglyoxal, and 3-deoxyglucosone are the major endogenous dicarbonyls, with methylglyoxal being the most well-studied. There are several routes that lead to the formation of dicarbonyl compounds, most originating from glucose and glucose metabolism, such as the non-enzymatic decomposition of glycolytic intermediates and fructosyl amines. Although dicarbonyls are removed continuously, mainly via the glyoxalase system, several conditions lead to an increase in dicarbonyl concentration and thereby AGE formation. AGEs have been implicated in diabetes and aging-related diseases, and for this reason the elucidation of their structure as well as protein targets is of great interest. Though the dicarbonyls and reactive protein side chains are of relatively simple nature, the structures of the adducts as well as their mechanism of formation are not that trivial. Furthermore, detection of sites of modification can be demanding, and current best practices rely on either direct mass spectrometry or various methods of enrichment based on antibodies or click chemistry followed by mass spectrometry. Future research into the structure of these adducts and protein targets of dicarbonyl compounds may improve the understanding of how the mechanisms of diabetes and aging-related physiological damage occur.


The challenge of detecting modifications on proteins

Essays in Biochemistry ◽

10.1042/ebc20190055 ◽

2020 ◽

Vol 64 (1) ◽

pp. 135-153 ◽

Cited By ~ 1

Author(s):

Lauren Elizabeth Smith ◽

Adelina Rogowska-Wrzesinska

Keyword(s):

Mass Spectrometry ◽

Protein Function ◽

Chemical Properties ◽

Lysine Acetylation ◽

Mass Determination ◽

Post Translational Modifications ◽

Site Specific ◽

The Common

Abstract: Post-translational modifications (PTMs) are integral to the regulation of protein function; characterising their role in this process is vital to understanding how cells work in both healthy and diseased states. Mass spectrometry (MS) facilitates the mass determination and sequencing of peptides, and thereby also the detection of site-specific PTMs. However, numerous challenges in this field persist. The diverse chemical properties, low abundance, labile nature and instability of many PTMs, in combination with the more practical issues of compatibility with MS and bioinformatics challenges, contribute to the arduous nature of their analysis. In this review, we present an overview of the established MS-based approaches for analysing PTMs and the common complications associated with their investigation, including examples of specific challenges focusing on phosphorylation, lysine acetylation and redox modifications.
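The link between mass determination and PTM detection can be shown with a minimal sketch: compare an observed peptide mass with the mass computed from the unmodified sequence and look up the difference among known PTM mass shifts. The function names, the 0.01 Da tolerance, and the three-entry PTM table are assumptions for illustration; the sketch handles a single modification only and does not localise it to a site, which in practice requires fragment-ion (MS/MS) data.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini

# Monoisotopic mass shifts of three common PTMs (a deliberately short list).
PTM_SHIFTS = {
    "phosphorylation": 79.96633,
    "acetylation": 42.01057,
    "oxidation": 15.99491,
}


def peptide_mass(sequence: str) -> float:
    """Monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in sequence) + WATER


def explain_mass_shift(sequence: str, observed_mass: float, tol: float = 0.01):
    """Return the names of single PTMs whose shift matches the observed
    difference from the unmodified mass within tol Da."""
    delta = observed_mass - peptide_mass(sequence)
    return [name for name, shift in PTM_SHIFTS.items() if abs(delta - shift) <= tol]


if __name__ == "__main__":
    seq = "SAMPLER"
    observed = peptide_mass(seq) + 79.96633  # simulate a phosphopeptide
    print(explain_mass_shift(seq, observed))
```

An empty result means the mass difference matches none of the listed shifts, either because the peptide is unmodified or because the modification is outside this small table.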
