Proteomics Standards Initiative Extended FASTA Format (PEFF)

2019 ◽  
Author(s):  
Pierre-Alain Binz ◽  
Jim Shofstahl ◽  
Juan Antonio Vizcaíno ◽  
Harald Barsnes ◽  
Robert J. Chalkley ◽  
...  

Abstract: Mass spectrometry-based proteomics enables the high-throughput identification and quantification of proteins, including sequence variants and post-translational modifications (PTMs), in biological samples. However, most workflows require that such variations be included in the search space used to analyze the data, and doing so remains challenging with most analysis tools. To facilitate the search for known sequence variants and PTMs, the Proteomics Standards Initiative (PSI) has designed and implemented the PSI Extended FASTA Format (PEFF). PEFF is based on the widely used FASTA format but adds a uniform mechanism for encoding substantially more metadata about the sequence collection as well as individual entries, including support for encoding known sequence variants, PTMs, and proteoforms. The format is very nearly backwards compatible, so existing FASTA parsers will require few or no changes to read PEFF files as FASTA files, although without supporting any of the extra capabilities of PEFF. PEFF is defined by a full specification document, controlled vocabulary terms, a set of example files, software libraries, and a file validator. Popular software and resources are starting to support PEFF, including the sequence search engine Comet and the knowledge bases neXtProt and UniProtKB. Widespread implementation of PEFF is expected to further enable proteogenomics and top-down proteomics applications by providing a standardized mechanism for encoding protein sequences and their known variations. All the related documentation, including the detailed file format specification and example files, is available at http://www.psidev.info/peff.
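As a rough illustration of the backwards-compatible design described above, the following sketch splits a PEFF-style description line into its identifier and key-value metadata. It assumes the `\Key=value` convention with parenthesised `(position|value)` items used in the PEFF specification; the accession, protein name, and variant entries in `EXAMPLE` are hypothetical, and file-level header lines are not handled. The specification at the URL above remains the authoritative grammar.

```python
import re

# Matches "\Key=value" pairs; a value runs until the next " \Key=" or end of line.
KEY_VALUE = re.compile(r"\\(\w+)=(.*?)(?=\s\\\w+=|$)")
# Matches one parenthesised item such as "(12|A)".
ITEM = re.compile(r"\(([^()]*)\)")

# Hypothetical entry, for illustration only.
EXAMPLE = r">nxp:NX_P00001-1 \PName=Example protein \Length=120 \VariantSimple=(12|A)(57|K)"


def parse_peff_header(line):
    """Return (identifier, fields) for one PEFF-style description line."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA/PEFF description line")
    # A plain FASTA parser would stop here and treat `rest` as free text.
    identifier, _, rest = line[1:].partition(" ")
    fields = {}
    for key, value in KEY_VALUE.findall(rest):
        value = value.strip()
        if value.startswith("("):
            # List-valued keys: split each "(a|b|...)" item on the pipe character.
            fields[key] = [tuple(item.split("|")) for item in ITEM.findall(value)]
        else:
            fields[key] = value
    return identifier, fields


if __name__ == "__main__":
    print(parse_peff_header(EXAMPLE))
```

Because the identifier and sequence lines keep the FASTA layout, a tool that ignores the backslash-prefixed keys can still read the entry as ordinary FASTA.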

2013 ◽  
Vol 11 (04) ◽  
pp. 1350007 ◽  
Author(s):  
LIN HE ◽  
XI HAN ◽  
BIN MA

De novo sequencing derives the peptide sequence from a tandem mass spectrum without the assistance of protein databases. This analysis has been indispensable for the identification of novel or modified peptides in a biological sample. Currently, the speed of de novo sequencing algorithms is not heavily affected by the number of post-translational modification (PTM) types under consideration. However, the accuracy of the algorithms can be degraded by the increased search space. Most peptides in proteomics research contain only a small number of PTMs per peptide, yet the types of PTMs can come from a large number of choices. Therefore, it is desirable to include a large number of PTM types in a de novo sequencing algorithm while limiting the number of PTM occurrences in each peptide to increase the accuracy. In this paper, we present an efficient de novo sequencing algorithm, DeNovoPTM, for this purpose. The implemented software is downloadable from http://www.cs.uwaterloo.ca/~l22he/denovo_ptm.
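The effect of capping PTM occurrences can be illustrated with a simple count. The sketch below is not the DeNovoPTM algorithm; it only estimates how many modified forms a single peptide has under the simplifying assumption that every PTM type may sit on any residue. The function name and parameters are hypothetical.

```python
from math import comb


def count_modified_forms(length, ptm_types, max_ptms):
    """Upper bound on modified forms of one peptide.

    Assumes (for illustration only) that each of `ptm_types` modifications
    can be placed on any of the `length` residues, with at most one
    modification per residue and at most `max_ptms` modified residues.
    """
    limit = min(max_ptms, length)
    # Choose i residues to modify, then one PTM type for each of them.
    return sum(comb(length, i) * ptm_types ** i for i in range(limit + 1))


if __name__ == "__main__":
    for cap in (1, 2, 10):
        print(cap, count_modified_forms(length=10, ptm_types=3, max_ptms=cap))
```

Under these assumptions, a 10-residue peptide with 3 PTM types has 31 forms with at most one PTM and 436 with at most two, against 1,048,576 when the count is unrestricted.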


Cancers ◽  
2021 ◽  
Vol 13 (20) ◽  
pp. 5034
Author(s):  
Amol Prakash ◽  
Lorne Taylor ◽  
Manu Varkey ◽  
Nate Hoxie ◽  
Yassene Mohammed ◽  
...  

The Clinical Proteomic Tumor Analysis Consortium (CPTAC) has provided some of the most in-depth analyses of the phenotypes of human tumors ever constructed. Today, the majority of proteomic data analysis is still performed using software housed on desktop computers, which limits the number of sequence variants and post-translational modifications that can be considered. The original CPTAC studies limited the search for PTMs to samples that were chemically enriched for those modified peptides. Similarly, the only sequence variants considered were those with strong evidence at the exon or transcript level. In this multi-institutional collaborative reanalysis, we utilized unbiased protein databases containing millions of human sequence variants in conjunction with hundreds of common post-translational modifications. Using these tools, we identified tens of thousands of high-confidence PTMs and sequence variants. We identified 4132 phosphorylated peptides in nonenriched samples, 93% of which were confirmed in the samples that were chemically enriched for phosphopeptides. In addition, our results cover 90% of the high-confidence variants reported by the original proteogenomics study, without the need for sample-specific next-generation sequencing. Finally, we report fivefold more somatic and germline variants that have independent evidence at the peptide level, including mutations in ERBB2 and BCAS1. In this reanalysis of CPTAC proteomic data with cloud computing, we present an openly available and searchable web resource of the highest-coverage proteomic profiling of human tumors described to date.


2021 ◽  
Author(s):  
Javier Guillot Jiménez ◽  
Luiz André P. Paes Leme ◽  
Yenier Torres Izquierdo ◽  
Angelo Batista Neves ◽  
Marco A. Casanova

The entity relatedness problem refers to the question of exploring a knowledge base, represented as an RDF graph, to discover and understand how two entities are connected. This question can be addressed by implementing a path search strategy, which combines an entity similarity measure, with an expansion limit, to reduce the path search space and a path ranking measure to order the relevant paths between a given pair of entities in the RDF graph. This paper first introduces DCoEPinKB, an in-memory distributed framework that addresses the entity relatedness problem. Then, it presents an evaluation of path search strategies using DCoEPinKB over real data collected from DBpedia. The results provide insights about the performance of the path search strategies.
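The general shape of such a path search strategy can be sketched as follows. This is not DCoEPinKB itself: the graph is a small in-memory adjacency map rather than an RDF store, Jaccard overlap of neighbour sets stands in for the entity similarity measure, and shorter-path-first stands in for the path ranking measure. All names and parameters are hypothetical.

```python
def jaccard(graph, a, b):
    """Similarity of two entities as the Jaccard index of their neighbour sets."""
    na, nb = graph.get(a, set()), graph.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


def find_paths(graph, source, target, max_len=3, expansion_limit=2):
    """Enumerate simple paths of at most `max_len` edges from source to target.

    At each node only the `expansion_limit` most promising neighbours are
    expanded: the target itself first, then neighbours by similarity to it.
    """
    paths, stack = [], [[source]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == target:
            paths.append(path)
            continue
        if len(path) - 1 >= max_len:
            continue
        candidates = [n for n in graph.get(node, set()) if n not in path]
        candidates.sort(key=lambda n: (n != target, -jaccard(graph, n, target), n))
        for n in candidates[:expansion_limit]:
            stack.append(path + [n])
    # Assumed ranking: shorter paths first, ties broken alphabetically.
    return sorted(paths, key=lambda p: (len(p), p))


if __name__ == "__main__":
    graph = {
        "A": {"B", "C", "D"}, "B": {"A", "E", "F"}, "C": {"A", "F"},
        "D": {"A"}, "E": {"B", "F"}, "F": {"B", "C", "E"},
    }
    for p in find_paths(graph, "A", "E"):
        print(" -> ".join(p))
```

In this toy graph the expansion limit prunes the dead-end neighbour `D` at the first step, which is the kind of search-space reduction the abstract attributes to the similarity measure.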


Author(s):  
Rao M. Kotamarti ◽  
Mitchell A. Thornton ◽  
Margaret H. Dunham

Many classes of algorithms that suffer from high complexity when implemented on conventional computers may be reformulated, resulting in greatly reduced complexity when implemented on quantum computers. The dramatic reductions in complexity for certain types of quantum algorithms, coupled with the computationally challenging problems in bioinformatics, motivate researchers to devise efficient quantum algorithms for sequence (DNA, RNA, protein) analysis. This chapter shows that the important sequence classification problem in bioinformatics is suitable for formulation as a quantum algorithm. It leverages earlier research on sequence classification based on the Extensible Markov Model (EMM) and proposes a quantum computing alternative. The authors utilize sequence family profiles built using the EMM methodology, which is based on pre-counted word data for each sequence. A new method termed quantum seeding is then proposed for generating a key based on high-frequency words. The key is applied in a quantum search based on Grover's algorithm to determine a candidate set of models, resulting in a significantly reduced search space. Given Z as a function of M models of size N, the quantum version of the seeding algorithm has a time complexity on the order of O(√Z), as opposed to O(Z) for the standard classical version, for large values of Z.
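The advantage of a Grover-based search comes from needing roughly the square root of the number of oracle calls that a linear scan needs. The following classical simulation is a sketch of textbook Grover search over Z items with one marked item, not the chapter's quantum seeding method; the function names are hypothetical.

```python
import math


def grover_iterations(z, marked=1):
    """Usual iteration count, about (pi/4) * sqrt(Z / marked)."""
    return math.floor(math.pi / 4 * math.sqrt(z / marked))


def simulate_grover(z, target):
    """Simulate Grover amplitudes classically; return P(measuring target)."""
    amps = [1 / math.sqrt(z)] * z              # uniform superposition
    for _ in range(grover_iterations(z)):
        amps[target] = -amps[target]           # oracle: flip the marked amplitude
        mean = sum(amps) / z
        amps = [2 * mean - a for a in amps]    # diffusion: inversion about the mean
    return amps[target] ** 2


if __name__ == "__main__":
    z = 64
    print("iterations:", grover_iterations(z))
    print("success probability:", simulate_grover(z, target=5))
```

For Z = 64 the sketch uses 6 iterations, whereas a classical scan inspects 32 items on average.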


1989 ◽  
Vol 28 (02) ◽  
pp. 78-85 ◽  
Author(s):  
R. Linnarsson ◽  
O. Wigertz

Abstract: The medical information systems of the future will probably include the entire medical record as well as a knowledge base, providing decision support for the physician during patient care. Data dictionaries will play an important role in integrating the medical knowledge bases with the clinical databases. This article presents an infological data model of such an integrated medical information system. Medical events, medical terms, and medical facts are the basic concepts that constitute the model. To allow the transfer of information and knowledge between systems, the data dictionary should be organized with regard to several common classification schemes of medical nomenclature.


2013 ◽  
pp. 1705-1726
Author(s):  
Rao M. Kotamarti ◽  
Mitchell A. Thornton ◽  
Margaret H. Dunham

Many classes of algorithms that suffer from high complexity when implemented on conventional computers may be reformulated, resulting in greatly reduced complexity when implemented on quantum computers. The dramatic reductions in complexity for certain types of quantum algorithms, coupled with the computationally challenging problems in bioinformatics, motivate researchers to devise efficient quantum algorithms for sequence (DNA, RNA, protein) analysis. This chapter shows that the important sequence classification problem in bioinformatics is suitable for formulation as a quantum algorithm. It leverages earlier research on sequence classification based on the Extensible Markov Model (EMM) and proposes a quantum computing alternative. The authors utilize sequence family profiles built using the EMM methodology, which is based on pre-counted word data for each sequence. A new method termed quantum seeding is then proposed for generating a key based on high-frequency words. The key is applied in a quantum search based on Grover's algorithm to determine a candidate set of models, resulting in a significantly reduced search space. Given Z as a function of M models of size N, the quantum version of the seeding algorithm has a time complexity on the order of O(√Z), as opposed to O(Z) for the standard classical version, for large values of Z.


2021 ◽  
Author(s):  
Javier Guillot Jiménez ◽  
Luiz André P. Paes Leme ◽  
Marco A. Casanova

A knowledge base, expressed using the Resource Description Framework (RDF), can be viewed as a graph whose nodes represent entities and whose edges denote relationships. The entity relatedness problem refers to the problem of discovering and understanding how two entities are related, directly or indirectly, that is, how they are connected by paths in a knowledge base. Strategies designed to solve the entity relatedness problem typically adopt an entity similarity measure to reduce the path search space and a path ranking measure to order and filter the list of paths returned. This paper presents a framework, called CoEPinKB, that supports the empirical evaluation of such strategies. The proposed framework allows combining entity similarity and path ranking measures to generate different path search strategies. The main goals of this paper are to describe the framework and present a performance evaluation of nine different path search strategies.


2020 ◽  
Vol 477 (7) ◽  
pp. 1219-1225 ◽  
Author(s):  
Nikolai N. Sluchanko

Many major protein–protein interaction networks are maintained by ‘hub’ proteins with multiple binding partners, where interactions are often facilitated by intrinsically disordered protein regions that undergo post-translational modifications, such as phosphorylation. Phosphorylation can directly affect protein function and control recognition by proteins that ‘read’ the phosphorylation code, re-wiring the interactome. The eukaryotic 14-3-3 proteins recognizing multiple phosphoproteins nicely exemplify these concepts. Although recent studies established the biochemical and structural basis for the interaction of the 14-3-3 dimers with several phosphorylated clients, understanding their assembly with partners phosphorylated at multiple sites represents a challenge. Suboptimal sequence context around the phosphorylated residue may reduce binding affinity, resulting in quantitative differences for distinct phosphorylation sites, making hierarchy and priority in their binding rather uncertain. Recently, Stevers et al. [Biochemical Journal (2017) 474: 1273–1287] undertook a remarkable attempt to untangle the mechanism of 14-3-3 dimer binding to leucine-rich repeat kinase 2 (LRRK2) that contains multiple candidate 14-3-3-binding sites and is mutated in Parkinson's disease. By using the protein-peptide binding approach, the authors systematically analyzed affinities for a set of LRRK2 phosphopeptides, alone or in combination, to a 14-3-3 protein and determined crystal structures for 14-3-3 complexes with selected phosphopeptides. This study addresses a long-standing question in the 14-3-3 biology, unearthing a range of important details that are relevant for understanding binding mechanisms of other polyvalent proteins.


2020 ◽  
Vol 64 (1) ◽  
pp. 97-110
Author(s):  
Christian Sibbersen ◽  
Mogens Johannsen

Abstract: In living systems, nucleophilic amino acid residues are prone to non-enzymatic post-translational modification by electrophiles. α-Dicarbonyl compounds are a special type of electrophile that can react irreversibly with lysine, arginine, and cysteine residues via complex mechanisms to form post-translational modifications known as advanced glycation end-products (AGEs). Glyoxal, methylglyoxal, and 3-deoxyglucosone are the major endogenous dicarbonyls, with methylglyoxal being the most well-studied. Several routes lead to the formation of dicarbonyl compounds, most originating from glucose and glucose metabolism, such as the non-enzymatic decomposition of glycolytic intermediates and fructosyl amines. Although dicarbonyls are removed continuously, mainly via the glyoxalase system, several conditions lead to an increase in dicarbonyl concentration and thereby AGE formation. AGEs have been implicated in diabetes and aging-related diseases, and for this reason the elucidation of their structure as well as their protein targets is of great interest. Though the dicarbonyls and reactive protein side chains are relatively simple, the structures of the adducts and their mechanisms of formation are not trivial. Furthermore, detection of sites of modification can be demanding, and current best practices rely on either direct mass spectrometry or various methods of enrichment based on antibodies or click chemistry followed by mass spectrometry. Future research into the structure of these adducts and the protein targets of dicarbonyl compounds may improve the understanding of how diabetes and aging-related physiological damage occur.


2020 ◽  
Vol 64 (1) ◽  
pp. 135-153 ◽  
Author(s):  
Lauren Elizabeth Smith ◽  
Adelina Rogowska-Wrzesinska

Abstract: Post-translational modifications (PTMs) are integral to the regulation of protein function; characterising their role in this process is vital to understanding how cells work in both healthy and diseased states. Mass spectrometry (MS) facilitates the mass determination and sequencing of peptides, and thereby also the detection of site-specific PTMs. However, numerous challenges in this field persist. The diverse chemical properties, low abundance, labile nature and instability of many PTMs, in combination with the more practical issues of compatibility with MS and bioinformatics challenges, contribute to the arduous nature of their analysis. In this review, we present an overview of the established MS-based approaches for analysing PTMs and the common complications associated with their investigation, including examples of specific challenges focusing on phosphorylation, lysine acetylation and redox modifications.
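MS-based PTM detection rests on the fact that each modification shifts the peptide mass by a characteristic amount. The sketch below computes monoisotopic peptide masses with and without a modification, using standard monoisotopic residue and modification masses rounded to five decimals. The function name, the `mods` mapping, and the example peptide are hypothetical and are not taken from the review.

```python
# Monoisotopic residue masses in daltons (standard values, rounded).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
# Mass shifts for three of the PTM classes named in the abstract.
PTM_SHIFT = {"phospho": 79.96633, "acetyl": 42.01057, "oxidation": 15.99491}


def peptide_mass(sequence, mods=None):
    """Monoisotopic mass of a neutral peptide.

    `mods` maps 1-based residue positions to PTM names in PTM_SHIFT.
    """
    mods = mods or {}
    mass = WATER + sum(MONO[aa] for aa in sequence)
    for position, name in mods.items():
        if not 1 <= position <= len(sequence):
            raise ValueError(f"position {position} is outside the peptide")
        mass += PTM_SHIFT[name]
    return mass


if __name__ == "__main__":
    plain = peptide_mass("SAK")
    phospho = peptide_mass("SAK", {1: "phospho"})
    print(round(plain, 5), round(phospho, 5), round(phospho - plain, 5))
```

A search engine looking for phosphorylation would, in effect, test whether an observed mass matches the unmodified value plus 79.96633 Da and then use fragment ions to localise the site.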

