Invariants of Frameshifted Variants

10.1101/684076 ◽

2019 ◽

Cited By ~ 2

Author(s):

Lukas Bartonek ◽

Daniel Braun ◽

Bojan Zagrovic

Keyword(s):

Physicochemical Properties ◽

Protein Sequence ◽

Pearson Correlation ◽

Protein Sequences ◽

Evolutionary Strategy ◽

Messenger Rnas ◽

Coding Sequences ◽

Universal Genetic Code ◽

Domains Of Life ◽

Sequence Profiles

Abstract: Frameshifts in protein coding sequences are widely perceived as resulting in either non-functional or even deleterious protein products. Indeed, frameshifts typically lead to markedly altered protein sequences and premature stop codons. By analyzing complete proteomes from all three domains of life, we demonstrate that, in contrast, several key physicochemical properties of protein sequences exhibit significant robustness against +1 and −1 frameshifts in their mRNA coding sequences. In particular, we show that hydrophobicity profiles of many protein sequences remain largely invariant upon frameshifting. For example, over 2900 human proteins exhibit a Pearson correlation coefficient between the hydrophobicity profiles of the original and the +1-frameshifted variants greater than 0.7, despite a median sequence identity between the two of only 6.5% in this group. We observe a similar effect for protein sequence profiles of affinity for certain nucleobases, their matching with the cognate mRNA nucleobase-density profiles as well as protein sequence profiles of intrinsic disorder. Finally, we show that frameshift invariance is directly embedded in the structure of the universal genetic code and may have contributed to shaping it. Our results suggest that frameshifting may be a powerful evolutionary mechanism for creating new proteins with vastly different sequences, yet similar physicochemical properties to the proteins they originate from.

Significance Statement: Genetic information stored in DNA is transcribed to messenger RNAs and then read in the process of translation to produce proteins. A frameshift in the reading frame at any stage of the process typically results in a significantly different protein sequence being produced and is generally assumed to be a source of detrimental errors that biological systems need to control. Here, we show that several essential properties of many protein sequences, such as their hydrophobicity profiles, remain largely unchanged upon frameshifts. This finding suggests that frameshifting could be an effective evolutionary strategy for generating novel protein sequences, which retain the functionally relevant physicochemical properties of the sequences they derive from.
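The comparison described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the Kyte-Doolittle hydropathy scale, the position-by-position pairing of residues, the skipping of positions where either frame reads a stop codon, and the optional moving-average window are all assumptions made for illustration, and the function names are hypothetical.

```python
from itertools import product
from math import sqrt

# Standard genetic code in NCBI order (bases T, C, A, G); '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Kyte-Doolittle hydropathy scale (assumed here; the abstract names no scale).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def translate(cds, offset=0):
    """Translate a DNA coding sequence starting at `offset`; stops become '*'."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(offset, len(cds) - 2, 3))

def smooth(values, window):
    """Moving average over `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length lists with at least 2 values")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant profile")
    return cov / sqrt(vx * vy)

def frameshift_correlation(cds, offset=1, window=1):
    """Correlate hydropathy profiles of the in-frame and offset translations.

    Positions where either translation has a stop codon are skipped
    (a simplifying assumption of this sketch).
    """
    original, shifted = translate(cds, 0), translate(cds, offset)
    pairs = [(KD[a], KD[b]) for a, b in zip(original, shifted)
             if a != "*" and b != "*"]
    xs = smooth([p[0] for p in pairs], window)
    ys = smooth([p[1] for p in pairs], window)
    return pearson(xs, ys)

toy_cds = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # made-up example sequence
print(translate(toy_cds), translate(toy_cds, 1))
print(frameshift_correlation(toy_cds))
```

Reading from offset 1 models a +1 frameshift. A real analysis would run this over whole proteomes and would need a decision on how to treat premature stop codons, which this sketch simply skips.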

Frameshifting preserves key physicochemical properties of proteins

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1911203117 ◽

2020 ◽

Vol 117 (11) ◽

pp. 5907-5912 ◽

Cited By ~ 8

Author(s):

Lukas Bartonek ◽

Daniel Braun ◽

Bojan Zagrovic

Keyword(s):

Physicochemical Properties ◽

Protein Sequence ◽

Protein Sequences ◽

Protein Coding ◽

Universal Genetic Code ◽

Altered Protein ◽

Human Proteins ◽

Domains Of Life ◽

Sequence Profiles ◽

Average Sequence Identity

Frameshifts in protein coding sequences are widely perceived as resulting in either nonfunctional or even deleterious protein products. Indeed, frameshifts typically lead to markedly altered protein sequences and premature stop codons. By analyzing complete proteomes from all three domains of life, we demonstrate that, in contrast, several key physicochemical properties of protein sequences exhibit significant robustness against +1 and −1 frameshifts. In particular, we show that hydrophobicity profiles of many protein sequences remain largely invariant upon frameshifting. For example, over 2,900 human proteins exhibit a Pearson’s correlation coefficient R between the hydrophobicity profiles of the original and the +1-frameshifted variants greater than 0.7, despite an average sequence identity between the two of only 6.5% in this group. We observe a similar effect for protein sequence profiles of affinity for certain nucleobases as well as protein sequence profiles of intrinsic disorder. Finally, analysis of significance and optimality demonstrates that frameshift stability is embedded in the structure of the universal genetic code and may have contributed to shaping it. Our results suggest that frameshifting may be a powerful evolutionary mechanism for creating new proteins with vastly different sequences, yet similar physicochemical properties to the proteins from which they originate.

Protein Sequence Comparison Based on Physicochemical Properties and the Position-Feature Energy Matrix

Scientific Reports ◽

10.1038/srep46237 ◽

2017 ◽

Vol 7 (1) ◽

Cited By ~ 8

Author(s):

Lulu Yu ◽

Yusen Zhang ◽

Ivan Gutman ◽

Yongtang Shi ◽

Matthias Dehmer

Keyword(s):

Physicochemical Properties ◽

Protein Sequence ◽

Sequence Comparison ◽

Protein Sequences ◽

Local Dynamic ◽

Order Information ◽

Energy Matrix ◽

Graph Energy ◽

Feature Based ◽

Protein Sequence Comparison

Abstract: We develop a novel position-feature-based model for protein sequences by employing physicochemical properties of 20 amino acids and the measure of graph energy. The method puts the emphasis on sequence order information and describes local dynamic distributions of sequences, from which one can get a characteristic B-vector. Afterwards, we apply the relative entropy to the B-vectors representing the sequences to measure their similarity/dissimilarity. The numerical results obtained in this study show that the proposed method leads to meaningful results compared with competitors such as Clustal W.
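The relative-entropy step mentioned in this abstract can be sketched as follows. The abstract does not say how the B-vectors are normalized or whether the measure is symmetrized, so this sketch assumes non-negative feature vectors that are rescaled to sum to 1 and uses the plain Kullback-Leibler divergence. The function names and the `eps` floor are hypothetical.

```python
from math import log

def normalize(vector):
    """Rescale a non-negative vector so its entries sum to 1."""
    total = sum(vector)
    if total <= 0:
        raise ValueError("vector must have a positive sum")
    return [v / total for v in vector]

def relative_entropy(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) in nats.

    Entries of q are floored at `eps` to avoid division by zero
    (an assumption of this sketch, not taken from the paper).
    """
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    p, q = normalize(p), normalize(q)
    return sum(pi * log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Two made-up feature vectors standing in for B-vectors.
print(relative_entropy([1, 1], [1, 3]))
```

A value of 0 means the two normalized vectors are identical, and larger values mean greater dissimilarity. The measure is not symmetric in its arguments.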

Ubiquitous Forbidden Order in R-group classified protein sequence of SARS-CoV-2 and other viruses

10.1101/2020.08.21.261289 ◽

2020 ◽

Author(s):

Pratibha ◽

C. Shaju ◽

Kamal

Keyword(s):

Amino Acids ◽

Protein Sequence ◽

Polypeptide Chain ◽

Random Sequence ◽

Protein Sequences ◽

Chemical Behavior ◽

Coding Sequences ◽

Linear Sequence ◽

Novel Method ◽

Insight Into

Abstract: Each amino acid in a polypeptide chain has a distinctive R-group associated with it. We report here a novel method of species characterization based upon the order of these R-group classified amino acids in the linear sequence of the side chains associated with the codon triplets. In an otherwise pseudo-random sequence, we search for forbidden combinations of kth order. We applied this method to analyze the available protein sequences of various viruses including SARS-CoV-2. We found that these ubiquitous forbidden orders (UFO) are unique to each of the viruses we analyzed. This unique structure of the viruses may provide an insight into viruses’ chemical behavior and the folding patterns of the proteins. This finding may have a broad significance for the analysis of coding sequences of species in general.
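The search for forbidden combinations of kth order can be illustrated with a short Python sketch. The abstract does not give the R-group classification used, so the five-class grouping below (aliphatic, aromatic, polar uncharged, positively charged, negatively charged) is an assumption, as are the one-character class labels and the function names.

```python
from itertools import product

# Assumed five-class R-group grouping; the paper's actual classes may differ.
R_GROUPS = {
    "A": "GAPVLIM",  # nonpolar aliphatic
    "R": "FYW",      # aromatic
    "P": "STCNQ",    # polar uncharged
    "+": "KRH",      # positively charged
    "-": "DE",       # negatively charged
}
CLASS_OF = {aa: label for label, members in R_GROUPS.items() for aa in members}

def classify(protein):
    """Map each residue to its R-group class label; unknown residues are dropped."""
    return "".join(CLASS_OF[aa] for aa in protein if aa in CLASS_OF)

def forbidden_words(protein, k):
    """Return all class words of length k that never occur in the sequence."""
    classes = classify(protein)
    observed = {classes[i:i + k] for i in range(len(classes) - k + 1)}
    possible = {"".join(word) for word in product(R_GROUPS, repeat=k)}
    return sorted(possible - observed)

print(classify("GASDEKF"))
print(len(forbidden_words("GASDEKF", 2)))
```

For a short sequence most words are absent simply by chance, so any real analysis would need long sequences and a comparison against random expectation before calling a word forbidden.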

A Study on Host Tropism Determinants of Influenza Virus Using Machine Learning

Current Bioinformatics ◽

10.2174/1574893614666191104160927 ◽

2020 ◽

Vol 15 (2) ◽

pp. 121-134 ◽

Cited By ~ 2

Author(s):

Eunmi Kwon ◽

Myeongji Cho ◽

Hayeon Kim ◽

Hyeon S. Son

Keyword(s):

Machine Learning ◽

Amino Acids ◽

Influenza Virus ◽

Random Forest ◽

Physicochemical Properties ◽

Protein Sequences ◽

Influenza Viruses ◽

Host Tropism ◽

Post Hoc ◽

Ha Protein

Background: The host tropism determinants of influenza virus, which cause changes in the host range and increase the likelihood of interaction with specific hosts, are critical for understanding the infection and propagation of the virus in diverse host species. Methods: Six types of protein sequences of influenza viral strains isolated from three classes of hosts (avian, human, and swine) were obtained. Random forest, naïve Bayes classification, and k-nearest neighbor algorithms were used for host classification. The Java language was used for sequence analysis programming and identifying host-specific position markers. Results: A machine learning technique was explored to derive the physicochemical properties of amino acids used in host classification and prediction. HA protein was found to play the most important role in determining host tropism of the influenza virus, and the random forest method yielded the highest accuracy in host prediction. Conserved amino acids that exhibited host-specific differences were also selected and verified, and they were found to be useful position markers for host classification. Finally, ANOVA and post-hoc testing revealed that the physicochemical properties of amino acids, comprising protein sequences combined with position markers, differed significantly among hosts. Conclusion: The host tropism determinants and position markers described in this study can be used in related research to classify, identify, and predict the hosts of influenza viruses that are currently susceptible or likely to be infected in the future.

Protein Sequence Classification with Improved Extreme Learning Machine Algorithms

BioMed Research International ◽

10.1155/2014/103054 ◽

2014 ◽

Vol 2014 ◽

pp. 1-12 ◽

Cited By ~ 51

Author(s):

Jiuwen Cao ◽

Lianglin Xiong

Keyword(s):

Extreme Learning Machine ◽

Protein Sequence ◽

Protein Sequences ◽

Activation Function ◽

Majority Voting ◽

Training Algorithms ◽

Sequence Classification ◽

Protein Sequence Classification ◽

Learning Machine ◽

Majority Voting Method

Precisely classifying a protein sequence from a large biological protein sequence database plays an important role in developing competitive pharmacological products. Conventional methods compare the unseen sequence with all the identified protein sequences and return the category index of the protein with the highest similarity score, which is usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using single-hidden-layer feedforward neural networks (SLFNs). The recent efficient extreme learning machine (ELM) and its variants are utilized as the training algorithms. The optimal pruned ELM (OP-ELM) is first employed for protein sequence classification in this paper. To further enhance the performance, an ensemble-based SLFN structure is constructed in which multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensembles. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble-based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the superiority of the proposed algorithms.
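The majority voting step that combines the ensemble's outputs can be sketched in a few lines of Python. The abstract does not describe tie handling, so this sketch assumes that ties go to the label seen first. The function name and the example labels are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label among the ensemble members' predictions.

    Ties are resolved in favor of the label that appears first
    (an assumption of this sketch).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    return Counter(predictions).most_common(1)[0][0]

# Made-up category labels from three hypothetical ensemble members.
print(majority_vote(["kinase", "globin", "kinase"]))
```

Using an odd number of ensemble members avoids ties in the two-class case, though ties remain possible with more classes.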

EightyDVec: a method for protein sequence similarity analysis using physicochemical properties of amino acids

Computer Methods in Biomechanics and Biomedical Engineering Imaging & Visualization ◽

10.1080/21681163.2021.1956369 ◽

2021 ◽

pp. 1-11

Author(s):

Ranjeet Kumar Rout ◽

Saiyed Umer ◽

Sabha Sheikh ◽

Sanchit Sindhwani ◽

Smitarani Pati

Keyword(s):

Amino Acids ◽

Physicochemical Properties ◽

Protein Sequence ◽

Sequence Similarity ◽

Similarity Analysis ◽

Protein Sequence Similarity ◽

Sequence Similarity Analysis

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.

Coupling Between Protein Level Selection and Codon Usage Optimization in the Evolution of Bacteria and Archaea

mBio ◽

10.1128/mbio.00956-14 ◽

2014 ◽

Vol 5 (2) ◽

Cited By ~ 25

Author(s):

Wenqi Ran ◽

David M. Kristensen ◽

Eugene V. Koonin

Keyword(s):

Codon Usage ◽

Protein Level ◽

Codon Usage Bias ◽

Protein Sequence ◽

Gc Content ◽

Protein Sequences ◽

Microbial Evolution ◽

Fine Tuning ◽

Selection For ◽

Genomic Gc Content

ABSTRACT: The relationship between the selection affecting codon usage and selection on protein sequences of orthologous genes in diverse groups of bacteria and archaea was examined by using the Alignable Tight Genome Clusters database of prokaryote genomes. The codon usage bias is generally low, with 57.5% of the gene-specific optimal codon frequencies (F_opt) being below 0.55. This apparent weak selection on codon usage contrasts with the strong purifying selection on amino acid sequences, with 65.8% of the gene-specific dN/dS ratios being below 0.1. For most of the genomes compared, a limited but statistically significant negative correlation between F_opt and dN/dS was observed, which is indicative of a link between selection on protein sequence and selection on codon usage. The strength of the coupling between the protein level selection and codon usage bias showed a strong positive correlation with the genomic GC content. Combined with previous observations on the selection for GC-rich codons in bacteria and archaea with GC-rich genomes, these findings suggest that selection for translational fine-tuning could be an important factor in microbial evolution that drives the evolution of genome GC content away from mutational equilibrium. This type of selection is particularly pronounced in slowly evolving, “high-status” genes. A significantly stronger link between the two aspects of selection is observed in free-living bacteria than in parasitic bacteria and in genes encoding metabolic enzymes and transporters than in informational genes. These differences might reflect the special importance of translational fine-tuning for the adaptability of gene expression to environmental changes. The results of this work establish the coupling between protein level selection and selection for translational optimization as a distinct and potentially important factor in microbial evolution.
IMPORTANCE Selection affects the evolution of microbial genomes at many levels, including both the structure of proteins and the regulation of their production. Here we demonstrate the coupling between the selection on protein sequences and the optimization of codon usage in a broad range of bacteria and archaea. The strength of this coupling varies over a wide range and strongly and positively correlates with the genomic GC content. The cause(s) of the evolution of high GC content is a long-standing open question, given the universal mutational bias toward AT. We propose that optimization of codon usage could be one of the key factors that determine the evolution of GC-rich genomes. This work establishes the coupling between selection at the level of protein sequence and at the level of codon choice optimization as a distinct aspect of genome evolution.

GEMPROT: visualization of the impact on the protein of the genetic variants found on each haplotype

Bioinformatics ◽

10.1093/bioinformatics/bty993 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2492-2494

Author(s):

Tania Cuppens ◽

Thomas E Ludwig ◽

Pascal Trouvé ◽

Emmanuelle Genin

Keyword(s):

Genetic Variants ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Supplementary Information ◽

Analysis Tool ◽

Functional Protein ◽

Key Players ◽

On Line ◽

The Impact

Summary: When analyzing sequence data, genetic variants are considered one by one, taking no account of whether or not they are found in the same individual. However, variant combinations might be key players in some diseases, as variants that are neutral on their own can become deleterious when associated together. GEMPROT is a new analysis tool that takes a phased VCF file and visualizes the consequences of the genetic variants on the protein. At the level of an individual, the program shows the variants on each of the two protein sequences and the Pfam functional protein domains. When data on several individuals are available, GEMPROT lists the haplotypes found in the sample and can compare the haplotype distributions between different sub-groups of individuals. By offering a global visualization of the gene with the genetic variants present, GEMPROT makes it possible to better understand the impact of combinations of genetic variants on the protein sequence. Availability and implementation: GEMPROT is freely available at https://github.com/TaniaCuppens/GEMPROT. An on-line version is also available at http://med-laennec.univ-brest.fr/GEMPROT/. Supplementary information: Supplementary data are available at Bioinformatics online.

Unique function words characterize genomic proteins

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1801182115 ◽

2018 ◽

Vol 115 (26) ◽

pp. 6703-6708 ◽

Cited By ~ 6

Author(s):

Andrea Scaiewicz ◽

Michael Levitt

Keyword(s):

Domain Architecture ◽

Protein Sequences ◽

Genomic Diversity ◽

Unique Function ◽

Function Word ◽

Sequence Motif ◽

Function Words ◽

Conserved Domain ◽

Sequence Profiles ◽

Multiple Domain

Between 2009 and 2016 the number of protein sequences from known species increased 10-fold from 8 million to 85 million. About 80% of these sequences contain at least one region recognized by the conserved domain architecture retrieval tool (CDART) as a sequence motif. Motifs provide clues to biological function but CDART often matches the same region of a protein by two or more profiles. Such synonyms complicate estimates of functional complexity. We do full-linkage clustering of redundant profiles by finding maximum disjoint cliques: Each cluster is replaced by a single representative profile to give what we term a unique function word (UFW). From 2009 to 2016, the number of sequence profiles used by CDART increased by 80%; the number of UFWs increased more slowly by 30%, indicating that the number of UFWs may be saturating. The number of sequences matched by a single UFW (sequences with single domain architectures) increased as slowly as the number of different words, whereas the number of sequences matched by a combination of two or more UFWs in sequences with multiple domain architectures (MDAs) increased at the same rate as the total number of sequences. This combinatorial arrangement of a limited number of UFWs in MDAs accounts for the genomic diversity of protein sequences. Although eukaryotes and prokaryotes use very similar sets of “words” or UFWs (57% shared), the “sentences” (MDAs) are different (1.3% shared).
