Relationships between amino acid sequences determined through optimum alignments, clustering, and specific distance patterns: application to a group of scorpion toxins

Hugh Tyson

doi:10.1139/g92-055

Genome ◽

10.1139/g92-055 ◽

1992 ◽

Vol 35 (2) ◽

pp. 360-371 ◽

Cited By ~ 6

Author(s):

Hugh Tyson

Keyword(s):

Amino Acid ◽

Diallel Analysis ◽

Distance Matrix ◽

Diallel Cross ◽

Amino Acid Sequences ◽

Scorpion Toxins ◽

Sequence Alignments ◽

Sequence Relationships ◽

The Impact ◽

Specific Distance

Optimum alignment in all pairwise combinations among a group of amino acid sequences generated a distance matrix. These distances were clustered to evaluate relationships among the sequences. The degree of relationship among sequences was also evaluated by calculating specific distances from the distance matrix and examining correlations between patterns of specific distances for pairs of sequences. The sequences examined were a group of 20 amino acid sequences of scorpion toxins originally published and analyzed by M.J. Dufton and H. Rochat in 1984. Alignment gap penalties were constant for all 190 pairwise sequence alignments and were chosen after assessing the impact of changing penalties on resultant distances. The total distances generated by the 190 pairwise sequence alignments were clustered using complete (farthest neighbour) linkage. The square, symmetrical input distance matrix is analogous to diallel cross data where reciprocal and parental values are absent. Diallel analysis methods provided analogues for the distance matrix to genetical specific combining abilities, namely specific distances between all sequence pairs that are independent of the average distances shown by individual sequences. Correlations of specific distance patterns, with transformation to modified z values and a stringent probability level, were used to delineate subgroups of related sequences. These were compared with complete linkage clustering results. Excellent agreement between the two approaches was found. Three originally outlying sequences were placed within the four new subgroups. Key words: sequence alignment, specific distances, sequence relationships.
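The idea of a specific distance can be illustrated with a short sketch. The abstract does not state the formulas used, so the code below assumes the standard diallel decomposition for a half-matrix without parental or reciprocal values (Griffing's method 4 style), in which each distance is modelled as a mean plus one general effect per sequence plus a pair-specific remainder. The function names `specific_distances` and `pattern_correlation` are illustrative, not taken from the paper, and the modified z transformation mentioned in the abstract is not reproduced.

```python
import math


def specific_distances(d):
    """Pair-specific remainders of a symmetric distance matrix d.

    Assumed model (not confirmed by the abstract):
        d[i][j] = mean + g[i] + g[j] + s[i][j],  i != j
    Diagonal entries of d are ignored, mirroring a diallel table
    with no parental values.
    """
    n = len(d)
    if n < 3:
        raise ValueError("at least three sequences are needed")
    # Row totals over off-diagonal entries, and the half-matrix grand total.
    totals = [sum(d[i][j] for j in range(n) if j != i) for i in range(n)]
    grand = sum(totals) / 2.0
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                s[i][j] = (d[i][j]
                           - (totals[i] + totals[j]) / (n - 2)
                           + 2.0 * grand / ((n - 1) * (n - 2)))
    return s


def pattern_correlation(s, i, j):
    """Pearson correlation between the specific-distance patterns of
    sequences i and j, taken over all other sequences k.

    Raises ZeroDivisionError if either pattern has zero variance.
    """
    ks = [k for k in range(len(s)) if k not in (i, j)]
    x = [s[i][k] for k in ks]
    y = [s[j][k] for k in ks]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Under this assumed model, a matrix whose distances are purely additive in per-sequence averages yields specific distances of zero, which is one way to read "independent of the average distances shown by individual sequences".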

Size and structure of the sequence space of repeat proteins

10.1101/635581 ◽

2019 ◽

Author(s):

Jacopo Marchi ◽

Ezequiel A. Galpern ◽

Rocio Espada ◽

Diego U. Ferreiro ◽

Aleksandra M. Walczak ◽

...

Keyword(s):

Amino Acid ◽

Protein Design ◽

Amino Acid Sequences ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Repeat Proteins ◽

The Impact ◽

New Strategies ◽

Amino Acid Conservation

Abstract: The coding space of protein sequences is shaped by evolutionary constraints set by requirements of function and stability. We show that the coding space of a given protein family (the total number of sequences in that family) can be estimated using models of maximum entropy trained on multiple sequence alignments of naturally occurring amino acid sequences. We analyzed and calculated the size of three abundant repeat protein families, whose members are large proteins made of many repetitions of conserved portions of ∼30 amino acids. While amino acid conservation at each position of the alignment explains most of the reduction of diversity relative to completely random sequences, we found that correlations between amino acid usage at different positions significantly impact that diversity. We quantified the impact of different types of correlations, functional and evolutionary, on sequence diversity. Analysis of the detailed structure of the coding space of the families revealed a rugged landscape, with many local energy minima of varying sizes with a hierarchical structure, reminiscent of frustrated energy landscapes of spin glasses in physics. This clustered structure indicates a multiplicity of subtypes within each family, and suggests new strategies for protein design.
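As a rough illustration of the conservation-only baseline mentioned in the abstract, the sketch below treats every alignment column as independent, sums the per-column Shannon entropies, and exponentiates the total to get an effective number of sequences. This is not the maximum entropy model of the paper: it deliberately ignores the correlations between positions that the authors report as significant. The function names are illustrative, and gap characters are simply counted as one more symbol.

```python
import math
from collections import Counter


def column_entropies(msa):
    """Shannon entropy (in nats) of each column of an alignment.

    msa is a list of equal-length strings; any character, including
    a gap symbol, is treated as a distinct state.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in msa)
        total = sum(counts.values())
        entropies.append(
            -sum((c / total) * math.log(c / total) for c in counts.values())
        )
    return entropies


def independent_site_size(msa):
    """Effective number of sequences under an independent-site model:
    the exponential of the summed column entropies."""
    return math.exp(sum(column_entropies(msa)))
```

For comparison, a completely random sequence of length L over 20 amino acids corresponds to 20^L sequences, so the gap between that figure and the value returned here reflects conservation alone.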

Tools for the Recognition of Sorting Signals and the Prediction of Subcellular Localization of Proteins From Their Amino Acid Sequences

Frontiers in Genetics ◽

10.3389/fgene.2020.607812 ◽

2020 ◽

Vol 11 ◽

Author(s):

Kenichiro Imai ◽

Kenta Nakai

Keyword(s):

Amino Acid ◽

Subcellular Localization ◽

Amino Acid Sequences ◽

Additional Information ◽

Sorting Signals ◽

Specific Alternative ◽

Cell Type Specific ◽

Future Direction ◽

The Impact ◽

New Algorithms

At the time of translation, nascent proteins are thought to be sorted to their final subcellular localization sites on the basis of parts of their amino acid sequences (i.e., sorting or targeting signals). Thus, it is of interest to recognize these signals computationally from the amino acid sequence of any given protein and to predict its final subcellular localization from such information, supplemented with additional information (e.g., k-mer frequency). This field has a long history, and many prediction tools have been released. Even in this era of proteomic atlases at the single-cell level, researchers continue to develop new algorithms, aiming, for example, at assessing the impact of disease-causing mutations or cell type-specific alternative splicing. In this article, we overview the entire field and discuss its future direction.

Effect of Sequence Padding on the Performance of Protein-Based Deep Learning Models

10.21203/rs.2.21336/v1 ◽

2020 ◽

Author(s):

Angela Lopez-del Rio ◽

Maria Martin ◽

Alexandre Perera-Lluna ◽

Rabie Saidi

Keyword(s):

Deep Learning ◽

Amino Acid ◽

Enzyme Commission ◽

Model Performance ◽

Enzyme Commission Number ◽

Amino Acid Sequences ◽

Learning Models ◽

Zero Padding ◽

The One ◽

The Impact

Abstract. Background: The use of raw amino acid sequences as input for protein-based deep learning models has gained popularity in recent years. This scheme requires handling proteins of different lengths, while deep learning models require same-shape input. To accomplish this, zeros are usually added to each sequence up to an established common length in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is still unknown. Results: We analysed the impact of different ways of padding the amino acid sequences in a hierarchical Enzyme Commission number prediction problem. Our results show that padding has an effect on model performance even when convolutional layers are involved. We propose and implement four novel ways of padding the amino acid sequences. Conclusions: The present study highlights the relevance of the step of padding the one-hot encoded amino acid sequences when building deep learning-based models for Enzyme Commission number prediction. The fact that this has an effect on model performance should raise awareness of the need to justify the details of this step in future work. The code of this analysis is available at https://github.com/b2slab/padding_benchmark.
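Zero-padding of one-hot encoded sequences, as described in the abstract, can be sketched in a few lines. The code below is a minimal illustration and is not taken from the linked repository: the alphabet ordering, the function names, and the `where` parameter are assumptions, and the two placements shown (zeros after or before the sequence) are generic options, not the four novel padding types the authors propose.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Encode a sequence as a list of 20-element 0/1 rows, one per residue."""
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1  # KeyError for non-standard residues
        rows.append(row)
    return rows


def zero_pad(encoded, length, where="post"):
    """Add all-zero rows until the encoding has `length` rows.

    where="post" appends the zeros after the sequence;
    where="pre" places them before it.
    """
    missing = length - len(encoded)
    if missing < 0:
        raise ValueError("sequence is longer than the target length")
    pad = [[0] * len(AMINO_ACIDS) for _ in range(missing)]
    if where == "post":
        return encoded + pad
    if where == "pre":
        return pad + encoded
    raise ValueError("where must be 'post' or 'pre'")
```

Both placements give the same input shape, so any difference in model performance between them comes from where the informative rows sit, which is the kind of effect the study examines.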

Molecular evolutionary analyses of the small and large subunits of ribulose-1,5-bisphosphate carboxylase/oxygenase

Canadian Journal of Botany ◽

10.1139/b92-092 ◽

1992 ◽

Vol 70 (4) ◽

pp. 715-723 ◽

Cited By ~ 4

Author(s):

J. J. Pasternak ◽

B. R. Glick

Keyword(s):

Amino Acid ◽

Molecular Evolution ◽

Large Subunit ◽

Distance Matrix ◽

Small Subunit ◽

Amino Acid Sequences ◽

Sequence Information ◽

Bisphosphate Carboxylase ◽

Tree Building ◽

Building Methods

The molecular evolution of the amino acid sequences of the mature small and large subunits of ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco) was determined. The dataset for each subunit consisted of sequences from 39 different taxa, of which 22 are represented with sequence information for both subunits. Phylogenetic trees were reconstructed using distance matrix, parsimony, and simultaneous alignment and phylogeny methods. For the small subunit, the latter two methods produced similar trees that differed from the topology of the distance matrix tree. For the large subunit, each of the three tree-building methods yielded a distinct tree. Except for the distance matrix small subunit tree, the tree-building methods produced topologies for the small and large subunit sequences from the nonflowering plant taxa that, for the most part, agree with current taxonomic schemes. With the full datasets, the lack of consistency both among the various trees and with conventional taxonomic relationships was most evident with the Rubisco sequences from angiosperms. It is unlikely that current tree-building methods will be able to reconstruct an unambiguous molecular evolution of either of the Rubisco subunits. Molecular trees, regardless of methodology, showed similar topologies for the small and large subunits from the 22 taxa from which both subunits have been sequenced, indicating that the subunits have changed to the same extent over time. In this case, similar trees were formed because only 4 of the 22 taxa were from dicots. Key words: ribulose-1,5-bisphosphate carboxylase/oxygenase, amino acid sequence, molecular evolution, phyletic trees.

Comparative Structures and Evolution of Vertebrate Carboxyl Ester Lipase (CEL) Genes and Proteins with a Major Role in Reverse Cholesterol Transport

Cholesterol ◽

10.1155/2011/781643 ◽

2011 ◽

Vol 2011 ◽

pp. 1-15 ◽

Cited By ~ 12

Author(s):

Roger S. Holmes ◽

Laura A. Cox

Keyword(s):

Amino Acid ◽

Phylogenetic Analyses ◽

Hydrolytic Enzyme ◽

Gene Families ◽

Repeat Sequence ◽

Amino Acid Sequences ◽

Sequence Alignments ◽

Tertiary Structures ◽

Lactating Mammary Gland ◽

Using Data

Bile-salt activated carboxylic ester lipase (CEL) is a major triglyceride, cholesterol ester and vitamin ester hydrolytic enzyme contained within pancreatic and lactating mammary gland secretions. Bioinformatic methods were used to predict the amino acid sequences, secondary and tertiary structures, and gene locations for CEL genes and encoded proteins using data from several vertebrate genome projects. A proline-rich and O-glycosylated 11-amino acid C-terminal repeat sequence (VNTR) previously reported for human and other higher primate CEL proteins was also observed for other eutherian mammalian CEL sequences examined. In contrast, opossum CEL contained a single C-terminal copy of this sequence, whereas CEL proteins from platypus, chicken, lizard, frog and several fish species lacked the VNTR sequence. Vertebrate CEL genes contained 11 coding exons. Evidence is presented for tandem duplicated CEL genes for the zebrafish genome. Vertebrate CEL protein subunits shared 53–97% sequence identities, and sequence alignments demonstrated identities for key CEL amino acid residues and conservation of predicted secondary and tertiary structures with those previously reported for human CEL. Phylogenetic analyses demonstrated the relationships and potential evolutionary origins of the vertebrate CEL family of genes, which were related to a nematode carboxylesterase (CES) gene and five mammalian CES gene families.

Genetic Diversity and Molecular Epidemiology of Circulating Respiratory Syncytial Virus in Central Taiwan, 2008–2017

Viruses ◽

10.3390/v14010032 ◽

2021 ◽

Vol 14 (1) ◽

pp. 32

Author(s):

Chun-Yi Lee ◽

Yu-Ping Fang ◽

Li-Chung Wang ◽

Teh-Ying Chou ◽

Hsin-Fu Liu

Keyword(s):

Respiratory Syncytial Virus ◽

Amino Acid ◽

Temporal Evolution ◽

Local Level ◽

Hypervariable Region ◽

Amino Acid Sequences ◽

Sequence Alignments ◽

Glycosylation Sites ◽

Syncytial Virus ◽

Central Taiwan

In this study, we investigated the molecular evolution and phylodynamics of respiratory syncytial virus (RSV) over 10 consecutive seasons (2008–2017) and the genetic variability of the RSV genotypes ON1 and BA in central Taiwan. The ectodomain region of the G gene was sequenced for genotyping. The nucleotide and deduced amino acid sequences of the second hypervariable region of the G protein in RSV ON1 and BA were analyzed. A total of 132 RSV-A and 81 RSV-B isolates were obtained. Phylogenetic analysis revealed that the NA1, ON1, and BA9 genotypes were responsible for the RSV epidemics in central Taiwan in the study period. For RSV-A, the NA1 genotype predominated during the 2008–2011 seasons. The ON1 genotype was first detected in 2011 and replaced NA1 after 2012. For RSV-B, the BA9 and BA10 genotypes cocirculated from 2008 to 2010, but the BA9 genotype has predominated since 2012. Amino acid sequence alignments revealed the continuous evolution of the G gene in the ectodomain region. The predicted N-glycosylation sites were relatively conserved in the ON1 (sites 237 and 318) and BA9 (sites 296 and 310) genotype strains. Our results contribute to the understanding and prediction of the temporal evolution of RSV at the local level.

Multimodal Phylogeny for Taxonomy: Integrating Information from Nucleotide and Amino Acid Sequences

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720007003065 ◽

2007 ◽

Vol 05 (05) ◽

pp. 1069-1085 ◽

Cited By ~ 3

Author(s):

Manuele Bicego ◽

Franco Dellaglio ◽

Giovanna E. Felis

Keyword(s):

Amino Acid ◽

Sequence Data ◽

Phylogenetic Analyses ◽

Distance Matrix ◽

Research Area ◽

Amino Acid Sequences ◽

Fast Method ◽

Multimodal Analysis ◽

Microbial Taxonomy ◽

Fusion Theory

The crucial role played by the analysis of microbial diversity in biotechnology-based innovations has increased interest in the microbial taxonomy research area. Phylogenetic sequence analyses have contributed significantly to the advances in this field, particularly in view of the large amount of sequence data collected in recent years. Phylogenetic analyses can be carried out on the basis of protein-encoding nucleotide sequences or the encoded amino acid sequences: the two approaches have different characteristics, although they start from two alternative representations of the same information. This complementarity could be exploited to achieve a multimodal phylogenetic scheme that integrates gene and protein information in order to produce a single final tree. This aspect has been poorly addressed in the literature. In this paper, we propose to integrate the two phylogenetic analyses using basic schemes derived from multimodality fusion theory (or multiclassifier systems theory), a well-founded and rigorous field whose effectiveness has already been demonstrated in other pattern recognition contexts. The proposed approach could be applied to distance matrix-based phylogenetic techniques (like neighbor joining), resulting in a smart and fast method. The proposed methodology has been tested in a real case involving sequences of some species of lactic acid bacteria. With this dataset, both nucleotide sequence-based and amino acid sequence-based phylogenetic analyses present some drawbacks, which are overcome with the multimodal analysis.
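One simple way to picture the fusion of two distance matrices is sketched below. The abstract does not specify which fusion schemes were used, so this example assumes the fixed combination rules common in the multiclassifier literature (mean, max, min) applied element by element after scaling each matrix by its largest entry. The function names, the scaling step, and the rule set are all assumptions made for illustration; the fused matrix would then be passed to a distance-based tree builder such as neighbor joining.

```python
def normalise(d):
    """Scale a distance matrix so that its largest entry is 1.

    Assumed preprocessing step, so that nucleotide and amino acid
    distances are on comparable scales before they are combined.
    """
    peak = max(max(row) for row in d)
    if peak == 0:
        raise ValueError("distance matrix contains only zeros")
    return [[value / peak for value in row] for row in d]


# Element-wise combination rules (illustrative choice).
RULES = {
    "mean": lambda a, b: (a + b) / 2.0,
    "max": max,
    "min": min,
}


def fuse(d_nuc, d_aa, rule="mean"):
    """Combine a nucleotide-based and an amino acid-based distance
    matrix into a single matrix using the named rule."""
    combine = RULES[rule]
    a, b = normalise(d_nuc), normalise(d_aa)
    return [[combine(x, y) for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]
```

Because each rule is applied to corresponding entries, the fused matrix stays symmetric with a zero diagonal whenever both inputs are.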

Alignment of nucleotide or amino acid sequences on microcomputers, using a modification of Sellers' (1974) algorithm which avoids the need for calculation of the complete distance matrix

Computer Methods and Programs in Biomedicine ◽

10.1016/0169-2607(85)90057-4 ◽

1985 ◽

Vol 21 (1) ◽

pp. 3-10 ◽

Cited By ~ 4

Author(s):

Hugh Tyson ◽

Bryan Haley

Keyword(s):

Amino Acid ◽

Distance Matrix ◽

Amino Acid Sequences

Demonstration of Equivalence of Generic Glatiramer Acetate and Copaxone®

Frontiers in Pharmacology ◽

10.3389/fphar.2021.760726 ◽

2021 ◽

Vol 12 ◽

Author(s):

Peter Lipsky ◽

Patrick T. Vallano ◽

Jeffrey Smith ◽

Walter Owens ◽

Daniel Snider ◽

...

Keyword(s):

Biological Activity ◽

Amino Acid ◽

Physicochemical Properties ◽

Glatiramer Acetate ◽

Drug Administration ◽

Food And Drug Administration ◽

Reaction Scheme ◽

Amino Acid Sequences ◽

The Us ◽

The Impact

The objective of the current work was to demonstrate the equivalence of Mylan’s glatiramer acetate (GA) to that of the reference product Copaxone® (COP) using the four criteria for active pharmaceutical ingredient sameness as established by the US Food and Drug Administration (FDA). The reaction scheme used to produce Mylan’s glatiramer acetate (MGA) was compared with that of COP, determined from publicly available literature. Comparative analyses of MGA and COP were performed for physicochemical properties such as amino acid composition and molecular weight distributions. Spectroscopic fingerprints were obtained using circular dichroism spectroscopy. Structural signatures for polymerization and depolymerization, including total diethylamine (DEA) content, relative proportions of DEA-adducted amino acids, and N- and C-terminal amino acid sequences, were probed with an array of highly sensitive analytical methods. Biological activity of the products was assessed using validated murine experimental autoimmune encephalomyelitis (EAE) models of multiple sclerosis. MGA is produced using the same fundamental reaction scheme as COP and was shown to have equivalent physicochemical properties and composition. Analyses of multiple structural signatures demonstrated equivalence of MGA and COP with regard to polymerization, depolymerization, and propagational shift. Examination of the impact on prevention and treatment of EAE demonstrated equivalence of MGA and COP with respect to both activity and toxicity, and thereby provided confirmatory evidence of sameness. A rigorous, multi-pronged comparison of MGA and COP produced using an equivalent fundamental reaction scheme demonstrated equivalent physicochemical properties, structural signatures for polymerization and depolymerization, and biological activity as evidenced by comparable effects in EAE. These studies demonstrate the equivalence of MGA and COP, establishing active ingredient sameness by the US Food and Drug Administration (FDA) criteria for GA, and provide compelling evidence that the FDA-approved generic MGA can be substituted for COP for the treatment of patients with relapsing-remitting MS.

Calculating site-specific evolutionary rates at the amino-acid or codon level yields similar rate estimates

10.7287/peerj.preprints.2739v1 ◽

2017 ◽

Author(s):

Dariya K. Sydykova ◽

Claus O Wilke

Keyword(s):

Amino Acid ◽

Conservation Score ◽

Amino Acid Level ◽

Sequence Divergence ◽

Similar Rate ◽

Amino Acid Sequences ◽

Evolutionary Rates ◽

Sequence Alignments ◽

Site Specific ◽

Relative Conservation

Site-specific evolutionary rates can be estimated from codon sequences or from amino-acid sequences. For codon sequences, the most popular methods use some variation of the dN/dS ratio. For amino-acid sequences, one widely used method is called Rate4Site, and it assigns a relative conservation score to each site in an alignment. How site-wise dN/dS values relate to Rate4Site scores is not known. Here we elucidate the relationship between these two rate measurements. We simulate sequences with known dN/dS, using either dN/dS models or mutation-selection models for simulation. We then infer Rate4Site scores on the simulated alignments, and we compare those scores to either true or inferred dN/dS values on the same alignments. We find that Rate4Site scores generally correlate well with true dN/dS, and the correlation strengths increase in alignments with higher sequence divergence and a larger number of taxa. Moreover, Rate4Site scores correlate nearly perfectly with inferred dN/dS values, even for small alignments with little divergence. Finally, we verify this relationship between Rate4Site and dN/dS in a variety of natural sequence alignments. We conclude that codon-level and amino-acid-level analysis frameworks are directly comparable and yield near-identical inferences.
