Relationships between amino acid sequences determined through optimum alignments, clustering, and specific distance patterns: application to a group of scorpion toxins

Genome ◽  
1992 ◽  
Vol 35 (2) ◽  
pp. 360-371 ◽  
Author(s):  
Hugh Tyson

Optimum alignment in all pairwise combinations among a group of amino acid sequences generated a distance matrix. These distances were clustered to evaluate relationships among the sequences. The degree of relationship among sequences was also evaluated by calculating specific distances from the distance matrix and examining correlations between patterns of specific distances for pairs of sequences. The sequences examined were a group of 20 amino acid sequences of scorpion toxins originally published and analyzed by M.J. Dufton and H. Rochat in 1984. Alignment gap penalties were constant for all 190 pairwise sequence alignments and were chosen after assessing the impact of changing penalties on resultant distances. The total distances generated by the 190 pairwise sequence alignments were clustered using complete (farthest neighbour) linkage. The square, symmetrical input distance matrix is analogous to diallel cross data where reciprocal and parental values are absent. Diallel analysis methods provided analogues for the distance matrix to genetical specific combining abilities, namely specific distances between all sequence pairs that are independent of the average distances shown by individual sequences. Correlation of specific distance patterns, with transformation to modified z values and a stringent probability level, was used to delineate subgroups of related sequences. These were compared with complete linkage clustering results. Excellent agreement between the two approaches was found. Three originally outlying sequences were placed within the four new subgroups. Key words: sequence alignment, specific distances, sequence relationships.
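The idea of a specific distance can be illustrated with a minimal sketch. The abstract does not state the formula, so this assumes the standard Griffing method 4 estimator of specific combining ability (a diallel with no parental and no reciprocal values), applied to a symmetric distance matrix. The function name `specific_distances` and the toy values are hypothetical and are not taken from the paper.

```python
def specific_distances(dist):
    """Return the matrix of specific distances for a symmetric distance matrix.

    Assumption: Griffing method 4 estimator, with the diagonal ignored:
        s_ij = d_ij - (T_i + T_j) / (n - 2) + 2G / ((n - 1)(n - 2))
    where T_i is the sum of row i (off-diagonal) and G is the sum over all
    unordered pairs. Each s_ij is what remains after removing the average
    distance shown by sequences i and j.
    """
    n = len(dist)
    if n < 3:
        raise ValueError("at least three sequences are needed")
    # Off-diagonal row totals T_i and grand total G over unordered pairs.
    row_totals = [sum(dist[i][j] for j in range(n) if j != i) for i in range(n)]
    grand_total = sum(row_totals) / 2.0
    spec = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                spec[i][j] = (
                    dist[i][j]
                    - (row_totals[i] + row_totals[j]) / (n - 2)
                    + 2.0 * grand_total / ((n - 1) * (n - 2))
                )
    return spec


if __name__ == "__main__":
    # Toy 4 x 4 distance matrix (made-up values, not data from the paper).
    toy = [
        [0, 3, 7, 8],
        [3, 0, 6, 9],
        [7, 6, 0, 2],
        [8, 9, 2, 0],
    ]
    for row in specific_distances(toy):
        print([round(v, 3) for v in row])
```

Under this estimator the specific distances in each row sum to zero, and a matrix in which every distance is just the sum of two per-sequence averages yields specific distances of zero throughout. That is the sense in which they are independent of the average distances of individual sequences.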

2019 ◽  
Author(s):  
Jacopo Marchi ◽  
Ezequiel A. Galpern ◽  
Rocio Espada ◽  
Diego U. Ferreiro ◽  
Aleksandra M. Walczak ◽  
...  

Abstract. The coding space of protein sequences is shaped by evolutionary constraints set by requirements of function and stability. We show that the coding space of a given protein family (the total number of sequences in that family) can be estimated using models of maximum entropy trained on multiple sequence alignments of naturally occurring amino acid sequences. We analyzed and calculated the size of three abundant repeat protein families, whose members are large proteins made of many repetitions of conserved portions of ∼30 amino acids. While amino acid conservation at each position of the alignment explains most of the reduction of diversity relative to completely random sequences, we found that correlations between amino acid usage at different positions significantly impact that diversity. We quantified the impact of different types of correlations, functional and evolutionary, on sequence diversity. Analysis of the detailed structure of the coding space of the families revealed a rugged landscape, with many local energy minima of varying sizes with a hierarchical structure, reminiscent of frustrated energy landscapes of spin glasses in physics. This clustered structure indicates a multiplicity of subtypes within each family, and suggests new strategies for protein design.
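As a rough illustration of how per-position conservation alone reduces the coding space, the sketch below computes the entropy of an independent-site model from a toy alignment. This is only the baseline without correlations between positions, not the maximum entropy model used in the paper. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropies(alignment):
    """Shannon entropy (in nats) of each alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment)
        total = sum(counts.values())
        entropies.append(
            -sum((c / total) * math.log(c / total) for c in counts.values())
        )
    return entropies


def independent_site_log_size(alignment):
    """Log of the effective number of sequences if positions were independent.

    Assumption: the effective family size is exp(sum of column entropies);
    correlations between positions are ignored in this baseline.
    """
    return sum(column_entropies(alignment))


if __name__ == "__main__":
    toy = ["ACD", "ACE", "AGD", "AGE"]  # made-up alignment
    log_size = independent_site_log_size(toy)
    random_log_size = len(toy[0]) * math.log(20)  # fully random sequences
    print(math.exp(log_size), math.exp(random_log_size))
```

Comparing `log_size` with `L * ln(20)` gives the reduction due to conservation alone. A model that also captures correlations between positions can only lower the entropy further.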


2020 ◽  
Vol 11 ◽  
Author(s):  
Kenichiro Imai ◽  
Kenta Nakai

At the time of translation, nascent proteins are thought to be sorted into their final subcellular localization sites, based on parts of their amino acid sequences (i.e., sorting or targeting signals). Thus, it is interesting to computationally recognize these signals from the amino acid sequences of any given proteins and to predict their final subcellular localization with such information, supplemented with additional information (e.g., k-mer frequency). This field has a long history and many prediction tools have been released. Even in this era of proteomic atlases at the single-cell level, researchers continue to develop new algorithms, aiming at assessing the impact of disease-causing mutations or cell type-specific alternative splicing, for example. In this article, we overview the entire field and discuss its future direction.


2020 ◽  
Author(s):  
Angela Lopez-del Rio ◽  
Maria Martin ◽  
Alexandre Perera-Lluna ◽  
Rabie Saidi

Abstract. Background: The use of raw amino acid sequences as input for protein-based deep learning models has gained popularity in recent years. This scheme requires handling proteins of different lengths, while deep learning models require same-shape input. To accomplish this, zeros are usually added to each sequence up to an established common length in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is still unknown. Results: We analysed the impact of different ways of padding the amino acid sequences in a hierarchical Enzyme Commission number prediction problem. Our results show that padding has an effect on model performance even when convolutional layers are involved. We propose and implement four novel types of padding for amino acid sequences. Conclusions: The present study highlights the relevance of the step of padding the one-hot encoded amino acid sequences when building deep learning-based models for Enzyme Commission number prediction. The fact that this has an effect on model performance should raise awareness of the need to justify the details of this step in future work. The code of this analysis is available at https://github.com/b2slab/padding_benchmark.
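The zero-padding step described above can be sketched as follows. This is not the code from the linked repository. The function names are hypothetical, and the "start" option is included only to show that the position of the padding is a choice; the abstract does not name the four padding types the authors propose.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Encode a sequence as a list of 20-element 0/1 rows, one per residue."""
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1  # raises KeyError for non-standard residues
        rows.append(row)
    return rows


def zero_pad(encoded, target_len, where="end"):
    """Pad an encoded sequence with all-zero rows up to target_len.

    where="end" appends the zero rows (conventional zero-padding);
    where="start" prepends them (illustrative alternative).
    """
    if len(encoded) > target_len:
        raise ValueError("sequence is longer than the target length")
    pad = [[0] * len(AMINO_ACIDS) for _ in range(target_len - len(encoded))]
    if where == "end":
        return encoded + pad
    if where == "start":
        return pad + encoded
    raise ValueError("where must be 'end' or 'start'")


if __name__ == "__main__":
    padded = zero_pad(one_hot("ACD"), 5)
    print(len(padded), len(padded[0]))
```

Both options produce inputs of the same shape, but the real residues occupy different rows, which is one way the padding scheme can affect what a model sees.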


1992 ◽  
Vol 70 (4) ◽  
pp. 715-723 ◽  
Author(s):  
J. J. Pasternak ◽  
B. R. Glick

The molecular evolution of the amino acid sequences of the mature small and large subunits of ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco) was determined. The dataset for each subunit consisted of sequences from 39 different taxa of which 22 are represented with sequence information for both subunits. Phylogenetic trees were reconstructed using distance matrix, parsimony and simultaneous alignment and phylogeny methods. For the small subunit, the latter two methods produced similar trees that differed from the topology of the distance matrix tree. For the large subunit, each of the three tree-building methods yielded a distinct tree. Except for the distance matrix small subunit tree, the tree-building methods produced topologies for the small and large subunit sequences from the nonflowering plant taxa that, for the most part, agree with current taxonomic schemes. With the full datasets, the lack of consistency both among the various trees and with conventional taxonomic relationships was most evident with the Rubisco sequences from angiosperms. It is unlikely that current tree-building methods will be able to reconstruct an unambiguous molecular evolution of either of the Rubisco subunits. Molecular trees, regardless of methodology, showed similar topologies for the small and large subunits from the 22 taxa from which both subunits have been sequenced, indicating that the subunits have changed to the same extent over time. In this case, similar trees were formed because only 4 of the 22 taxa were from dicots. Key words: ribulose-1,5-bisphosphate carboxylase/oxygenase, amino acid sequence, molecular evolution, phyletic trees.


Cholesterol ◽  
2011 ◽  
Vol 2011 ◽  
pp. 1-15 ◽  
Author(s):  
Roger S. Holmes ◽  
Laura A. Cox

Bile-salt activated carboxylic ester lipase (CEL) is a major triglyceride, cholesterol ester and vitamin ester hydrolytic enzyme contained within pancreatic and lactating mammary gland secretions. Bioinformatic methods were used to predict the amino acid sequences, secondary and tertiary structures and gene locations for CEL genes, and encoded proteins using data from several vertebrate genome projects. A proline-rich and O-glycosylated 11-amino acid C-terminal repeat sequence (VNTR) previously reported for human and other higher primate CEL proteins was also observed for other eutherian mammalian CEL sequences examined. In contrast, opossum CEL contained a single C-terminal copy of this sequence whereas CEL proteins from platypus, chicken, lizard, frog and several fish species lacked the VNTR sequence. Vertebrate CEL genes contained 11 coding exons. Evidence is presented for tandem duplicated CEL genes for the zebrafish genome. Vertebrate CEL protein subunits shared 53–97% sequence identities; demonstrated sequence alignments and identities for key CEL amino acid residues; and conservation of predicted secondary and tertiary structures with those previously reported for human CEL. Phylogenetic analyses demonstrated the relationships and potential evolutionary origins of the vertebrate CEL family of genes which were related to a nematode carboxylesterase (CES) gene and five mammalian CES gene families.


Viruses ◽  
2021 ◽  
Vol 14 (1) ◽  
pp. 32
Author(s):  
Chun-Yi Lee ◽  
Yu-Ping Fang ◽  
Li-Chung Wang ◽  
Teh-Ying Chou ◽  
Hsin-Fu Liu

In this study, we investigated the molecular evolution and phylodynamics of respiratory syncytial virus (RSV) over 10 consecutive seasons (2008–2017) and the genetic variability of the RSV genotypes ON1 and BA in central Taiwan. The ectodomain region of the G gene was sequenced for genotyping. The nucleotide and deduced amino acid sequences of the second hypervariable region of the G protein in RSV ON1 and BA were analyzed. A total of 132 RSV-A and 81 RSV-B isolates were obtained. Phylogenetic analysis revealed that the NA1, ON1, and BA9 genotypes were responsible for the RSV epidemics in central Taiwan in the study period. For RSV-A, the NA1 genotype predominated during the 2008–2011 seasons. The ON1 genotype was first detected in 2011 and replaced NA1 after 2012. For RSV-B, the BA9 and BA10 genotypes cocirculated from 2008 to 2010, but the BA9 genotype has predominated since 2012. Amino acid sequence alignments revealed the continuous evolution of the G gene in the ectodomain region. The predicted N-glycosylation sites were relatively conserved in the ON1 (site 237 and 318) and BA9 (site 296 and 310) genotype strains. Our results contribute to the understanding and prediction of the temporal evolution of RSV at the local level.


2007 ◽  
Vol 05 (05) ◽  
pp. 1069-1085 ◽  
Author(s):  
MANUELE BICEGO ◽  
FRANCO DELLAGLIO ◽  
GIOVANNA E. FELIS

The crucial role played by the analysis of microbial diversity in biotechnology-based innovations has increased the interest in the microbial taxonomy research area. Phylogenetic sequence analyses have contributed significantly to the advances in this field, also in view of the large amount of sequence data collected in recent years. Phylogenetic analyses could be realized on the basis of protein-encoding nucleotide sequences or encoded amino acid molecules: these two mechanisms present different peculiarities, still starting from two alternative representations of the same information. This complementarity could be exploited to achieve a multimodal phylogenetic scheme that is able to integrate gene and protein information in order to realize a single final tree. This aspect has been poorly addressed in the literature. In this paper, we propose to integrate the two phylogenetic analyses using basic schemes derived from the multimodality fusion theory (or multiclassifier systems theory), a well-founded and rigorous branch whose power has already been demonstrated in other pattern recognition contexts. The proposed approach could be applied to distance matrix–based phylogenetic techniques (like neighbor joining), resulting in a smart and fast method. The proposed methodology has been tested in a real case involving sequences of some species of lactic acid bacteria. With this dataset, both nucleotide sequence– and amino acid sequence–based phylogenetic analyses present some drawbacks, which are overcome with the multimodal analysis.
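One simple way such a fusion could work for distance matrix–based methods is sketched below. The abstract does not specify which fusion rules were used, so the max-normalization and the mean rule here are assumptions chosen for illustration, and the function names and toy matrices are hypothetical.

```python
def normalize(matrix):
    """Scale a distance matrix so that its largest entry is 1."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        return [list(row) for row in matrix]
    return [[value / peak for value in row] for row in matrix]


def fuse_mean(nucleotide_dist, protein_dist):
    """Fuse two distance matrices over the same taxa with the mean rule.

    Assumption: both matrices are square, share the same taxon order, and
    are put on a common scale by max-normalization before averaging.
    """
    if len(nucleotide_dist) != len(protein_dist):
        raise ValueError("matrices must cover the same taxa")
    a, b = normalize(nucleotide_dist), normalize(protein_dist)
    n = len(a)
    return [[(a[i][j] + b[i][j]) / 2.0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Made-up distances for three taxa.
    nuc = [[0, 2, 4], [2, 0, 2], [4, 2, 0]]
    prot = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]
    for row in fuse_mean(nuc, prot):
        print(row)
```

The fused matrix can then be passed to any distance-based tree-building method, such as neighbor joining, to obtain a single tree.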


2021 ◽  
Vol 12 ◽  
Author(s):  
Peter Lipsky ◽  
Patrick T. Vallano ◽  
Jeffrey Smith ◽  
Walter Owens ◽  
Daniel Snider ◽  
...  

The objective of the current work was to demonstrate the equivalence of Mylan’s glatiramer acetate (GA) to that of the reference product Copaxone® (COP) using the four criteria for active pharmaceutical ingredient sameness as established by the US Food and Drug Administration (FDA). The reaction scheme used to produce Mylan’s glatiramer acetate (MGA) was compared with that of COP, determined from publicly available literature. Comparative analyses of MGA and COP were performed for physicochemical properties such as amino acid composition and molecular weight distributions. Spectroscopic fingerprints were obtained using circular dichroism spectroscopy. Structural signatures for polymerization and depolymerization including total diethylamine (DEA) content, relative proportions of DEA-adducted amino acids, and N- and C-terminal amino acid sequences were probed with an array of highly sensitive analytical methods. Biological activity of the products was assessed using validated murine experimental autoimmune encephalomyelitis (EAE) models of multiple sclerosis. MGA is produced using the same fundamental reaction scheme as COP and was shown to have equivalent physicochemical properties and composition. Analyses of multiple structural signatures demonstrated equivalence of MGA and COP with regard to polymerization, depolymerization, and propagational shift. Examination of the impact on prevention and treatment of EAE demonstrated equivalence of MGA and COP with respect to both activity and toxicity, and thereby provided confirmatory evidence of sameness. A rigorous, multi-pronged comparison of MGA and COP produced using an equivalent fundamental reaction scheme demonstrated equivalent physicochemical properties, structural signatures for polymerization and depolymerization, and biological activity as evidenced by comparable effects in EAE. These studies demonstrate the equivalence of MGA and COP, establishing active ingredient sameness by the US Food and Drug Administration (FDA) criteria for GA, and provide compelling evidence that the FDA-approved generic MGA can be substituted for COP for the treatment of patients with relapsing-remitting MS.


2017 ◽  
Author(s):  
Dariya K. Sydykova ◽  
Claus O Wilke

Site-specific evolutionary rates can be estimated from codon sequences or from amino-acid sequences. For codon sequences, the most popular methods use some variation of the dN/dS ratio. For amino-acid sequences, one widely-used method is called Rate4Site, and it assigns a relative conservation score to each site in an alignment. How site-wise dN/dS values relate to Rate4Site scores is not known. Here we elucidate the relationship between these two rate measurements. We simulate sequences with known dN/dS, using either dN/dS models or mutation–selection models for simulation. We then infer Rate4Site scores on the simulated alignments, and we compare those scores to either true or inferred dN/dS values on the same alignments. We find that Rate4Site scores generally correlate well with true dN/dS, and the correlation strengths increase in alignments with higher sequence divergence and higher number of taxa. Moreover, Rate4Site scores correlate nearly perfectly with inferred dN/dS values, even for small alignments with little divergence. Finally, we verify this relationship between Rate4Site and dN/dS in a variety of natural sequence alignments. We conclude that codon-level and amino-acid-level analysis frameworks are directly comparable and yield near-identical inferences.
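The comparison of the two per-site rate measures comes down to correlating two vectors of equal length, one value per alignment site. The sketch below computes a Pearson correlation in plain Python. The abstract does not say which correlation coefficient the authors used, so that choice is an assumption, and the function name and the numbers are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of per-site rates."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 sites")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-site values, not results from the paper.
    dn_ds = [0.05, 0.10, 0.40, 0.80, 0.20]
    rate4site = [0.3, 0.5, 1.2, 2.1, 0.9]
    print(round(pearson(dn_ds, rate4site), 3))
```

Because Rate4Site scores are relative, only the correlation across sites is comparable between the two measures, not the absolute values.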

