The graphical representation of protein sequences based on the physicochemical properties and its applications

Ping-An He; Yan-Ping Zhang; Yu-Hua Yao; Yi-Fa Tang; Xu-Ying Nan

doi:10.1002/jcc.21501

A Generalized Iterative Map for Analysis of Protein Sequences

Combinatorial Chemistry & High Throughput Screening; doi:10.2174/1386207323666201012142318; 2020; Vol 23

Author(s): Jiahe Huang; Qi Dai; Yuhua Yao; Ping-An He

Keyword(s): Amino Acids; Amino Acid; Correlation Analysis; Physicochemical Properties; Graphical Representation; Protein Sequences; Biological Sequences; Alignment Method; Alignment Free; Comparison Results

Aim and Objective: Comparing the similarity of biological sequences is an important task in bioinformatics. Methods for doing so fall into two classes: sequence alignment methods and alignment-free methods. Graphical representation of biological sequences is an alignment-free method that provides a tool for analyzing and visualizing the sequences. In this article, a generalized iterative map of protein sequences is proposed for analyzing the similarities of biological sequences. Materials and Methods: Based on the normalized physicochemical indexes of the 20 amino acids, each amino acid is mapped to a point in 5D space. A generalized iterative function system is introduced to define a generalized iterative map of protein sequences, which not only reflects various physicochemical properties of amino acids but also allows a different compression ratio for each component of the map. Several properties are proved to illustrate the advantages of the generalized iterative map. A mathematical description of the map is proposed for comparing the similarities and dissimilarities of protein sequences. With this method, similarities/dissimilarities were compared among the ND5 protein sequences, as well as the ND6 protein sequences, of ten different species. Results: Using correlation analysis, the ClustalW results were compared with our similarity/dissimilarity results and with those of other graphical representations to show the utility of our approach. The comparison shows that our approach correlates better with ClustalW for all species than the other approaches do, which illustrates its effectiveness. Conclusion: The two examples show that our method performs well in the similarity/dissimilarity analysis of protein sequences and does not require complex computation.
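The abstract does not state the map's formula, so the following is only a minimal sketch of a chaos-game-style iterative map under stated assumptions: each residue pulls the current point toward that residue's coordinates, and every coordinate has its own compression ratio. The names `iterative_map` and `TOY_POINTS`, the starting point, and the 2D placeholder coordinates are hypothetical. The paper uses normalized physicochemical indexes in 5D, which are not reproduced here.

```python
from typing import Dict, List, Sequence


def iterative_map(
    seq: str,
    points: Dict[str, Sequence[float]],
    ratios: Sequence[float],
) -> List[List[float]]:
    """Return the trajectory of a contraction map driven by a sequence.

    Assumed update rule (not taken from the paper):
        x_i[k] = x_{i-1}[k] + ratios[k] * (points[residue][k] - x_{i-1}[k])
    """
    current = [0.5] * len(ratios)  # assumed starting point: centre of the unit cube
    trajectory = []
    for residue in seq:
        target = points[residue]
        # Move each component toward the residue's point by its own ratio.
        current = [c + r * (t - c) for c, t, r in zip(current, target, ratios)]
        trajectory.append(current)
    return trajectory


# Placeholder coordinates for two residues only (hypothetical values).
TOY_POINTS = {"A": (0.0, 1.0), "G": (1.0, 0.0)}

traj = iterative_map("AG", TOY_POINTS, ratios=(0.5, 0.25))
print(traj)
```

With a full 20-residue table and five ratios, the same function would yield a 5D trajectory per protein, from which a numerical descriptor could be derived.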


A novel graphical representation and similarity analysis of protein sequences based on physicochemical properties

Physica A Statistical Mechanics and its Applications; doi:10.1016/j.physa.2018.07.011; 2018; Vol 510; pp. 477-485; Cited By ~ 3

Author(s): Mehri Mahmoodi-Reihani; Fatemeh Abbasitabar; Vahid Zare-Shahabadi

Keyword(s): Physicochemical Properties; Graphical Representation; Protein Sequences; Similarity Analysis


A Study on Host Tropism Determinants of Influenza Virus Using Machine Learning

Current Bioinformatics; doi:10.2174/1574893614666191104160927; 2020; Vol 15 (2); pp. 121-134; Cited By ~ 2

Author(s): Eunmi Kwon; Myeongji Cho; Hayeon Kim; Hyeon S. Son

Keyword(s): Machine Learning; Amino Acids; Influenza Virus; Random Forest; Physicochemical Properties; Protein Sequences; Influenza Viruses; Host Tropism; Post Hoc; Ha Protein

Background: The host tropism determinants of influenza virus, which cause changes in the host range and increase the likelihood of interaction with specific hosts, are critical for understanding the infection and propagation of the virus in diverse host species. Methods: Six types of protein sequences of influenza viral strains isolated from three classes of hosts (avian, human, and swine) were obtained. Random forest, naïve Bayes classification, and k-nearest neighbor algorithms were used for host classification. The Java language was used for sequence analysis programming and for identifying host-specific position markers. Results: A machine learning technique was explored to derive the physicochemical properties of amino acids used in host classification and prediction. HA protein was found to play the most important role in determining host tropism of the influenza virus, and the random forest method yielded the highest accuracy in host prediction. Conserved amino acids that exhibited host-specific differences were also selected and verified, and they were found to be useful position markers for host classification. Finally, ANOVA and post-hoc testing revealed that the physicochemical properties of amino acids, comprising protein sequences combined with position markers, differed significantly among hosts. Conclusion: The host tropism determinants and position markers described in this study can be used in related research to classify, identify, and predict the hosts of influenza viruses that are currently susceptible or likely to be infected in the future.
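Of the three classifiers named above, k-nearest neighbor is the simplest to illustrate. The sketch below is a toy version in plain Python, not the authors' Java implementation: the function name `knn_predict`, the two-component feature vectors, and the training points in `TOY_TRAIN` are all hypothetical placeholders standing in for physicochemical features of real sequences.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, host label)


def knn_predict(train: List[Sample], query: Sequence[float], k: int = 3) -> str:
    """Predict a host label by majority vote among the k closest samples."""
    # Sort training samples by Euclidean distance to the query.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors; real features would be derived from sequences.
TOY_TRAIN: List[Sample] = [
    ((0.1, 0.2), "avian"),
    ((0.2, 0.1), "avian"),
    ((0.9, 0.8), "human"),
    ((0.8, 0.9), "human"),
    ((0.5, 0.1), "swine"),
]

print(knn_predict(TOY_TRAIN, (0.15, 0.15)))
print(knn_predict(TOY_TRAIN, (0.85, 0.85)))
```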


A Graphical Representation of Protein Sequences and Its Applications

Proceedings of the Fourth International Conference on Biological Information and Biomedical Engineering; doi:10.1145/3403782.3403812; 2020

Author(s): Ping-An He; Linlin Yan; Tianyu Zhu

Keyword(s): Graphical Representation; Protein Sequences


2-D graphical representation of protein sequences and its application to coronavirus phylogeny

BMB Reports; doi:10.5483/bmbrep.2008.41.3.217; 2008; Vol 41 (3); pp. 217-222; Cited By ~ 22

Author(s): Chun Li; Lili Xing; Xin Wang

Keyword(s): Graphical Representation; Protein Sequences


Comparative Studies Based on a 3-D Graphical Representation of Protein Sequences

Intelligent Computing Theories and Methodologies - Lecture Notes in Computer Science; doi:10.1007/978-3-319-22186-1_43; 2015; pp. 436-444

Author(s): Yingzhao Liu; Yan-chun Yang; Tian-ming Wang

Keyword(s): Comparative Studies; Graphical Representation; Protein Sequences


Similarity/Dissimilarity Analysis of Protein Sequences Based on a New Spectrum-Like Graphical Representation

Evolutionary Bioinformatics; doi:10.4137/ebo.s14713; 2014; Vol 10; pp. EBO.S14713; Cited By ~ 10

Author(s): Yuhua Yao; Shoujiang Yan; Huimin Xu; Jianning Han; Xuying Nan; ...

Keyword(s): Graphical Representation; Protein Sequences


Measuring Similarity among Protein Sequences Using a New Descriptor

BioMed Research International; doi:10.1155/2019/2796971; 2019; Vol 2019; pp. 1-10; Cited By ~ 1

Author(s): Mervat M. Abo-Elkhier; Marwa A. Abd Elwahaab; Moheb I. Abo El Maaty

Keyword(s): Protein Sequence; Nadh Dehydrogenase; Graphical Representation; Protein Sequences; Computation Time; Fundamental Aspect; Beta Globin; Nadh Dehydrogenase Subunit; The Public; Sequencing Technologies

The comparison of protein sequences according to similarity is a fundamental aspect of today’s biomedical research. With the development of sequencing technologies, the number of protein sequences in public databases is increasing exponentially. The best-known sequence comparison methods are alignment based. They generally give excellent results when the sequences under study are closely related, but they are time consuming. Herein, a new alignment-free method is introduced. Our technique depends on a new graphical representation and descriptor. The graphical representation is a simple way to visualize protein sequences. The descriptor compresses the primary sequence into a single vector composed of only two values. Our approach gives good results with both short and long sequences with little computation time. It is applied to nine beta globin, nine ND5 (NADH dehydrogenase subunit 5), and 24 spike protein sequences. Correlation and significance analyses are also introduced to compare our similarity/dissimilarity results with the results of other approaches and with sequence homology.
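The abstract does not say which correlation coefficient is used. Assuming the Pearson coefficient, a minimal sketch of the correlation step could look like the following, where the two input lists stand for pairwise dissimilarity values from two methods listed in the same order. The function name `pearson` and the demo values are hypothetical placeholders, not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long value lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder dissimilarity values for the same sequence pairs (not real data).
method_a = [1.0, 2.0, 3.0]
method_b = [2.0, 4.0, 6.0]
print(pearson(method_a, method_b))
```

A coefficient near 1 would indicate that the two methods rank the sequence pairs in a consistent, linearly related way.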


Protein Sequence Comparison Based on Physicochemical Properties and the Position-Feature Energy Matrix

Scientific Reports; doi:10.1038/srep46237; 2017; Vol 7 (1); Cited By ~ 8

Author(s): Lulu Yu; Yusen Zhang; Ivan Gutman; Yongtang Shi; Matthias Dehmer

Keyword(s): Physicochemical Properties; Protein Sequence; Sequence Comparison; Protein Sequences; Local Dynamic; Order Information; Energy Matrix; Graph Energy; Feature Based; Protein Sequence Comparison

Abstract: We develop a novel position-feature-based model for protein sequences by employing physicochemical properties of the 20 amino acids and the measure of graph energy. The method emphasizes sequence order information and describes local dynamic distributions of sequences, from which one can obtain a characteristic B-vector. Afterwards, we apply the relative entropy to the B-vectors representing the sequences to measure their similarity/dissimilarity. The numerical results obtained in this study show that the proposed method leads to meaningful results compared with competitors such as Clustal W.
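As an illustration of the relative-entropy step only, here is a minimal sketch that treats two non-negative vectors as discrete distributions and computes the Kullback-Leibler divergence between them. How the B-vector is built is not described in the abstract and is not attempted here. The names `normalize`, `relative_entropy`, and `symmetric_divergence` are hypothetical, and the symmetrized sum is one common choice, not necessarily the one used in the paper.

```python
import math
from typing import List, Sequence


def normalize(values: Sequence[float]) -> List[float]:
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def relative_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence D(p || q) in nats.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Symmetrized form D(p || q) + D(q || p), usable as a dissimilarity."""
    return relative_entropy(p, q) + relative_entropy(q, p)


# Placeholder vectors standing in for two sequences' feature vectors.
p = normalize([2.0, 2.0])
q = normalize([1.0, 3.0])
print(relative_entropy(p, q), symmetric_divergence(p, q))
```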


Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation

Combinatorial Chemistry & High Throughput Screening; doi:10.2174/1386207321666180130100838; 2018; Vol 21 (2); pp. 100-110; Cited By ~ 3

Author(s): Chun Li; Jialing Zhao; Changzhong Wang; Yuhua Yao

Keyword(s): Dna Binding; Protein Sequence; Protein Identification; Binding Proteins; Graphical Representation; Sequence Data; Protein Sequences; Dna Binding Proteins; Support Vector; Letter Sequence

Aim and Objective: The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for the timely encoding of protein sequences and the extraction of hidden information. Methods: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. Results: Using the proposed mathematical descriptor of a protein sequence, similarity comparisons were made among β-globin proteins of 17 species and among 72 spike proteins of coronaviruses. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experimental results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvements in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. Conclusion: These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.
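For readers unfamiliar with the reported metrics, the sketch below computes accuracy (ACC), the Matthews correlation coefficient (MCC), and the F1 score from a binary confusion matrix using their standard definitions. It assumes that "F1M" in the abstract denotes the usual F1-measure. The function name `classification_metrics` and the counts in the example are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute ACC, MCC and F1 from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    # MCC denominator is zero when any row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    return {"ACC": acc, "MCC": mcc, "F1": f1}


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

ACC and F1 are usually quoted as percentages while MCC lies in [-1, 1], which is why the abstract reports the MCC improvements as plain decimals.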
