UniLoc: A universal protein localization site predictor for eukaryotes and prokaryotes

bioRxiv ◽

10.1101/252916 ◽

2018 ◽

Author(s):

Hsin-Nan Lin ◽

Ching-Tai Chen ◽

Ting-Yi Sung ◽

Wen-Lian Hsu

Keyword(s):

Sequence Data ◽

Protein Localization ◽

Sequence Similarity ◽

Computation Method ◽

Efficient Computation ◽

Protein Subcellular Localization ◽

Machine Learning Methods ◽

Site Assignment ◽

Protein Sequence Data ◽

Universal Protein

Abstract: There is a growing gap between protein subcellular localization (PSL) data and protein sequence data, raising the need for computation methods to rapidly determine subcellular localizations for uncharacterized proteins. Currently, the most efficient computation method involves finding sequence-similar proteins (hereafter referred to as similar proteins) in the annotated database and transferring their annotations to the target protein. When a sequence-similarity search fails to find similar proteins, many PSL predictors adopt machine learning methods for the prediction of localization sites. We propose a universal protein localization site predictor, UniLoc, that takes advantage of implicit similarity among proteins through sequence analysis alone. The notion of related protein words is introduced to explore the localization site assignment of uncharacterized proteins. UniLoc is found to identify useful template proteins and produce reliable predictions when similar proteins are not available.
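
As a rough illustration of the similarity-based annotation transfer described in this abstract, the sketch below assigns a localization label from the annotated protein that shares the most short words (k-mers) with the target. The k-mer Jaccard measure, the function names, the threshold, and the toy sequences are all assumptions made for illustration; this is not UniLoc's algorithm, whose details are not given here.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers ("words") in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def transfer_annotation(target, annotated, k=3, min_similarity=0.2):
    """Return the label of the most similar annotated protein.

    `annotated` maps sequence -> localization label. Returns None when no
    annotated protein reaches `min_similarity` (a hypothetical cut-off),
    which is the case where a predictor would need another strategy.
    """
    target_words = kmers(target, k)
    best_label, best_score = None, 0.0
    for seq, label in annotated.items():
        score = jaccard(target_words, kmers(seq, k))
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= min_similarity else None


# Toy annotated database (made-up sequences and labels).
ANNOTATED = {
    "MKTLLVAGAVL": "extracellular",
    "MSDEEKRKQAE": "nucleus",
}
```

Calling `transfer_annotation("MKTLLVAGSVL", ANNOTATED)` picks the first entry because the two sequences share most of their 3-mers, while a target with no shared words falls below the threshold and yields `None`.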

Terpene synthases are widely distributed in bacteria

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1422108112 ◽

2014 ◽

Vol 112 (3) ◽

pp. 857-862 ◽

Cited By ~ 222

Author(s):

Yuuki Yamada ◽

Tomohisa Kuzuyama ◽

Mamoru Komatsu ◽

Kazuo Shin-ya ◽

Satoshi Omura ◽

...

Keyword(s):

Markov Models ◽

Sequence Data ◽

Sequence Similarity ◽

Draft Genome ◽

Terpene Synthase ◽

Terpene Synthases ◽

Bacterial Origin ◽

Genome Data ◽

Protein Sequence Data ◽

Spectroscopic Identification

Odoriferous terpene metabolites of bacterial origin have been known for many years. In genome-sequenced Streptomycetaceae microorganisms, the vast majority produces the degraded sesquiterpene alcohol geosmin. Two minor groups of bacteria do not produce geosmin, with one of these groups instead producing other sesquiterpene alcohols, whereas members of the remaining group do not produce any detectable terpenoid metabolites. Because bacterial terpene synthases typically show no significant overall sequence similarity to any other known fungal or plant terpene synthases and usually exhibit relatively low levels of mutual sequence similarity with other bacterial synthases, simple correlation of protein sequence data with the structure of the cyclized terpene product has been precluded. We have previously described a powerful search method based on the use of hidden Markov models (HMMs) and protein families database (Pfam) search that has allowed the discovery of monoterpene synthases of bacterial origin. Using an enhanced set of HMM parameters generated using a training set of 140 previously identified bacterial terpene synthase sequences, a Pfam search of 8,759,463 predicted bacterial proteins from public databases and in-house draft genome data has now revealed 262 presumptive terpene synthases. The biochemical function of a considerable number of these presumptive terpene synthase genes could be determined by expression in a specially engineered heterologous Streptomyces host and spectroscopic identification of the resulting terpene products. In addition to a wide variety of terpenes that had been previously reported from fungal or plant sources, we have isolated and determined the complete structures of 13 previously unidentified cyclic sesquiterpenes and diterpenes.

Comprehensive Analysis of Non Redundant Protein Database

10.21203/rs.3.rs-54568/v1 ◽

2020 ◽

Author(s):

Hamid Bagheri ◽

Robert Dyer ◽

Andrew Severin ◽

Hridesh Rajan

Keyword(s):

Functional Annotation ◽

Data Science ◽

Sequence Data ◽

Average Length ◽

Sequence Similarity ◽

Protein Sequences ◽

Taxonomic Assignment ◽

Protein Database ◽

Protein Sequence Data ◽

Taxonomic Assignments

Abstract. Background: Scientists around the world use NCBI’s non-redundant (NR) database to identify the taxonomic origin and functional annotation of their favorite protein sequences using BLAST. Unfortunately, due to the exponential growth of this database, many scientists do not have a good understanding of the contents of the NR database. There is a need for tools to explore the contents of large biological datasets, such as NR, to better understand the assumptions and limitations of the data they contain. Results: Protein sequence data, protein functional annotation, and taxonomic assignment from NCBI’s NR database were placed into a BoaG database, a domain-specific language and shared data science infrastructure for genomics, along with a CD-HIT clustering of all these protein sequences at different sequence similarity levels. We show that BoaG can efficiently perform queries on this large dataset to determine the average length of protein sequences and identify the most common taxonomic assignments and functional annotations. Using the clustering information, we also show that the non-redundant (NR) database has a considerable amount of annotation redundancy at the 95% similarity level. Conclusions: We implemented BoaG and provided a web-based interface to BoaG’s infrastructure that will help researchers to explore the dataset further. Researchers can submit queries and download the results or share them with others. Availability and implementation: The web interface of the BoaG infrastructure can be accessed here: http://boa.cs.iastate.edu/boag. Please use user = boag and password = boag to log in. Source code and other documentation are also provided as a GitHub repository: https://github.com/boalang/NR_Dataset.
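
The kind of summary query mentioned in this abstract (average sequence length, most common taxonomic assignments and functional annotations) can be sketched in a few lines. The example below is plain Python over in-memory records, not the BoaG language; the record layout, the function name `summarize`, and the toy data are assumptions for illustration only.

```python
from collections import Counter


def summarize(records, top_n=3):
    """Summarize (sequence, taxon, annotation) records.

    Returns the mean sequence length and the `top_n` most frequent
    taxonomic assignments and functional annotations with their counts.
    """
    records = list(records)
    if not records:
        return {"average_length": 0.0, "top_taxa": [], "top_annotations": []}
    total_length = sum(len(seq) for seq, _, _ in records)
    taxa = Counter(taxon for _, taxon, _ in records)
    annotations = Counter(annotation for _, _, annotation in records)
    return {
        "average_length": total_length / len(records),
        "top_taxa": taxa.most_common(top_n),
        "top_annotations": annotations.most_common(top_n),
    }


# Toy records (made-up sequences and labels).
RECORDS = [
    ("MKT", "Escherichia coli", "kinase"),
    ("MKTLL", "Escherichia coli", "hypothetical protein"),
    ("MSDE", "Homo sapiens", "kinase"),
]
```

At the scale of NR, a query like this would run on a distributed backend rather than in memory; the sketch only shows the shape of the aggregation.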

SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data

Bioinformatics ◽

10.1093/bioinformatics/btm115 ◽

2007 ◽

Vol 23 (11) ◽

pp. 1410-1417 ◽

Cited By ~ 82

Author(s):

Hagit Shatkay ◽

Annette Höglund ◽

Scott Brady ◽

Torsten Blum ◽

Pierre Dönnes ◽

...

Keyword(s):

Subcellular Localization ◽

Protein Sequence ◽

Sequence Data ◽

High Accuracy ◽

Protein Subcellular Localization ◽

Protein Sequence Data

PROTEIN STRUCTURE AND FOLD PREDICTION USING TREE-AUGMENTED NAÏVE BAYESIAN CLASSIFIER

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720005001302 ◽

2005 ◽

Vol 03 (04) ◽

pp. 803-819 ◽

Cited By ~ 36

Author(s):

ARUNKUMAR CHINNASAMY ◽

WING-KIN SUNG ◽

ANKUSH MITTAL

Keyword(s):

Protein Structure ◽

Bayesian Networks ◽

Protein Sequence ◽

Sequence Data ◽

Sequence Similarity ◽

Support Vector ◽

Protein Sequence Data ◽

Binary Classifiers ◽

Feature Discretization ◽

Multi Classification

Due to the large volume of protein sequence data, computational methods to determine the structure class and the fold class of a protein sequence have become essential. Several techniques based on sequence similarity, Neural Networks, Support Vector Machines (SVMs), etc. have been applied. Since most of these classifiers use binary classifiers for multi-classification, as many as N(N-1)/2 (N choose 2) classifiers may be required for N classes. This paper presents a framework using Tree-Augmented Bayesian Networks (TAN), which performs multi-classification based on the theory of learning Bayesian Networks and uses the improved feature vector representation of Ding et al. (2001). To enhance TAN's performance, the data are pre-processed by feature discretization and post-processed using a Mean Probability Voting (MPV) scheme. The advantage of the Bayesian approach over other learning methods is that the network structure is intuitive. In addition, one can read off the TAN structure probabilities to determine the significance of each feature (say, hydrophobicity) for each class, which helps to further understand the complexity in protein structure. Experiments on the datasets used in three prominent recent works show that our approach is more accurate than other discriminative methods. The framework is implemented on the BAYESPROT web server, where more detailed results are also available.
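
The classifier count mentioned in this abstract for pairwise decomposition of a multi-class problem can be made concrete. The sketch below reads that count as "N choose 2", i.e. N(N-1)/2 binary classifiers for N classes in a one-vs-one scheme; that reading, the function names, and the example class labels are assumptions for illustration.

```python
from itertools import combinations
from math import comb


def pairwise_classifier_count(n_classes):
    """Number of binary classifiers in a one-vs-one scheme: N choose 2."""
    return comb(n_classes, 2)


def pairwise_tasks(classes):
    """List every pair of classes that would get its own binary classifier."""
    return list(combinations(classes, 2))


# Illustrative class labels only.
STRUCTURE_CLASSES = ["all-alpha", "all-beta", "alpha/beta", "alpha+beta"]
```

With four classes this gives six pairwise classifiers, and the count grows quadratically, which is the cost a single multi-class model such as TAN avoids.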

Protein transport : bioinformatics methods for understanding protein subcellular localization

10.32469/10355/67922 ◽

2018 ◽

Author(s):

Ning Zhang

Keyword(s):

Machine Learning ◽

Subcellular Localization ◽

Protein Transport ◽

Protein Localization ◽

Mitochondrial Protein ◽

Computational Prediction ◽

Protein Subcellular Localization ◽

Learning Methods ◽

Web Resource ◽

Machine Learning Methods

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI AT AUTHOR'S REQUEST.] Eukaryotic cells contain diverse subcellular organelles. These organelles form distinct functional cellular compartments where different biological processes and functions are carried out. The accurate translocation of a protein is crucial to establish and maintain cellular organization and function. Newly synthesized proteins are transported to different cellular components with the assistance of protein transport machineries and complex targeting signals. Mis-localization of proteins is often associated with metabolic disorders and diseases. Compared with experimental methods, computational prediction of protein localization, utilizing different machine learning methods, provides an efficient and effective way of studying protein subcellular localization at the whole-proteome level. This dissertation presents bioinformatics methods for studying protein subcellular localization. We review the studies of protein subcellular transport and machine learning methods in bioinformatics, present our work on mitochondrial protein targeting prediction in plants, summarize the ongoing development of a web resource for protein subcellular localization, and discuss future work and development.

Towards an Efficient CNN Inference Architecture Enabling In-Sensor Processing

Sensors ◽

10.3390/s21061955 ◽

2021 ◽

Vol 21 (6) ◽

pp. 1955

Author(s):

Md Jubaer Hossain Pantho ◽

Pankaj Bhowmik ◽

Christophe Bobda

Keyword(s):

Power Consumption ◽

High Speed ◽

Image Sensor ◽

Machine Learning Algorithms ◽

Hierarchical Optimization ◽

Computation Method ◽

Optimization Approach ◽

Efficient Computation ◽

Feature Maps ◽

Dynamic Power

The astounding development of optical sensing imaging technology, coupled with the impressive improvements in machine learning algorithms, has increased our ability to understand and extract information from scenic events. In most cases, convolutional neural networks (CNNs) are adopted to infer knowledge due to their surprising success in automation, surveillance, and many other application domains. However, the overwhelming computation demand of the convolution operations has somewhat limited their use in remote sensing edge devices. In these platforms, real-time processing remains a challenging task due to the tight constraints on resources and power. Here, the transfer and processing of non-relevant image pixels act as a bottleneck on the entire system. It is possible to overcome this bottleneck by exploiting the high bandwidth available at the sensor interface and designing a CNN inference architecture near the sensor. This paper presents an attention-based pixel processing architecture to facilitate CNN inference near the image sensor. We propose an efficient computation method to reduce the dynamic power by decreasing the overall computation of the convolution operations. The proposed method reduces redundancies by using a hierarchical optimization approach. The approach minimizes power consumption for convolution operations by exploiting the spatio-temporal redundancies found in the incoming feature maps and performs computations only on selected regions based on their relevance score. The proposed design addresses problems related to the mapping of computations onto an array of processing elements (PEs) and introduces a suitable network structure for communication. The PEs are highly optimized to provide low latency and power for CNN applications. While designing the model, we exploit the concepts of biological vision systems to reduce computation and energy. We prototype the model on a Virtex UltraScale+ FPGA and implement it as an Application Specific Integrated Circuit (ASIC) using the TSMC 90nm technology library. The results suggest that the proposed architecture significantly reduces dynamic power consumption and achieves a high speed-up, surpassing the computational capabilities of existing embedded processors.

Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins

Scientific Reports ◽

10.1038/s41598-021-81063-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Dimitri Boeckaerts ◽

Michiel Stock ◽

Bjorn Criel ◽

Hans Gerstmans ◽

Bernard De Baets ◽

...

Keyword(s):

Machine Learning ◽

Predictive Model ◽

Receptor Binding ◽

Bacterial Infections ◽

Sequence Data ◽

Sequence Similarity ◽

Area Under The Curve ◽

Local Alignment ◽

Search Tool ◽

Different Levels

Abstract: Nowadays, bacteriophages are increasingly considered as an alternative treatment for a variety of bacterial infections in cases where classical antibiotics have become ineffective. However, characterizing the host specificity of phages remains a labor- and time-intensive process. In order to alleviate this burden, we have developed a new machine-learning-based pipeline to predict bacteriophage hosts based on annotated receptor-binding protein (RBP) sequence data. We focus on predicting bacterial hosts from the ESKAPE group, Escherichia coli, Salmonella enterica and Clostridium difficile. We compare the performance of our predictive model with that of the widely used Basic Local Alignment Search Tool (BLAST). Our best-performing predictive model reaches Precision-Recall Area Under the Curve (PR-AUC) scores between 73.6 and 93.8% for different levels of sequence similarity in the collected data. Our model reaches a performance comparable to that of BLASTp when sequence similarity in the data is high and starts outperforming BLASTp when sequence similarity drops below 75%. Therefore, our machine learning methods can be especially useful in settings in which sequence similarity to other known sequences is low. Predicting the hosts of novel metagenomic RBP sequences could extend our toolbox to tune the host spectrum of phages or phage tail-like bacteriocins by swapping RBPs.
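
For readers unfamiliar with the PR-AUC metric reported in this abstract, the sketch below computes average precision, a common step-wise estimate of the area under the precision-recall curve, for a single binary task. The function name and the toy scores are assumptions; the abstract does not say how the authors computed or averaged their scores, so this is an illustration of the metric, not of their pipeline.

```python
def average_precision(scores, labels):
    """Average precision for binary labels (1 = positive, 0 = negative).

    Items are ranked by descending score; ties keep their input order.
    The result is the mean of the precision values measured at the rank
    of each positive item.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_positive
```

A ranking that places every positive above every negative scores 1.0; each negative ranked above a positive pulls the value down.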

AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models

Entropy ◽

10.3390/e23050530 ◽

2021 ◽

Vol 23 (5) ◽

pp. 530

Author(s):

Milton Silva ◽

Diogo Pratas ◽

Armando J. Pinho

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Specific Protein ◽

General Purpose ◽

Amino Acid Sequences ◽

Input Size ◽

Protein Sequence Data ◽

Analysis Application ◽

Straightforward Solution ◽

Human Coronaviruses

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors only marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models to the highest-context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of three times slower computations. AC2 also improves memory usage against AC, with requirements about seven times lower, without being affected by the sequences’ input size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under the GPLv3 license.
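
The idea of using a compressor to measure sequence similarity, mentioned in this abstract as an analysis application, can be sketched with a general-purpose compressor. The example below uses Python's `zlib` and the normalized compression distance (NCD) as stand-ins; it does not use AC2, and the abstract does not state which similarity measure the authors applied, so both choices are assumptions for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (maximum compression)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    Values near 0 suggest the sequences share most of their content;
    values near 1 suggest the compressor finds little in common.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A specialized protein compressor would model amino acid statistics far better than `zlib`, so distances from this sketch are only a coarse proxy, and very short sequences are dominated by compressor overhead.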

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.
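
To make the notion of tree "length" in this abstract concrete, the sketch below scores a fixed candidate tree by counting the minimum number of character changes it requires, using Fitch's small-parsimony algorithm on a rooted binary tree. This is a standard technique swapped in for illustration and is not the authors' construction or verification method; the nested-tuple tree encoding, the function names, and the toy alignment are assumptions.

```python
def fitch_site(tree, states):
    """Fitch pass for one alignment column.

    `tree` is a leaf name (str) or a pair (left, right) of subtrees;
    `states` maps leaf name -> character at this column.
    Returns (possible ancestral states, minimum number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch_site(left, states)
    right_states, right_changes = fitch_site(right, states)
    common = left_states & right_states
    if common:
        return common, left_changes + right_changes
    # Disjoint child sets force one extra change on this node's branches.
    return left_states | right_states, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Total parsimony length: sum of per-column minimum change counts."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Toy alignment of four taxa (made-up data).
ALIGNMENT = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "TCG"}
```

Comparing `tree_length((("A", "B"), ("C", "D")), ALIGNMENT)` with the length of an alternative grouping such as `(("A", "C"), ("B", "D"))` shows how candidate trees are ranked; a minimal tree is one for which no other topology gives a smaller length.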
