UniLoc: A universal protein localization site predictor for eukaryotes and prokaryotes

2018 ◽  
Author(s):  
Hsin-Nan Lin ◽  
Ching-Tai Chen ◽  
Ting-Yi Sung ◽  
Wen-Lian Hsu

ABSTRACT There is a growing gap between protein subcellular localization (PSL) data and protein sequence data, raising the need for computational methods to rapidly determine subcellular localizations for uncharacterized proteins. Currently, the most efficient computational method involves finding sequence-similar proteins (hereafter referred to as similar proteins) in an annotated database and transferring their annotations to the target protein. When a sequence-similarity search fails to find similar proteins, many PSL predictors adopt machine learning methods to predict localization sites. We propose a universal protein localization site predictor, UniLoc, that takes advantage of implicit similarity among proteins through sequence analysis alone. The notion of related protein words is introduced to explore the localization site assignment of uncharacterized proteins. UniLoc is found to identify useful template proteins and to produce reliable predictions when similar proteins are not available.

2014 ◽  
Vol 112 (3) ◽  
pp. 857-862 ◽  
Author(s):  
Yuuki Yamada ◽  
Tomohisa Kuzuyama ◽  
Mamoru Komatsu ◽  
Kazuo Shin-ya ◽  
Satoshi Omura ◽  
...  

Odoriferous terpene metabolites of bacterial origin have been known for many years. In genome-sequenced Streptomycetaceae microorganisms, the vast majority produces the degraded sesquiterpene alcohol geosmin. Two minor groups of bacteria do not produce geosmin, with one of these groups instead producing other sesquiterpene alcohols, whereas members of the remaining group do not produce any detectable terpenoid metabolites. Because bacterial terpene synthases typically show no significant overall sequence similarity to any other known fungal or plant terpene synthases and usually exhibit relatively low levels of mutual sequence similarity with other bacterial synthases, simple correlation of protein sequence data with the structure of the cyclized terpene product has been precluded. We have previously described a powerful search method based on the use of hidden Markov models (HMMs) and protein families database (Pfam) search that has allowed the discovery of monoterpene synthases of bacterial origin. Using an enhanced set of HMM parameters generated using a training set of 140 previously identified bacterial terpene synthase sequences, a Pfam search of 8,759,463 predicted bacterial proteins from public databases and in-house draft genome data has now revealed 262 presumptive terpene synthases. The biochemical function of a considerable number of these presumptive terpene synthase genes could be determined by expression in a specially engineered heterologous Streptomyces host and spectroscopic identification of the resulting terpene products. In addition to a wide variety of terpenes that had been previously reported from fungal or plant sources, we have isolated and determined the complete structures of 13 previously unidentified cyclic sesquiterpenes and diterpenes.


2020 ◽  
Author(s):  
Hamid Bagheri ◽  
Robert Dyer ◽  
Andrew Severin ◽  
Hridesh Rajan

Abstract Background: Scientists around the world use NCBI’s non-redundant (NR) database to identify the taxonomic origin and functional annotation of their favorite protein sequences using BLAST. Unfortunately, due to the exponential growth of this database, many scientists do not have a good understanding of the contents of the NR database. There is a need for tools to explore the contents of large biological datasets, such as NR, to better understand the assumptions and limitations of the data they contain. Results: Protein sequence data, protein functional annotation, and taxonomic assignment from NCBI’s NR database were placed into a BoaG database, a domain-specific language and shared data science infrastructure for genomics, along with a CD-HIT clustering of all these protein sequences at different sequence similarity levels. We show that BoaG can efficiently perform queries on this large dataset to determine the average length of protein sequences and identify the most common taxonomic assignments and functional annotations. Using the clustering information, we also show that the non-redundant (NR) database has a considerable amount of annotation redundancy at the 95% similarity level. Conclusions: We implemented BoaG and provided a web-based interface to BoaG’s infrastructure that will help researchers to explore the dataset further. Researchers can submit queries and download the results or share them with others. Availability and implementation: The web-interface of the BoaG infrastructure can be accessed here: http://boa.cs.iastate.edu/boag. Please use user = boag and password = boag to login. Source code and other documentation are also provided as a GitHub repository: https://github.com/boalang/NR_Dataset.
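The kind of summary query described above (average sequence length, most common taxonomic assignment) can be illustrated with a minimal plain-Python sketch. This is not BoaG code and does not use the BoaG language; the function names, the toy records, and the assumption that the organism name sits in trailing square brackets of each FASTA header are all illustrative.

```python
from collections import Counter
from io import StringIO

# Toy FASTA text; the records are invented for illustration only.
TOY_FASTA = """>p1 hypothetical protein [Escherichia coli]
MKT
AA
>p2 kinase [Escherichia coli]
MKTAYIAK
>p3 transporter [Homo sapiens]
MA
"""

def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def summarize(handle):
    """Return (mean sequence length, most common organism as a one-item list)."""
    lengths = []
    taxa = Counter()
    for header, seq in parse_fasta(handle):
        lengths.append(len(seq))
        # Assumption: organism name is given in trailing square brackets.
        if header.endswith("]") and "[" in header:
            taxa[header[header.rindex("[") + 1:-1]] += 1
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    return mean_length, taxa.most_common(1)

mean_length, top_taxon = summarize(StringIO(TOY_FASTA))
print(mean_length, top_taxon)
```

On a dataset the size of NR, a single-machine loop like this would be slow, which is the motivation the abstract gives for a shared query infrastructure.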


2007 ◽  
Vol 23 (11) ◽  
pp. 1410-1417 ◽  
Author(s):  
Hagit Shatkay ◽  
Annette Höglund ◽  
Scott Brady ◽  
Torsten Blum ◽  
Pierre Dönnes ◽  
...  

2005 ◽  
Vol 03 (04) ◽  
pp. 803-819 ◽  
Author(s):  
ARUNKUMAR CHINNASAMY ◽  
WING-KIN SUNG ◽  
ANKUSH MITTAL

Due to the large volume of protein sequence data, computational methods to determine the structure class and the fold class of a protein sequence have become essential. Several techniques based on sequence similarity, Neural Networks, Support Vector Machines (SVMs), etc. have been applied. Since most of these classifiers use binary classifiers for multi-classification, as many as N(N-1)/2 (that is, N choose 2) classifiers may be required for N classes. This paper presents a framework using the Tree-Augmented Bayesian Networks (TAN) which performs multi-classification based on the theory of learning Bayesian Networks and using the improved feature vector representation of Ding et al. (2001). In order to enhance TAN's performance, pre-processing of data is done by feature discretization and post-processing is done by using a Mean Probability Voting (MPV) scheme. The advantage of using a Bayesian approach over other learning methods is that the network structure is intuitive. In addition, one can read off the TAN structure probabilities to determine the significance of each feature (say, hydrophobicity) for each class, which helps to further understand the complexity in protein structure. The experiments on the datasets used in three prominent recent works show that our approach is more accurate than other discriminative methods. The framework is implemented on the BAYESPROT web server and it is available at . More detailed results are also available on the above website.
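Two quantities mentioned in this abstract can be made concrete with a short sketch: the number of pairwise binary classifiers needed for N classes, and a probability-averaging vote. The abstract does not define MPV, so the voting function below is only one plausible reading (average the class-probability vectors from several models, then pick the class with the highest mean); both function names are hypothetical.

```python
from math import comb

def pairwise_classifier_count(n_classes):
    """Number of one-vs-one binary classifiers for n_classes: C(n, 2)."""
    return comb(n_classes, 2)

def mean_probability_vote(prob_vectors):
    """Average several class-probability vectors and return (best_class, mean).

    Assumption: each inner list holds one model's probabilities over the
    same ordered set of classes. This is an illustrative reading of
    'mean probability voting', not the paper's exact scheme.
    """
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    mean = [sum(p[i] for p in prob_vectors) / n_models for i in range(n_classes)]
    best_class = max(range(n_classes), key=mean.__getitem__)
    return best_class, mean

print(pairwise_classifier_count(27))
best_class, mean = mean_probability_vote(
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]
)
print(best_class, mean)
```

The count grows quadratically with the number of classes, which is why a single multi-class model is attractive when there are many fold classes.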


2018 ◽  
Author(s):  
Ning Zhang

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI AT AUTHOR'S REQUEST.] Eukaryotic cells contain diverse subcellular organelles. These organelles form distinct functional cellular compartments where different biological processes and functions are carried out. The accurate translocation of a protein is crucial to establish and maintain cellular organization and function. Newly synthesized proteins are transported to different cellular components with the assistance of protein transport machineries and complex targeting signals. Mis-localization of proteins is often associated with metabolic disorders and diseases. Compared with experimental methods, computational prediction of protein localization, utilizing different machine learning methods, provides an efficient and effective way for studying the protein subcellular localization on the whole-proteome level. Here, we present in this dissertation the bioinformatics methods for studying protein subcellular localization. We reviewed the studies of protein subcellular transport and machine learning methods in bioinformatics, presented our work on mitochondrial protein targeting prediction in plants, summarized the ongoing development of a web-resource for protein subcellular localization, and discussed the future work and development.


Sensors ◽  
2021 ◽  
Vol 21 (6) ◽  
pp. 1955
Author(s):  
Md Jubaer Hossain Pantho ◽  
Pankaj Bhowmik ◽  
Christophe Bobda

The astounding development of optical sensing imaging technology, coupled with the impressive improvements in machine learning algorithms, has increased our ability to understand and extract information from scenic events. In most cases, convolutional neural networks (CNNs) are adopted to infer knowledge due to their surprising success in automation, surveillance, and many other application domains. However, the convolution operations' overwhelming computation demand has somewhat limited their use in remote sensing edge devices. In these platforms, real-time processing remains a challenging task due to the tight constraints on resources and power. Here, the transfer and processing of non-relevant image pixels act as a bottleneck on the entire system. It is possible to overcome this bottleneck by exploiting the high bandwidth available at the sensor interface by designing a CNN inference architecture near the sensor. This paper presents an attention-based pixel processing architecture to facilitate the CNN inference near the image sensor. We propose an efficient computation method to reduce the dynamic power by decreasing the overall computation of the convolution operations. The proposed method reduces redundancies by using a hierarchical optimization approach. The approach minimizes power consumption for convolution operations by exploiting the spatio-temporal redundancies found in the incoming feature maps and performs computations only on selected regions based on their relevance score. The proposed design addresses problems related to the mapping of computations onto an array of processing elements (PEs) and introduces a suitable network structure for communication. The PEs are highly optimized to provide low latency and power for CNN applications. While designing the model, we exploit the concepts of biological vision systems to reduce computation and energy. We prototype the model in a Virtex UltraScale+ FPGA and implement it as an application-specific integrated circuit (ASIC) using the TSMC 90nm technology library. The results suggest that the proposed architecture significantly reduces dynamic power consumption and achieves a high speedup, surpassing existing embedded processors' computational capabilities.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Dimitri Boeckaerts ◽  
Michiel Stock ◽  
Bjorn Criel ◽  
Hans Gerstmans ◽  
Bernard De Baets ◽  
...  

Abstract Nowadays, bacteriophages are increasingly considered as an alternative treatment for a variety of bacterial infections in cases where classical antibiotics have become ineffective. However, characterizing the host specificity of phages remains a labor- and time-intensive process. In order to alleviate this burden, we have developed a new machine-learning-based pipeline to predict bacteriophage hosts based on annotated receptor-binding protein (RBP) sequence data. We focus on predicting bacterial hosts from the ESKAPE group, Escherichia coli, Salmonella enterica and Clostridium difficile. We compare the performance of our predictive model with that of the widely used Basic Local Alignment Search Tool (BLAST). Our best-performing predictive model reaches Precision-Recall Area Under the Curve (PR-AUC) scores between 73.6 and 93.8% for different levels of sequence similarity in the collected data. Our model reaches a performance comparable to that of BLASTp when sequence similarity in the data is high and starts outperforming BLASTp when sequence similarity drops below 75%. Therefore, our machine learning methods can be especially useful in settings in which sequence similarity to other known sequences is low. Predicting the hosts of novel metagenomic RBP sequences could extend our toolbox to tune the host spectrum of phages or phage tail-like bacteriocins by swapping RBPs.


Entropy ◽  
2021 ◽  
Vol 23 (5) ◽  
pp. 530
Author(s):  
Milton Silva ◽  
Diogo Pratas ◽  
Armando J. Pinho

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors only marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models to the highest-context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of three times slower computations. AC2 also improves memory usage against AC, with requirements about seven times lower, without being affected by the sequences' input size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under GPLv3 license.
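The idea of using a compressor to measure sequence similarity can be sketched with the normalized compression distance (NCD). This is a stand-in, not AC2: it uses the general-purpose zlib compressor from the Python standard library, and the sequences are randomly generated toy data rather than real proteins. The abstract does not say which similarity measure AC2 uses, so NCD here is an assumption chosen for illustration.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: lower values suggest more shared content."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy, cxy = compressed_size(bx), compressed_size(by), compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy data: random strings over the 20 standard amino acid letters.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
rng = random.Random(0)
seq_a = "".join(rng.choice(ALPHABET) for _ in range(300))
seq_b = "".join(rng.choice(ALPHABET) for _ in range(300))
# A near copy of seq_a with two positions overwritten.
seq_a_variant = seq_a[:100] + "W" + seq_a[101:200] + "G" + seq_a[201:]

print(ncd(seq_a, seq_a_variant), ncd(seq_a, seq_b))
```

A specialized compressor matters here because the better the compressor models protein sequences, the more faithfully the compressed sizes reflect shared information.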


1980 ◽  
Vol 187 (1) ◽  
pp. 65-74 ◽  
Author(s):  
D Penny ◽  
M D Hendy ◽  
L R Foulds

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.
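What "shortest" means for a tree can be illustrated by counting the minimum number of character changes a fixed tree requires. The sketch below uses Fitch's small-parsimony algorithm, a standard technique, and is not the authors' taxon-by-taxon construction method. The tree encoding (nested tuples of leaf names), the function names, and the four-taxon nucleotide alignment are illustrative assumptions.

```python
def fitch_site(tree, states):
    """Return (candidate state set, minimum changes) for one alignment column.

    tree is either a leaf name (str) or a pair (left_subtree, right_subtree);
    states maps each leaf name to its character at this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: one change is required on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1

def tree_length(tree, alignment):
    """Total minimum number of changes over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total

# Toy alignment, invented for illustration.
alignment = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(tree_length(tree_1, alignment), tree_length(tree_2, alignment))
```

Under this criterion, a minimal tree is one whose length is no greater than that of any other tree on the same taxa; the search over tree shapes is the hard part that the abstract's method addresses.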

