Deep Neural Network for Protein Contact Prediction by Weighting Sequences in a Multiple Sequence Alignment

2018 ◽  
Author(s):  
Hiroyuki Fukuda ◽  
Kentaro Tomii

Protein contact prediction is a crucial step in protein structure prediction. Two types of approach are used to predict contacts: evolutionary coupling analysis (ECA) and supervised learning. ECA uses a large multiple sequence alignment (MSA) of homologous sequences and extracts correlation information between residues. Supervised learning uses ECA results as input features and can achieve higher accuracy. Here we present a new approach to contact prediction that both extracts correlation information and predicts contacts in a supervised manner directly from the MSA using a deep neural network (DNN). With the DNN, we obtain higher accuracy than earlier ECA methods. At the same time, the network weights each sequence in the MSA, automatically eliminating noisy sequences in a supervised way. We expect that combining our method with other meta-learning methods can provide much higher contact prediction accuracy.
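For context, ECA pipelines conventionally down-weight redundant sequences with a fixed identity-based heuristic, which is the step this abstract proposes to learn in a supervised way instead. The following is a minimal sketch of that conventional heuristic only, not the authors' learned weighting; the function names, the 0.8 identity threshold, and the toy alignment are assumptions made for illustration.

```python
def sequence_identity(a, b):
    """Fraction of aligned columns at which two equal-length sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def identity_weights(msa, threshold=0.8):
    """Weight each sequence by 1 / (number of sequences at or above the identity threshold).

    Every sequence counts itself as a neighbour, so the divisor is never zero.
    """
    weights = []
    for s in msa:
        neighbours = sum(1 for t in msa if sequence_identity(s, t) >= threshold)
        weights.append(1.0 / neighbours)
    return weights


# Toy alignment (hypothetical): three near-identical sequences and one outlier.
msa = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]
weights = identity_weights(msa)
effective_n = sum(weights)  # effective number of sequences after reweighting
print(weights, effective_n)
```

In this toy case the three redundant sequences share one unit of weight and the outlier keeps a full unit. A learned weighting, as the abstract describes, would replace the fixed threshold with values fitted during training.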

2021 ◽  
Author(s):  
Yunda Si ◽  
Chengfei Yan

AlphaFold2 is expected to be able to predict protein complex structures as long as a multiple sequence alignment (MSA) of the interologs of the target protein-protein interaction (PPI) can be provided. However, preparing the MSA of protein-protein interologs is a non-trivial task. In this study, a simplified phylogeny-based approach was applied to generate the MSA of interologs, which was then used as the input to AlphaFold2 for protein complex structure prediction. Benchmarking this protocol extensively on a non-redundant PPI dataset, we show that complex structures can be successfully predicted for 79.5% of the bacterial PPIs and 49.8% of the eukaryotic PPIs. Considering that PPIs may not be conserved across species separated by long evolutionary distances, we further restricted the interologs in the MSA to different taxonomic ranks of the species of the target PPI. We found that the success rates increase to 87.9% for the bacterial PPIs and 56.3% for the eukaryotic PPIs if the interologs in the MSA are restricted to a specific taxonomic rank of the species of each target PPI. Finally, we show that the optimal taxonomic rank for protein complex structure prediction can be selected using the predicted TM-scores of the output models.
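The taxonomic restriction described above can be pictured as a simple filter over candidate interologs. This is a rough sketch under stated assumptions, not the authors' pipeline: the function name, the dictionary layout, and the small lineage table are hypothetical, and a real workflow would read lineages from a taxonomy database.

```python
def restrict_to_rank(target_lineage, interologs, rank):
    """Keep only interologs whose taxon at `rank` matches the target species."""
    wanted = target_lineage[rank]
    return [name for name, lineage in interologs.items()
            if lineage.get(rank) == wanted]


# Hypothetical lineage records, limited to two ranks for brevity.
target = {"family": "Enterobacteriaceae", "class": "Gammaproteobacteria"}  # e.g. E. coli
interologs = {
    "Salmonella enterica": {"family": "Enterobacteriaceae", "class": "Gammaproteobacteria"},
    "Vibrio cholerae": {"family": "Vibrionaceae", "class": "Gammaproteobacteria"},
    "Bacillus subtilis": {"family": "Bacillaceae", "class": "Bacilli"},
}

family_level = restrict_to_rank(target, interologs, "family")
class_level = restrict_to_rank(target, interologs, "class")
print(family_level)
print(class_level)
```

A narrower rank keeps fewer but more closely related interologs. The abstract reports that the best rank differs per target and can be chosen from the predicted TM-scores of the resulting models.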


2020 ◽  
Author(s):  
Fusong Ju ◽  
Jianwei Zhu ◽  
Bin Shao ◽  
Lupeng Kong ◽  
Tie-Yan Liu ◽  
...  

Protein functions are largely determined by the final details of their tertiary structures, and the structures can be accurately reconstructed from inter-residue distances. Residue co-evolution has become the primary principle for estimating inter-residue distances, since residues in close spatial proximity tend to co-evolve. The widely used approaches infer residue co-evolution using an indirect strategy: they first extract handcrafted features, such as a covariance matrix, from the multiple sequence alignment (MSA) of the query protein, and then infer residue co-evolution from these features rather than from the raw information carried by the MSA. This indirect strategy always leads to considerable information loss and inaccurate estimation of inter-residue distances. Here, we report a deep neural network framework (called CopulaNet) that learns residue co-evolution directly from the MSA without any handcrafted features. CopulaNet consists of two key elements: i) an encoder to model context-specific mutation for each residue, and ii) an aggregator to model correlations among residues and thereafter infer residue co-evolution. Using the CASP13 (the 13th Critical Assessment of Protein Structure Prediction) target proteins as representatives, we demonstrated the successful application of CopulaNet for estimating inter-residue distances and further predicting protein tertiary structure with improved accuracy and efficiency. Head-to-head comparison suggested that for 24 out of the 31 free modeling CASP13 domains, ProFOLD outperformed AlphaFold, one of the state-of-the-art prediction approaches.
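To make the "handcrafted feature" concrete, the sketch below computes covariance entries between two MSA columns as the joint frequency of a residue pair minus the product of the single-column frequencies. It illustrates the indirect strategy the abstract contrasts itself with, not CopulaNet; the function name and the toy alignment are assumptions, and real pipelines typically add pseudocounts and sequence weights.

```python
from collections import Counter


def column_covariance(msa, i, j):
    """Covariance f_ij(a, b) - f_i(a) * f_j(b) for every residue pair seen in columns i and j."""
    n = len(msa)
    f_i = Counter(seq[i] for seq in msa)
    f_j = Counter(seq[j] for seq in msa)
    f_ij = Counter((seq[i], seq[j]) for seq in msa)
    return {
        (a, b): f_ij[(a, b)] / n - (f_i[a] / n) * (f_j[b] / n)
        for a in f_i
        for b in f_j
    }


# Toy alignment (hypothetical): columns 0 and 1 co-vary, column 2 is fully conserved.
msa = ["AKG", "AKG", "DEG", "DEG"]
coupled = column_covariance(msa, 0, 1)
conserved = column_covariance(msa, 0, 2)
print(coupled)
print(conserved)
```

The coupled columns give non-zero covariance while the conserved column gives zero. That is the signal such features carry, and also why a fixed summary of this kind can discard information present in the raw alignment.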


2021 ◽  
Author(s):  
Liang Hong ◽  
Siqi Sun ◽  
Liangzhen Zheng ◽  
Qingxiong Tan ◽  
Yu Li

Evolutionarily related sequences provide information about protein structure and function. Multiple sequence alignment, which comprises homolog search against large databases followed by sequence alignment, is an effective way to extract this information and assist protein structure and function prediction, as demonstrated by AlphaFold. Despite the existing tools for multiple sequence alignment, searching for homologs across the entire UniProt is still time-consuming. Given the success of AlphaFold, large-scale multiple sequence alignment against massive databases is likely to become a trend in the field, so accelerating this step is highly desirable. Here, we propose a novel method, fastMSA, that improves the speed significantly. Our idea is orthogonal to all previous acceleration methods. Taking advantage of a protein language model based on BERT, we propose a novel dual encoder architecture that embeds protein sequences into a low-dimensional space and efficiently filters out unrelated sequences before running BLAST. Extensive experimental results suggest that we can recall most of the homologs with a 34-fold speed-up. Moreover, our method is compatible with downstream tasks such as structure prediction using AlphaFold: with multiple sequence alignments generated by our method, protein structure prediction performance is only slightly compromised while running time is much reduced. fastMSA will effectively assist protein sequence, structure, and function analysis based on homologs and multiple sequence alignment.
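The filtering step can be pictured as a nearest-neighbour search in embedding space that runs before the expensive alignment. The sketch below assumes the embeddings already exist and uses cosine similarity with a top-k cutoff. The vectors, sequence names, and function names are hypothetical, and the abstract does not state which similarity measure or cutoff fastMSA uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def prefilter(query_vec, db_vecs, top_k):
    """Return the names of the top_k database entries most similar to the query embedding."""
    ranked = sorted(db_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Hypothetical 3-dimensional embeddings standing in for encoder outputs.
query = [1.0, 0.0, 0.5]
database = {
    "seqA": [0.9, 0.1, 0.4],
    "seqB": [-1.0, 0.2, 0.0],
    "seqC": [0.8, 0.0, 0.6],
    "seqD": [0.0, 1.0, 0.0],
}
candidates = prefilter(query, database, top_k=2)
print(candidates)
```

Only the surviving candidates would then be passed to the alignment tool, which is where a speed-up of the kind reported in the abstract would come from.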


2019 ◽  
Vol 36 (7) ◽  
pp. 2105-2112 ◽  
Author(s):  
Chengxin Zhang ◽  
Wei Zheng ◽  
S M Mortuza ◽  
Yang Li ◽  
Yang Zhang

Motivation: The success of genome sequencing techniques has resulted in a rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information for modeling the structure and function of unknown proteins. There is, however, no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved. Results: We developed DeepMSA, a new open-source method for sensitive MSA construction, in which homologous sequences and alignments are created from multiple sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was used to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which increased long-range contact accuracy by up to 24.4% compared to the default programs. Next, multiple threading programs were run for homologous structure identification, where the average TM-score of the template alignments increased by over 7.5% with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in Q3 accuracy. All of these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library. Availability and implementation: https://zhanglab.ccmb.med.umich.edu/DeepMSA/. Supplementary information: Supplementary data are available at Bioinformatics online.
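For readers unfamiliar with the Q3 metric mentioned above, it is the fraction of residues whose three-state secondary structure label (helix H, strand E, coil C) is predicted correctly. A minimal sketch follows; the function name and the example strings are made up for illustration and are not taken from the DeepMSA benchmark.

```python
def q3_accuracy(predicted, observed):
    """Fraction of positions where the predicted 3-state label matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("label strings must be non-empty and of equal length")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return correct / len(observed)


# Hypothetical 17-residue example with two mislabelled positions.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHHCCCEEEECCC"
q3 = q3_accuracy(predicted, observed)
print(round(q3, 3))
```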

