Improving intermolecular contact prediction through protein-protein interaction prediction using coevolutionary analysis with expectation-maximization

Mapping Intimacies ◽

10.1101/254789 ◽

2018 ◽

Cited By ~ 1

Author(s):

Miguel Correa Marrero ◽

Richard G.H. Immink ◽

Dick de Ridder ◽

Aalt D.J. van Dijk

Keyword(s):

Mutation Analysis ◽

Protein Interaction ◽

Sequence Data ◽

Correct Identification ◽

Interacting Proteins ◽

Intermolecular Contact ◽

Sequence Alignments ◽

Contact Prediction ◽

Correlated Mutations ◽

Correlated Mutation

Predicting residue-residue contacts between interacting proteins is an important problem in bioinformatics. The growing wealth of sequence data can be used to infer these contacts through correlated mutation analysis on multiple sequence alignments of interacting homologs of the proteins of interest. This requires correct identification of pairs of interacting proteins for many species, in order to avoid introducing noise (i.e. non-interacting sequences) into the analysis that will decrease predictive performance. We have designed Ouroboros, a novel algorithm to reduce such noise in intermolecular contact prediction. Our method iterates between weighting proteins according to how likely they are to interact based on the correlated mutations signal, and predicting correlated mutations based on the weighted sequence alignment. We show that this approach accurately discriminates between interacting and non-interacting proteins and simultaneously improves the prediction of intermolecular contact residues compared to a naive application of correlated mutation analysis. Furthermore, the method relaxes the assumption of one-to-one interaction made by previous approaches, allowing for the study of many-to-many interactions. Source code and test data are available at www.bif.wur.nl/
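The alternation described in this abstract can be illustrated with a toy expectation-maximization loop. The sketch below is not the Ouroboros implementation: it assumes a single pair of alignment columns (one from each protein), integer-encoded symbols, and a two-component mixture in which "interacting" rows follow a joint symbol table and "non-interacting" rows follow independent marginals. The function name `em_reweight` and all of its parameters are hypothetical.

```python
import numpy as np

def em_reweight(a, b, n_symbols, n_iter=20, pseudocount=1.0):
    """Toy EM for one column pair (a from protein A, b from protein B).

    Assumption (not from the source): each row is either 'interacting'
    (symbols drawn from a joint table) or 'non-interacting' (symbols drawn
    independently). Returns per-row interaction weights and the mutual
    information of the weighted joint table (in nats).
    """
    a = np.asarray(a)
    b = np.asarray(b)
    z = np.full(a.shape[0], 0.5)  # initial weight of the 'interacting' class
    for _ in range(n_iter):
        # M-step: joint table estimated from rows weighted by z.
        joint = np.full((n_symbols, n_symbols), pseudocount)
        np.add.at(joint, (a, b), z)
        joint /= joint.sum()
        # M-step: independent marginals estimated from rows weighted by 1 - z.
        pa = np.full(n_symbols, pseudocount)
        pb = np.full(n_symbols, pseudocount)
        np.add.at(pa, a, 1.0 - z)
        np.add.at(pb, b, 1.0 - z)
        pa /= pa.sum()
        pb /= pb.sum()
        pi = z.mean()  # mixing proportion of the 'interacting' class
        # E-step: posterior probability that each row is 'interacting'.
        like_int = pi * joint[a, b]
        like_non = (1.0 - pi) * pa[a] * pb[b]
        z = like_int / (like_int + like_non)
    # Correlated-mutation signal of the weighted alignment.
    ma = joint.sum(axis=1, keepdims=True)
    mb = joint.sum(axis=0, keepdims=True)
    mi = float(np.sum(joint * np.log(joint / (ma * mb))))
    return z, mi
```

Rows whose symbol pair is over-represented relative to the independent model receive higher weights, and those weights in turn sharpen the joint table used in the next round. A realistic method would score all intermolecular column pairs jointly rather than one pair in isolation.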


Protein contact prediction using metagenome sequence data and residual neural networks

Bioinformatics ◽

10.1093/bioinformatics/btz477 ◽

2019 ◽

Vol 36 (1) ◽

pp. 41-48 ◽

Cited By ~ 15

Author(s):

Qi Wu ◽

Zhenling Peng ◽

Ivan Anishchenko ◽

Qian Cong ◽

David Baker ◽

...

Keyword(s):

Neural Networks ◽

Sequence Data ◽

Prediction Method ◽

Supplementary Information ◽

Sequence Profile ◽

Sequence Alignments ◽

Multiple Sequence ◽

Contact Prediction ◽

Benchmark Datasets ◽

Metagenome Sequence

Motivation: Almost all protein residue contact prediction methods rely on the availability of deep multiple sequence alignments (MSAs). However, many proteins from poorly populated families do not have a sufficient number of homologs in the conventional UniProt database. Here we aim to solve this issue by exploring the rich sequence data from metagenome sequencing projects. Results: Based on the improved MSA constructed from the metagenome sequence data, we developed MapPred, a new deep learning-based contact prediction method. MapPred consists of two component methods, DeepMSA and DeepMeta, both trained with residual neural networks. DeepMSA was inspired by the recent method DeepCov, which was trained on 441 matrices of covariance features. By considering the symmetry of the contact map, we reduced the number of matrices to 231, which makes the training more efficient in DeepMSA. Experiments show that DeepMSA outperforms DeepCov by 10–13% in precision. DeepMeta works by combining predicted contacts and other sequence profile features. Experiments on three benchmark datasets suggest that the contribution from the metagenome sequence data is significant, with P-values less than 4.04E-17. MapPred is shown to be complementary and comparable to the state-of-the-art methods. The success of MapPred is attributed to three factors: the deeper MSA from the metagenome sequence data, improved feature design in DeepMSA and optimized training by the residual neural networks. Availability and implementation: http://yanglab.nankai.edu.cn/mappred/. Supplementary information: Supplementary data are available at Bioinformatics online.
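The reduction from 441 to 231 matrices is consistent with counting unordered rather than ordered symbol pairs. As a minimal check, assuming the covariance features are indexed by pairs drawn from a 21-symbol alphabet (20 amino acids plus a gap, an assumption the abstract does not state explicitly), the arithmetic works out as follows:

```python
from itertools import combinations_with_replacement, product

# Assumed alphabet: 20 standard amino acids plus a gap symbol (21 symbols).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def count_ordered_pairs(alphabet: str) -> int:
    """Number of (x, y) pairs when (x, y) and (y, x) are distinct."""
    return sum(1 for _ in product(alphabet, repeat=2))

def count_unordered_pairs(alphabet: str) -> int:
    """Number of pairs when (x, y) and (y, x) are treated as the same."""
    return sum(1 for _ in combinations_with_replacement(alphabet, 2))

print(count_ordered_pairs(ALPHABET), count_unordered_pairs(ALPHABET))
```

With 21 symbols, the ordered count is 21 × 21 = 441 and the unordered count is 21 × 22 / 2 = 231, matching the two figures quoted in the abstract.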


Codon-level information improves predictions of inter-residue contacts in proteins by correlated mutation analysis

eLife ◽

10.7554/elife.08932 ◽

2015 ◽

Vol 4 ◽

Cited By ~ 5

Author(s):

Etai Jacob ◽

Ron Unger ◽

Amnon Horovitz

Keyword(s):

Amino Acid ◽

Amino Acid Level ◽

Sequence Alignments ◽

Multiple Sequence ◽

New Approach ◽

Multiple Sequence Alignments ◽

Residue Contacts ◽

Level Information ◽

Correlated Mutations ◽

Correlated Mutation

Methods for analysing correlated mutations in proteins are becoming an increasingly powerful tool for predicting contacts within and between proteins. Nevertheless, limitations remain due to the requirement for large multiple sequence alignments (MSAs) and the fact that, in general, only a relatively small number of top-ranking predictions are reliable. To date, methods for analysing correlated mutations have relied exclusively on amino acid MSAs as inputs. Here, we describe a new approach for analysing correlated mutations that is based on combined analysis of amino acid and codon MSAs. We show that a direct contact is more likely to be present when the correlation between the positions is strong at the amino acid level but weak at the codon level. The performance of different methods for analysing correlated mutations in predicting contacts is shown to be enhanced significantly when amino acid and codon data are combined.


Faculty Opinions recommendation of Optimal data collection for correlated mutation analysis.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1120110.576250 ◽

2008 ◽

Author(s):

Nathan Baker

Keyword(s):

Data Collection ◽

Mutation Analysis ◽

Correlated Mutation


Faculty Opinions recommendation of Protein contact prediction by integrating deep multiple sequence alignments, coevolution and machine learning.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.732011981.793542976 ◽

2018 ◽

Author(s):

Chandra Verma ◽

Suryani Lukman

Keyword(s):

Machine Learning ◽

Sequence Alignments ◽

Multiple Sequence ◽

Contact Prediction ◽

Multiple Sequence Alignments


Neisseria meningitidis has acquired sequences within the capsule locus by horizontal genetic transfer

Wellcome Open Research ◽

10.12688/wellcomeopenres.15333.1 ◽

2019 ◽

Vol 4 ◽

pp. 99

Author(s):

Marianne E. A. Clemence ◽

Odile B. Harrison ◽

Martin C. J. Maiden

Keyword(s):

Neisseria Meningitidis ◽

Sequence Data ◽

Whole Genome Sequence ◽

Accession Number ◽

Sequence Alignments ◽

En Bloc ◽

Genetic Transfer ◽

Diverse Range ◽

Homologous Sequences ◽

Capsule Locus

Background: Expression of a capsule from one of serogroups A, B, C, W, X or Y is usually required for Neisseria meningitidis (Nme) to cause invasive meningococcal disease. The capsule is encoded by the capsule locus, cps, which is proposed to have been acquired by a formerly capsule null organism by horizontal genetic transfer (HGT) from another species. Following identification of putative capsule genes in non-pathogenic Neisseria species, this hypothesis is re-examined. Methods: Whole genome sequence data from Neisseria species, including Nme genomes from a diverse range of clonal complexes and capsule genogroups, and non-Neisseria species, were obtained from PubMLST and GenBank. Sequence alignments of genes from the meningococcal cps, and predicted orthologues in other species, were analysed using Neighbor-nets, BOOTSCANing and maximum likelihood phylogenies. Results: The meningococcal cps was highly mosaic within regions B, C and D. A subset of sequences within regions B and C were phylogenetically nested within homologous sequences belonging to N. subflava, consistent with an HGT event in which N. subflava was the donor. In the cps of 23/39 isolates, the two copies of region D were highly divergent, with rfbABC’ sequences being more closely related to predicted orthologues in the proposed species N. weixii (GenBank accession number CP023429.1) than the same genes in Nme isolates lacking a capsule. There was also evidence of mosaicism in the rfbABC’ sequences of the remaining 16 isolates, as well as rfbABC from many isolates. Conclusions: Data are consistent with the en bloc acquisition of cps in meningococci from N. subflava, followed by further recombination events with other Neisseria species. Nevertheless, the data cannot refute an alternative model, in which native meningococcal capsule existed prior to undergoing HGT with N. subflava and other species. Within-genus recombination events may have given rise to the diversity of meningococcal capsule serogroups.


Optimal data collection for correlated mutation analysis

Proteins Structure Function and Bioinformatics ◽

10.1002/prot.22168 ◽

2009 ◽

Vol 74 (3) ◽

pp. 545-555 ◽

Cited By ~ 20

Author(s):

Haim Ashkenazy ◽

Ron Unger ◽

Yossef Kliger

Keyword(s):

Data Collection ◽

Mutation Analysis ◽

Correlated Mutation


The visualCMAT: A web-server to select and interpret correlated mutations/co-evolving residues in protein families

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972001840005x ◽

2018 ◽

Vol 16 (02) ◽

pp. 1840005 ◽

Cited By ~ 8

Author(s):

Dmitry Suplatov ◽

Yana Sharapova ◽

Daria Timonina ◽

Kirill Kopylov ◽

Vytas Švedas

Keyword(s):

Rational Design ◽

Visual Analysis ◽

Structural Information ◽

Protein Structures ◽

Web Server ◽

Physical Contact ◽

Protein Families ◽

Sequence Alignments ◽

Homologous Proteins ◽

Correlated Mutations

The visualCMAT web-server was designed to assist experimental research in the fields of protein/enzyme biochemistry, protein engineering, and drug discovery by providing an intuitive and easy-to-use interface to the analysis of correlated mutations/co-evolving residues. Sequence and structural information describing homologous proteins is used to predict correlated substitutions by the mutual information-based CMAT approach; to classify them into spatially close co-evolving pairs, which either form a direct physical contact or interact with the same ligand (e.g. a substrate or a crystallographic water molecule), and long-range correlations; and to annotate and rank binding sites on the protein surface by the presence of statistically significant co-evolving positions. The results of visualCMAT are organized for convenient visual analysis and can be downloaded to a local computer as a content-rich all-in-one PyMol session file with multiple layers of annotation corresponding to bioinformatic, statistical and structural analyses of the predicted co-evolution, or further studied online using the built-in interactive analysis tools. The online interactivity is implemented in HTML5 and therefore neither plugins nor Java are required. The visualCMAT web-server is integrated with the Mustguseal web-server capable of constructing large structure-guided sequence alignments of protein families and superfamilies using all available information about their structures and sequences in public databases. The visualCMAT web-server can be used to understand the relationship between structure and function in proteins, to select hotspots and compensatory mutations for rational design and directed evolution experiments to produce novel enzymes with improved properties, and to study the mechanism of selective ligand binding and allosteric communication between topologically independent sites in protein structures. The web-server is freely available at https://biokinet.belozersky.msu.ru/visualcmat and there are no login requirements.
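For readers unfamiliar with mutual information as a co-evolution score, the following is a minimal sketch of the plain plug-in estimate for two alignment columns. It is a generic illustration, not the CMAT statistic, whose corrections and significance testing are not described in this abstract. The function name `column_mutual_information` is hypothetical.

```python
from collections import Counter
from math import log2

def column_mutual_information(col_x, col_y):
    """Plug-in mutual information (in bits) between two alignment columns.

    col_x and col_y hold the residues observed at two positions, one entry
    per sequence, in the same sequence order.
    """
    if len(col_x) != len(col_y) or len(col_x) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    # Sum p(x, y) * log2(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )
```

Two columns that always change together, such as `"AAKK"` and `"DDEE"`, score 1 bit, while columns that vary independently, such as `"AAKK"` and `"DEDE"`, score 0. Raw mutual information is sensitive to alignment depth and phylogenetic bias, which is why practical tools apply further corrections.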


Mitochondrial Genome Mutation Analysis: Indonesian Human mtG Comparation and Several GenBank Sequence Data on Gene Control and Encoding Regions

Journal of Data Mining in Genomics & Proteomics ◽

10.4172/2153-0602.1000215 ◽

2018 ◽

Vol 09 (01) ◽

Author(s):

Ngili Y ◽

Siallagan J ◽

Tanjung RHR ◽

Palit EIY

Keyword(s):

Mitochondrial Genome ◽

Mutation Analysis ◽

Sequence Data ◽

Gene Control


VisFeature: a stand-alone program for visualizing and analyzing statistical features of biological sequences

Bioinformatics ◽

10.1093/bioinformatics/btz689 ◽

2019 ◽

Cited By ~ 3

Author(s):

Jun Wang ◽

Pu-Feng Du ◽

Xin-Yu Xue ◽

Guang-Ping Li ◽

Yuan-Ke Zhou ◽

...

Keyword(s):

Sequence Data ◽

Software Tool ◽

Data Retrieval ◽

Supplementary Information ◽

Statistical Features ◽

Biological Sequence ◽

Sequence Alignments ◽

Multiple Sequence ◽

Source Codes ◽

Multiple Sequence Alignments

Summary: Many efforts have been made in developing bioinformatics algorithms to predict functional attributes of genes and proteins from their primary sequences. One challenge in this process is to intuitively analyze and understand the statistical features that have been selected by heuristic or iterative methods. In this paper, we developed VisFeature, which aims to be a helpful software tool that allows users to intuitively visualize and analyze statistical features of all types of biological sequence, including DNA, RNA and proteins. VisFeature also integrates sequence data retrieval, multiple sequence alignments and statistical feature generation functions. Availability and implementation: VisFeature is a desktop application that is implemented using JavaScript/Electron and R. The source code of VisFeature is freely accessible from the GitHub repository (https://github.com/wangjun1996/VisFeature). The binary release, which includes an example dataset, can be freely downloaded from the same GitHub repository (https://github.com/wangjun1996/VisFeature/releases). Supplementary information: Supplementary data are available at Bioinformatics online.


Prediction of mutation effects using a deep temporal convolutional network

Bioinformatics ◽

10.1093/bioinformatics/btz873 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2047-2052 ◽

Cited By ~ 1

Author(s):

Ha Young Kim ◽

Dongsup Kim

Keyword(s):

Latent Variable ◽

Sequence Data ◽

Generative Model ◽

Supplementary Information ◽

Biological Research ◽

Sequence Alignments ◽

Variable Model ◽

Convolutional Network ◽

Direct Optimization ◽

Multiple Sequence

Motivation: Accurate prediction of the effects of genetic variation is a major goal in biological research. Towards this goal, numerous machine learning models have been developed to learn information from evolutionary sequence data. The most effective method so far is a deep generative model based on the variational autoencoder (VAE) that models the distributions using a latent variable. In this study, we propose a deep autoregressive generative model named mutationTCN, which employs dilated causal convolutions and an attention mechanism for the modeling of inter-residue correlations in a biological sequence. Results: We show that this model is competitive with the VAE model when tested against a set of 42 high-throughput mutation scan experiments, with a mean improvement in Spearman rank correlation of ∼0.023. In particular, our model can more efficiently capture information from multiple sequence alignments with a lower effective number of sequences, such as in viral sequence families, compared with the latent variable model. Also, we extend this architecture to a semi-supervised learning framework, which shows high prediction accuracy. We show that our model enables a direct optimization of the data likelihood and allows for a simple and stable training process. Availability and implementation: Source code is available at https://github.com/ha01994/mutationTCN. Supplementary information: Supplementary data are available at Bioinformatics online.
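The dilated causal convolution mentioned in this abstract can be sketched in a few lines of NumPy. This is a single-channel illustration of the general operation only, with no nonlinearity, attention or training. It is not the mutationTCN architecture, and the function name `dilated_causal_conv1d` is hypothetical.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation=1):
    """Compute y[t] = sum_k w[k] * x[t - k * dilation], with zero left-padding.

    'Causal' means y[t] depends only on x[t] and earlier positions;
    'dilated' means the filter taps are spaced `dilation` steps apart.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    pad = (len(w) - 1) * dilation
    # Pad on the left only, so no output position can see the future.
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k, wk in enumerate(w):
        # Tap k looks k * dilation steps into the past.
        start = pad - k * dilation
        y += wk * xp[start:start + len(x)]
    return y
```

For example, with `x = [1, 2, 3, 4, 5]`, `w = [1, 1]` and `dilation=2`, each output is `x[t] + x[t-2]`, giving `[1, 2, 4, 6, 8]`. Stacking such layers with growing dilation widens the receptive field quickly while preserving the left-to-right ordering that an autoregressive model requires.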
