Protein structure without structure determination: direct coupling analysis based on in vitro evolution

Mapping Intimacies ◽

10.1101/582056 ◽

2019 ◽

Author(s):

Marco Fantini ◽

Simonetta Lisi ◽

Paolo De Los Rios ◽

Antonino Cattaneo ◽

Annalisa Pastore

Keyword(s):

Structural Information ◽

Sequence Data ◽

Protein Structures ◽

Direct Coupling ◽

In Vitro Mutagenesis ◽

Evolutionary Analysis ◽

Protein Families ◽

Coupling Analysis ◽

Direct Coupling Analysis

Abstract: Direct Coupling Analysis (DCA) is a powerful technique that extracts structural information about proteins belonging to large protein families exclusively by in silico analysis. This method is, however, limited by sequence availability and various biases. Here, we propose a method that exploits molecular evolution to circumvent these limitations: instead of relying on existing protein families, we used in vitro mutagenesis of TEM-1 beta-lactamase combined with in vivo functional selection to generate the sequence data necessary for evolutionary analysis. With this strategy, which we call CAMELS (Coupling Analysis by Molecular Evolution Library Sequencing), we could reconstruct the lactamase fold exclusively from sequence data. By generating and sequencing large libraries of variants, we can deal with any protein, ancient or recent, from any species, the only constraint being the need to set up a functional phenotypic selection for the protein. This method allows us to obtain protein structures without solving the structure experimentally.


Accurate contact-based modelling of repeat proteins predicts the structure of new repeats protein families

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008798 ◽

2021 ◽

Vol 17 (4) ◽

pp. e1008798

Author(s):

Claudio Bassot ◽

Arne Elofsson

Keyword(s):

Deep Learning ◽

Protein Structure ◽

High Accuracy ◽

Unique Sequence ◽

Direct Coupling ◽

Protein Families ◽

Coupling Analysis ◽

Repeat Proteins ◽

Eukaryotic Proteomes ◽

Direct Coupling Analysis

Repeat proteins are abundant in eukaryotic proteomes. They are involved in many eukaryote-specific functions, including signalling. For many of these proteins, the structure is not known, as they are difficult to crystallise. Today, using direct coupling analysis and deep learning, it is often possible to predict a protein’s structure. However, the unique sequence features present in repeat proteins have made it challenging to use direct coupling analysis for predicting contacts. Here, we show that deep learning-based methods (trRosetta, DeepMetaPsicov (DMP) and PconsC4) overcome this problem and can predict intra- and inter-unit contacts in repeat proteins. In a benchmark dataset of 815 repeat proteins, about 90% can be correctly modelled. Further, among 48 PFAM families lacking a protein structure, we produce models of 41 families with estimated high accuracy.


FilterDCA: interpretable supervised contact prediction using inter-domain coevolution

10.1101/2019.12.24.887877 ◽

2019 ◽

Cited By ~ 1

Author(s):

Maureen Muscat ◽

Giancarlo Croce ◽

Edoardo Sarti ◽

Martin Weigt

Keyword(s):

Deep Learning ◽

De Novo ◽

Protein Complexes ◽

Protein Structures ◽

Direct Coupling ◽

Sequence Information ◽

Coupling Analysis ◽

Contact Patterns ◽

Direct Coupling Analysis ◽

Training Sets

Abstract: Predicting three-dimensional protein structure and assembling protein complexes using sequence information are among the most prominent tasks in computational biology. Recently, substantial progress has been made in the case of single proteins using a combination of unsupervised coevolutionary sequence analysis with structurally supervised deep learning. While reaching impressive accuracies in predicting residue-residue contacts, deep learning has a number of disadvantages. The need for large structural training sets limits the applicability to multi-protein complexes, and the deep architecture of convolutional neural networks makes them intrinsically hard to interpret. Here we introduce FilterDCA, a simpler supervised predictor for inter-domain and inter-protein contacts. It is based on the fact that contact maps of proteins show typical contact patterns, which result from secondary structure and are reflected by patterns in coevolutionary analysis. We explicitly integrate averaged contact patterns with coevolutionary scores derived by Direct Coupling Analysis, reaching results comparable to more complex deep-learning approaches, while remaining fully transparent and interpretable. The FilterDCA code is available at http://gitlab.lcqb.upmc.fr/muscat/FilterDCA.

Author summary: The de novo prediction of tertiary and quaternary protein structures has recently seen important advances, by combining unsupervised, purely sequence-based coevolutionary analyses with structure-based supervision using deep learning for contact-map prediction. While showing impressive performance, deep-learning methods require large training sets and pose severe obstacles to interpretability. Here we construct a simple, transparent and therefore fully interpretable inter-domain contact predictor, which uses the results of coevolutionary Direct Coupling Analysis in combination with explicitly constructed filters reflecting typical contact patterns in a training set of known protein structures, and which improves the accuracy of predicted contacts significantly. Our approach thereby sheds light on the question of how contact information is encoded in coevolutionary signals.
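The idea of matching coevolutionary scores against typical contact patterns can be illustrated with a small sketch. This is not the FilterDCA implementation: the function name `filter_response`, the square `pattern` argument and the zero-padding at the borders are assumptions made for illustration, and the abstract does not specify how the real filters are built or combined with the DCA scores.

```python
import numpy as np

def filter_response(scores, pattern):
    """Correlate a matrix of DCA scores with a small square pattern filter.

    scores  : 2D array of coevolutionary scores (rows and columns are residues)
    pattern : (k, k) array with odd k, e.g. an averaged contact pattern
              (hypothetical input; how it is derived is not shown here)
    Returns an array of the same shape as `scores`, where entry (i, j) is the
    sum of the element-wise product of the pattern with the window centred
    on (i, j). Borders are zero-padded (an assumption of this sketch).
    """
    k = pattern.shape[0]
    if pattern.shape != (k, k) or k % 2 == 0:
        raise ValueError("pattern must be square with odd side length")
    half = k // 2
    padded = np.pad(scores, half, mode="constant")
    out = np.zeros(scores.shape, dtype=float)
    n_rows, n_cols = scores.shape
    for i in range(n_rows):
        for j in range(n_cols):
            window = padded[i:i + k, j:j + k]
            out[i, j] = np.sum(window * pattern)
    return out
```

A diagonal stripe of high scores responds more strongly to a diagonal filter than to an anti-diagonal one, which is the kind of pattern matching the abstract describes in general terms.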


Expanding Direct Coupling Analysis to Identify Heterodimeric Interfaces from Limited Protein Sequence Data

The Journal of Physical Chemistry B ◽

10.1021/acs.jpcb.1c07145 ◽

2021 ◽

Author(s):

Kareem M. Mehrabiani ◽

Ryan R. Cheng ◽

José N. Onuchic

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Direct Coupling ◽

Coupling Analysis ◽

Protein Sequence Data ◽

Direct Coupling Analysis ◽

Limited Protein


pydca v1.0: a comprehensive software for direct coupling analysis of RNA and protein sequences

Bioinformatics ◽

10.1093/bioinformatics/btz892 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2264-2265 ◽

Cited By ~ 3

Author(s):

Mehari B Zerihun ◽

Fabrizio Pucci ◽

Emanuel K Peter ◽

Alexander Schug

Keyword(s):

Structure Prediction ◽

Sequence Data ◽

Direct Coupling ◽

Supplementary Information ◽

Spatial Proximity ◽

Homologous Proteins ◽

Coupling Analysis ◽

Multiple Sequence ◽

Wide Range ◽

Direct Coupling Analysis

Abstract

Motivation: The ongoing advances in sequencing technologies have provided a massive increase in the availability of sequence data. This made it possible to study the patterns of correlated substitution between residues in families of homologous proteins or RNAs and to retrieve structural and stability information. Direct coupling analysis (DCA) infers coevolutionary couplings between pairs of residues indicating their spatial proximity, making such information a valuable input for subsequent structure prediction.

Results: Here, we present pydca, a standalone Python-based software package for the DCA of protein- and RNA-homologous families. It is based on two popular inverse statistical approaches, namely mean-field approximation and pseudo-likelihood maximization, and is equipped with a series of functionalities that range from multiple sequence alignment trimming to contact map visualization. Thanks to its efficient implementation, features and user-friendly command line interface, pydca is a modular and easy-to-use tool that can be used by researchers with a wide range of backgrounds.

Availability and implementation: pydca can be obtained from https://github.com/KIT-MBS/pydca or from the Python Package Index under the MIT License.

Supplementary information: Supplementary data are available at Bioinformatics online.
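To make the mean-field approach mentioned in the abstract more concrete, here is a minimal NumPy sketch. It does not use or reproduce the pydca API; the function names, the integer-encoded alignment input and the uniform pseudocount of 0.5 are assumptions. It also simplifies in two ways that should be stated plainly: sequences are not reweighted by similarity, and pairs are scored with the Frobenius norm of the coupling block rather than the direct information score.

```python
import numpy as np

def one_hot(msa, q):
    """Encode an integer alignment of shape (N, L) as an (N, L*q) matrix."""
    n, length = msa.shape
    x = np.zeros((n, length, q))
    x[np.arange(n)[:, None], np.arange(length)[None, :], msa] = 1.0
    return x.reshape(n, length * q)

def mean_field_dca(msa, q, pseudocount=0.5):
    """Return an (L, L) matrix of coupling strengths from an integer MSA.

    msa : (N, L) integer array with states in 0..q-1 (assumed input format)
    """
    n, length = msa.shape
    x = one_hot(msa, q)
    # Single-site and pair frequencies, mixed with a uniform pseudocount.
    fi = (1 - pseudocount) * x.mean(axis=0) + pseudocount / q
    fij = (1 - pseudocount) * (x.T @ x) / n + pseudocount / q**2
    # Within a site, two different states never co-occur.
    fij = fij.reshape(length, q, length, q)
    fi_sites = fi.reshape(length, q)
    for i in range(length):
        fij[i, :, i, :] = np.diag(fi_sites[i])
    fij = fij.reshape(length * q, length * q)
    # Connected correlation (covariance) matrix.
    c = fij - np.outer(fi, fi)
    # Drop the last state of every site so the matrix becomes invertible.
    keep = np.array([i * q + a for i in range(length) for a in range(q - 1)])
    c_reduced = c[np.ix_(keep, keep)]
    # Mean-field approximation: couplings are minus the inverse covariance.
    couplings = -np.linalg.inv(c_reduced)
    couplings = couplings.reshape(length, q - 1, length, q - 1)
    # Score each pair of sites by the Frobenius norm of its coupling block.
    scores = np.sqrt((couplings ** 2).sum(axis=(1, 3)))
    np.fill_diagonal(scores, 0.0)
    return scores
```

On a toy alignment in which two columns vary together and the rest are independent, the coupled pair should receive the highest score. Real alignments would also need sequence reweighting and a correction for background effects.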


Faculty Opinions recommendation of Direct-coupling analysis of residue coevolution captures native contacts across many protein families.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.14083956.15601056 ◽

2012 ◽

Author(s):

Nikolay Dokholyan ◽

Srinivas Ramachandran

Keyword(s):

Direct Coupling ◽

Protein Families ◽

Coupling Analysis ◽

Direct Coupling Analysis


Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1615068114 ◽

2017 ◽

Vol 114 (13) ◽

pp. E2662-E2671 ◽

Cited By ~ 58

Author(s):

Guido Uguzzoni ◽

Shalini John Lovis ◽

Francesco Oteri ◽

Alexander Schug ◽

Hendrik Szurmant ◽

...

Keyword(s):

Structure Prediction ◽

Large Scale ◽

Structural Models ◽

Direct Coupling ◽

Pfam Family ◽

Sequence Information ◽

Structural Description ◽

Protein Families ◽

Coupling Analysis ◽

Direct Coupling Analysis

Proteins have evolved to perform diverse cellular functions, from serving as reaction catalysts to coordinating cellular propagation and development. Frequently, proteins do not exert their full potential as monomers but rather undergo concerted interactions as either homo-oligomers or with other proteins as hetero-oligomers. The experimental study of such protein complexes and interactions has been arduous. Theoretical structure prediction methods are an attractive alternative. Here, we investigate homo-oligomeric interfaces by tracing residue coevolution via the global statistical direct coupling analysis (DCA). DCA can accurately infer spatial adjacencies between residues. These adjacencies can be included as constraints in structure prediction techniques to predict high-resolution models. By taking advantage of the ongoing exponential growth of sequence databases, we go significantly beyond anecdotal cases of a few protein families and apply DCA to a systematic large-scale study of nearly 2,000 Pfam protein families with sufficient sequence information and structurally resolved homo-oligomeric interfaces. We find that large interfaces are commonly identified by DCA. We further demonstrate that DCA can differentiate between subfamilies with different binding modes within one large Pfam family. Sequence-derived contact information for the subfamilies proves sufficient to assemble accurate structural models of the diverse protein-oligomers. Thus, we provide an approach to investigate oligomerization for arbitrary protein families leading to structural models complementary to often-difficult experimental methods. Combined with ever more abundant sequential data, we anticipate that this study will be instrumental to allow the structural description of many heteroprotein complexes in the future.


pydca v1.0: a comprehensive software for Direct Coupling Analysis of RNA and Protein Sequences

10.1101/805523 ◽

2019 ◽

Cited By ~ 1

Author(s):

Mehari B. Zerihun ◽

Fabrizio Pucci ◽

Emanuel Karl Peter ◽

Alexander Schug

Keyword(s):

Structure Prediction ◽

Sequence Data ◽

Mean Field ◽

Direct Coupling ◽

Spatial Proximity ◽

Homologous Proteins ◽

Coupling Analysis ◽

Multiple Sequence ◽

Wide Range ◽

Direct Coupling Analysis

Abstract: The ongoing advances in sequencing technologies have provided a massive increase in the availability of sequence data. This made it possible to study the patterns of correlated substitution between residues in families of homologous proteins or RNAs and to retrieve structural and stability information. Direct coupling analysis (DCA) infers coevolutionary couplings between pairs of residues indicating their spatial proximity, making such information a valuable input for subsequent structure prediction. Here we present pydca, a standalone Python-based software package for the DCA of protein- and RNA-homologous families. It is based on two popular inverse statistical approaches, namely mean-field approximation and pseudo-likelihood maximization, and is equipped with a series of functionalities that range from multiple sequence alignment trimming to contact map visualization. Thanks to its efficient implementation, features and user-friendly command line interface, pydca is a modular and easy-to-use tool that can be used by researchers with a wide range of backgrounds.

Availability: https://github.com/KIT-MBS/pydca


Direct-coupling analysis of residue coevolution captures native contacts across many protein families

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1111471108 ◽

2011 ◽

Vol 108 (49) ◽

pp. E1293-E1301 ◽

Cited By ~ 720

Author(s):

F. Morcos ◽

A. Pagnani ◽

B. Lunt ◽

A. Bertolino ◽

D. S. Marks ◽

...

Keyword(s):

Direct Coupling ◽

Protein Families ◽

Coupling Analysis ◽

Direct Coupling Analysis


On the use of direct-coupling analysis with a reduced alphabet of amino acids combined with super-secondary structure motifs for protein fold prediction

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab027 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Bernat Anton ◽

Mireia Besalú ◽

Oriol Fornes ◽

Jaume Bonet ◽

Alexis Molina ◽

...

Keyword(s):

Amino Acids ◽

Protein Structure ◽

Secondary Structure ◽

Protein Structures ◽

Three Dimensional ◽

Direct Coupling ◽

Dimensional Structure ◽

Coupling Analysis ◽

Multiple Sequence ◽

Direct Coupling Analysis

Abstract: Direct-coupling analysis (DCA) for studying the coevolution of residues in proteins has been widely used to predict the three-dimensional structure of a protein from its sequence. We present RADI/raDIMod, a variation of the original DCA algorithm that groups chemically equivalent residues and is combined with super-secondary structure motifs to model protein structures. Interestingly, the simplification produced by grouping amino acids into only two groups (polar and non-polar) is still representative of the physicochemical nature that characterizes the protein structure, and it is in line with the role of hydrophobic forces in protein-folding funneling. As a result of the compressed alphabet, the number of sequences required for the multiple sequence alignment is reduced. The number of long-range contacts predicted is limited; therefore, our approach requires the use of neighboring sequence positions. We use the prediction of secondary structure and motifs of super-secondary structures to predict local contacts. Using RADI and raDIMod, a fragment-based protein structure modelling method, we achieve near-native conformations when the number of super-secondary motifs covers >30–50% of the sequence. Interestingly, although different contacts are predicted with different alphabets, they produce similar structures.
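The two-group alphabet described in the abstract can be sketched as a simple recoding of an alignment before any coupling analysis. The exact assignment of residues to the polar and non-polar groups below is an assumption for illustration (borderline residues such as G, C and Y could reasonably be placed differently), and the names `reduce_sequence` and `reduce_msa` are hypothetical, not taken from RADI.

```python
# Assumed grouping; the abstract states only that there are two groups,
# polar and non-polar, not which residue belongs to which.
NONPOLAR = set("AVLIMFWP")
POLAR = set("RNDQEHKSTYCG")

def reduce_sequence(seq, gap="-"):
    """Recode a protein sequence into H (non-polar), P (polar) or gap.

    Any character outside the 20 standard residues is treated as a gap.
    """
    out = []
    for aa in seq.upper():
        if aa in NONPOLAR:
            out.append("H")
        elif aa in POLAR:
            out.append("P")
        else:
            out.append(gap)
    return "".join(out)

def reduce_msa(msa):
    """Apply the reduced alphabet to every aligned sequence in a list."""
    return [reduce_sequence(seq) for seq in msa]
```

With only two residue states plus a gap, a pairwise coupling block shrinks from 21 x 21 to 3 x 3 entries, which is consistent with the abstract's remark that fewer sequences are needed in the alignment.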


Simultaneous identification of specifically interacting paralogs and interprotein contacts by direct coupling analysis

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1607570113 ◽

2016 ◽

Vol 113 (43) ◽

pp. 12186-12191 ◽

Cited By ~ 71

Author(s):

Thomas Gueudré ◽

Carlo Baldassi ◽

Marco Zamparo ◽

Martin Weigt ◽

Andrea Pagnani

Keyword(s):

Protein Interactions ◽

Multiple Scales ◽

Sequence Data ◽

Direct Coupling ◽

Physical Contact ◽

Protein Protein Interactions ◽

Homologous Proteins ◽

Coupling Analysis ◽

Large Joint ◽

Direct Coupling Analysis

Understanding protein−protein interactions is central to our understanding of almost all complex biological processes. Computational tools exploiting rapidly growing genomic databases to characterize protein−protein interactions are urgently needed. Such methods should connect multiple scales, from evolutionarily conserved interactions between families of homologous proteins, over the identification of specifically interacting proteins in the case of multiple paralogs inside a species, down to the prediction of residues being in physical contact across interaction interfaces. Statistical inference methods detecting residue−residue coevolution have recently triggered considerable progress in using sequence data for quaternary protein structure prediction; they require, however, large joint alignments of homologous protein pairs known to interact. The generation of such alignments is a complex computational task on its own; application of coevolutionary modeling has, in turn, been restricted to proteins without paralogs, or to bacterial systems with the corresponding coding genes being colocalized in operons. Here we show that the direct coupling analysis of residue coevolution can be extended to connect the different scales, and simultaneously to match interacting paralogs, to identify interprotein residue−residue contacts and to discriminate interacting from noninteracting families in a multiprotein system. Our results extend the potential applications of coevolutionary analysis far beyond cases treatable so far.
