A general-purpose protein design framework based on mining sequence-structure relationships in known protein structures

Mapping Intimacies ◽

10.1101/431635 ◽

2018 ◽

Cited By ~ 2

Author(s):

Jianfu Zhou ◽

Alexandra E. Panaitiu ◽

Gevorg Grigoryan

Keyword(s):

Protein Design ◽

Structure Prediction ◽

Fluorescent Protein ◽

Protein Structures ◽

Building Blocks ◽

Data Bank ◽

General Purpose ◽

Design Framework ◽

Target Structure ◽

Sequence Structure

The ability to routinely design functional proteins, in a targeted manner, would have enormous implications for biomedical research and therapeutic development. Computational protein design (CPD) offers the potential to fulfill this need, and though recent years have brought considerable progress in the field, major limitations remain. Current state-of-the-art approaches to CPD aim to capture the determinants of structure from physical principles. While this has led to many successful designs, it does have strong limitations associated with inaccuracies in physical modeling, such that a robust general solution to CPD has yet to be found. Here we propose a fundamentally novel design framework, one based on identifying and applying patterns of sequence-structure compatibility found in known proteins, rather than approximating them from models of inter-atomic interactions. Specifically, we systematically decompose the target structure to be designed into structural building blocks we call TERMs (tertiary motifs) and use rapid structure search against the Protein Data Bank (PDB) to identify sequence patterns associated with each TERM from known protein structures that contain it. These results are then combined to produce a sequence-level pseudo-energy model that can score any sequence for compatibility with the target structure. This model can then be used to extract the optimal-scoring sequence via combinatorial optimization or otherwise sample the sequence space predicted to be well compatible with folding to the target.
Here we carry out extensive computational analyses, showing that our method, which we dub dTERMen (design with TERM energies): 1) produces native-like sequences given native crystallographic or NMR backbones, 2) produces sequence-structure compatibility scores that correlate with thermodynamic stability, 3) is able to predict experimental success of designed sequences generated with other methods, and 4) designs sequences that are found to fold to the desired target by structure prediction more frequently than sequences designed with an atomistic method. As an experimental validation of dTERMen, we perform a total surface redesign of Red Fluorescent Protein mCherry, marking a total of 64 residues as variable. The single sequence identified as optimal by dTERMen harbors 48 mutations relative to mCherry, but nevertheless folds, is monomeric in solution, exhibits stability to chemical denaturation similar to that of mCherry, and even preserves the fluorescence property. Our results strongly argue that the PDB is now sufficiently large to enable proteins to be designed by using only examples of structural motifs from unrelated proteins. This is highly significant, given that the structural database will only continue to grow, and signals the possibility of a whole host of novel data-driven CPD methods. Because such methods are likely to have orthogonal strengths relative to existing techniques, they could represent an important step towards removing remaining barriers to robust CPD.
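The idea of a sequence-level pseudo-energy model that scores any sequence and is then searched for the optimum can be illustrated with a toy sketch. It assumes, purely for illustration, that the model splits into per-position and pairwise terms; the abstract does not state the functional form. All names and numbers below are hypothetical, and exhaustive enumeration stands in for a real combinatorial optimizer.

```python
from itertools import product

# Hypothetical toy model: 3 design positions, 2 allowed amino acids each.
ALPHABET = "AL"

# self_energy[i][aa]: made-up per-position pseudo-energies (lower is better).
self_energy = [
    {"A": 0.0, "L": -1.0},
    {"A": -0.5, "L": 0.2},
    {"A": 0.3, "L": -0.4},
]

# pair_energy[(i, j)][(aa_i, aa_j)]: made-up couplings; absent pairs count as 0.
pair_energy = {
    (0, 2): {("L", "L"): -0.8, ("A", "L"): 0.1},
}


def score(seq):
    """Sum the per-position terms and any pairwise terms for a sequence."""
    total = sum(self_energy[i][aa] for i, aa in enumerate(seq))
    for (i, j), table in pair_energy.items():
        total += table.get((seq[i], seq[j]), 0.0)
    return total


def best_sequence():
    """Enumerate every sequence and return the lowest-scoring one."""
    candidates = ("".join(p) for p in product(ALPHABET, repeat=len(self_energy)))
    return min(candidates, key=score)


print(best_sequence(), round(score(best_sequence()), 3))
```

With these made-up tables the lowest-scoring sequence is "LAL". Enumeration grows exponentially with the number of variable positions, so it is only usable for toy problems like this one.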


A general-purpose protein design framework based on mining sequence–structure relationships in known protein structures

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1908723117 ◽

2019 ◽

Vol 117 (2) ◽

pp. 1059-1068 ◽

Cited By ~ 9

Author(s):

Jianfu Zhou ◽

Alexandra E. Panaitiu ◽

Gevorg Grigoryan

Keyword(s):

Protein Design ◽

Experimental Validation ◽

Protein Structures ◽

Data Bank ◽

General Purpose ◽

Design Framework ◽

Sequence Structure ◽

Current State ◽

Computational Analyses ◽

Physical Principles

Current state-of-the-art approaches to computational protein design (CPD) aim to capture the determinants of structure from physical principles. While this has led to many successful designs, it does have strong limitations associated with inaccuracies in physical modeling, such that a reliable general solution to CPD has yet to be found. Here, we propose a design framework—one based on identifying and applying patterns of sequence–structure compatibility found in known proteins, rather than approximating them from models of interatomic interactions. We carry out extensive computational analyses and an experimental validation for our method. Our results strongly argue that the Protein Data Bank is now sufficiently large to enable proteins to be designed by using only examples of structural motifs from unrelated proteins. Because our method is likely to have orthogonal strengths relative to existing techniques, it could represent an important step toward removing remaining barriers to robust CPD.


Universal Architectural Concepts Underlying Protein Folding Patterns

Frontiers in Molecular Biosciences ◽

10.3389/fmolb.2020.612920 ◽

2021 ◽

Vol 7 ◽

Author(s):

Arun S. Konagurthu ◽

Ramanan Subramanian ◽

Lloyd Allison ◽

David Abramson ◽

Peter J. Stuckey ◽

...

Keyword(s):

Structure Prediction ◽

Protein Structures ◽

Data Bank ◽

Rational Drug Design ◽

Basis Set ◽

Sequence Structure ◽

Information Theoretic ◽

Protein Architecture ◽

Structure Correlations ◽

Rational Drug

What is the architectural “basis set” of the observed universe of protein structures? Using information-theoretic inference, we answer this question with a dictionary of 1,493 substructures—called concepts—typically at a subdomain level, based on an unbiased subset of known protein structures. Each concept represents a topologically conserved assembly of helices and strands that make contact. Any protein structure can be dissected into instances of concepts from this dictionary. We dissected the Protein Data Bank and completely inventoried all the concept instances. This yields many insights, including correlations between concepts and catalytic activities or binding sites, useful for rational drug design; local amino-acid sequence–structure correlations, useful for ab initio structure prediction methods; and information supporting the recognition and exploration of evolutionary relationships, useful for structural studies. An interactive site, Proçodic, at http://lcb.infotech.monash.edu.au/prosodic (click), provides access to and navigation of the entire dictionary of concepts and their usages, and all associated information. This report is part of a continuing programme with the goal of elucidating fundamental principles of protein architecture, in the spirit of the work of Cyrus Chothia.


AlphaDesign: A de novo protein design framework based on AlphaFold

10.1101/2021.10.11.463937 ◽

2021 ◽

Author(s):

Michael Jendrusch ◽

Jan O. Korbel ◽

S. Kashif Sadiq

Keyword(s):

Protein Design ◽

Structure Prediction ◽

Structural Integrity ◽

De Novo ◽

Protein Complexes ◽

Protein Structures ◽

Higher Order ◽

Design Framework ◽

Oligomeric State ◽

Dynamics Simulations

De novo protein design is a longstanding fundamental goal of synthetic biology, but has been hindered by the difficulty of reliably predicting accurate high-resolution protein structures from sequence. Recent advances in the accuracy of protein structure prediction methods, such as AlphaFold (AF), have facilitated proteome-scale structural predictions of monomeric proteins. Here we develop AlphaDesign, a computational framework for de novo protein design that embeds AF as an oracle within an optimisable design process. Our framework enables rapid prediction of completely novel protein monomers starting from random sequences. These are shown to adopt a diverse array of folds within the known protein space. A recent and unexpected utility of AF to predict the structure of protein complexes further allows our framework to design higher-order complexes. Subsequently, a range of predictions is made for monomers, homodimers, heterodimers, and higher-order homo-oligomers, from trimers to hexamers. Our analyses also show potential for designing proteins that bind to a pre-specified target protein. Structural integrity of predicted structures is validated and confirmed by standard ab initio folding and structural analysis methods as well as more extensively by performing rigorous all-atom molecular dynamics simulations and analysing the corresponding structural flexibility, intramonomer and interfacial amino-acid contacts. These analyses demonstrate widespread maintenance of structural integrity and suggest that our framework allows for fairly accurate protein design. Strikingly, our approach also reveals the capacity of AF to predict proteins that switch conformation upon complex formation, such as switches from α-helices to β-sheets during amyloid filament formation. Correspondingly, when integrated into our design framework, our approach reveals de novo design of a subset of proteins that switch conformation between monomeric and oligomeric states.
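The pattern of embedding a predictor as an oracle inside an optimisation loop that starts from a random sequence can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_oracle` is a hypothetical stand-in that does not call AlphaFold, and greedy single-residue hill climbing is one possible search strategy chosen for brevity. The abstract does not specify the optimiser used.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_oracle(seq):
    """Hypothetical stand-in for a structure-prediction confidence score.

    Returns the fraction of residues in an arbitrary made-up set; a real
    pipeline would run a structure predictor here instead.
    """
    return sum(1 for aa in seq if aa in "AELK") / len(seq)


def optimise(length=20, steps=500, seed=0, oracle=toy_oracle):
    """Greedy hill climbing: mutate one position, keep it if not worse."""
    rng = random.Random(seed)
    seq = [rng.choice(ALPHABET) for _ in range(length)]
    start_score = best = oracle(seq)
    for _ in range(steps):
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(ALPHABET)
        new = oracle(seq)
        if new >= best:
            best = new          # accept the mutation
        else:
            seq[pos] = old      # revert the mutation
    return "".join(seq), start_score, best


designed, before, after = optimise()
print(designed, before, after)
```

Because the oracle is passed in as a parameter, a different scoring function can be swapped in without touching the search loop.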


Universal architectural concepts underlying protein folding patterns

10.1101/480194 ◽

2018 ◽

Author(s):

Arthur M. Lesk ◽

Ramanan Subramanian ◽

Lloyd Allison ◽

David Abramson ◽

Peter J. Stuckey ◽

...

Keyword(s):

Structure Prediction ◽

Protein Structures ◽

Data Bank ◽

Rational Drug Design ◽

Basis Set ◽

Sequence Structure ◽

Information Theoretic ◽

Catalytic Activities ◽

Structure Correlations ◽

Rational Drug

What is the architectural ‘basis set’ of the observed universe of protein structures? Using information-theoretic inference, we answer this question with a comprehensive dictionary of 1,493 substructural concepts. Each concept represents a topologically-conserved assembly of helices and strands that make contact. Any protein structure can be dissected into instances of concepts from this dictionary. We dissected the world-wide protein data bank and completely inventoried all concept instances. This yields an unprecedented source of biological insights. These include: correlations between concepts and catalytic activities or binding sites, useful for rational drug design; local amino-acid sequence–structure correlations, useful for ab initio structure prediction methods; and information supporting the recognition and exploration of evolutionary relationships, useful for structural studies. An interactive site, Proçodic, at http://lcb.infotech.monash.edu.au/prosodic (click) provides access to and navigation of the entire dictionary of concepts, and all associated information.


Expanding our knowledge of the protein universe: Modelling of protein structures

Acta Crystallographica Section A Foundations and Advances ◽

10.1107/s2053273314095084 ◽

2014 ◽

Vol 70 (a1) ◽

pp. C491-C491

Author(s):

Jürgen Haas ◽

Alessandro Barbato ◽

Tobias Schmidt ◽

Steven Roth ◽

Andrew Waterhouse ◽

...

Keyword(s):

Computational Modeling ◽

Structure Prediction ◽

Structural Information ◽

Protein Structures ◽

Model Organism ◽

Data Bank ◽

Continuous Model ◽

Structure Modeling ◽

Structure Comparison ◽

Modeling And Prediction

Computational modeling and prediction of three-dimensional macromolecular structures and complexes from their sequence has been a long-standing goal in structural biology. Over the last two decades, a paradigm shift has occurred: starting from a large "knowledge gap" between the huge number of protein sequences and the small number of experimentally known structures, today, some form of structural information – either experimental or computational – is available for the majority of amino acids encoded by common model organism genomes. Methods for structure modeling and prediction have made substantial progress over the last decades, and template-based homology modeling techniques have matured to a point where they are now routinely used to complement experimental techniques. However, computational modeling and prediction techniques often fall short in accuracy compared to high-resolution experimental structures, and it is often difficult to convey the expected accuracy and structural variability of a specific model. Retrospectively assessing the quality of blind structure prediction in comparison to experimental reference structures allows benchmarking the state-of-the-art in structure prediction and identifying areas which need further development. The Critical Assessment of Structure Prediction (CASP) experiment has for the last 20 years assessed the progress in the field of protein structure modeling based on predictions for ca. 100 blind prediction targets per experiment, which are carefully evaluated by human experts. The "Continuous Model EvaluatiOn" (CAMEO) project aims to provide a fully automated blind assessment for prediction servers based on weekly pre-released sequences of the Protein Data Bank (PDB). CAMEO has been made possible by the development of novel scoring methods such as lDDT, which are robust against domain movements to allow for automated continuous structure comparison without human intervention.


Smotifs as structural local descriptors of supersecondary elements: classification, completeness and applications

Bio-Algorithms and Med-Systems ◽

10.1515/bams-2014-0016 ◽

2014 ◽

Vol 10 (4) ◽

Author(s):

Jaume Bonet ◽

Andras Fiser ◽

Baldo Oliva ◽

Narcis Fernandez-Fuentes

Keyword(s):

Protein Design ◽

Structure Prediction ◽

Protein Structures ◽

Regular Structure ◽

Loop Structure ◽

Apparent Lack ◽

Knowledge Based ◽

Limits Of Knowledge ◽

Folding Dynamics ◽

And Function

Protein structures are made up of periodic and aperiodic structural elements (i.e., α-helices, β-strands and loops). Despite the apparent lack of regular structure, loops have specific conformations and play a central role in the folding, dynamics, and function of proteins. In this article, we review our previous work on protein loops as local supersecondary structural motifs, or Smotifs. We reexamine our work on the structural classification of loops (ArchDB) and its application to loop structure prediction (ArchPRED), including the assessment of the limits of knowledge-based loop structure prediction methods. We close by focusing on the modular nature of proteins, on how the concept of Smotifs provides a convenient and practical approach to decompose proteins into strings of concatenated Smotifs, and on how this can be used in computational protein design and protein structure prediction.


FASPR: an open-source tool for fast and accurate protein side-chain packing

Bioinformatics ◽

10.1093/bioinformatics/btaa234 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3758-3765 ◽

Cited By ~ 6

Author(s):

Xiaoqiang Huang ◽

Robin Pearce ◽

Yang Zhang

Keyword(s):

Protein Structure ◽

Protein Design ◽

Structure Prediction ◽

Protein Structures ◽

Scoring Function ◽

Supplementary Information ◽

Side Chain ◽

Chain Packing ◽

And Function ◽

Side Chain Packing

Motivation: Protein structure and function are essentially determined by how the side-chain atoms interact with each other. Thus, accurate protein side-chain packing (PSCP) is a critical step toward protein structure prediction and protein design. Despite the importance of the problem, however, the accuracy and speed of current PSCP programs are still not satisfactory.
Results: We present FASPR for fast and accurate PSCP by using an optimized scoring function in combination with a deterministic searching algorithm. The performance of FASPR was compared with four state-of-the-art PSCP methods (CISRR, RASP, SCATD and SCWRL4) on both native and non-native protein backbones. For the assessment on native backbones, FASPR achieved a good performance by correctly predicting 69.1% of all the side-chain dihedral angles using a stringent tolerance criterion of 20°, comparing favorably with SCWRL4, CISRR, RASP and SCATD, which successfully predicted 68.8%, 68.6%, 67.8% and 61.7%, respectively. Additionally, FASPR achieved the highest speed, packing the 379 test protein structures in only 34.3 s, which was significantly faster than the control methods. For the assessment on non-native backbones, FASPR showed an equivalent or better performance on I-TASSER predicted backbones and the backbones perturbed from experimental structures. Detailed analyses showed that the major advantage of FASPR lies in the optimal combination of the dead-end elimination and tree decomposition with a well optimized scoring function, which makes FASPR of practical use for both protein structure modeling and protein design studies.
Availability and implementation: The web server, source code and datasets are freely available at https://zhanglab.ccmb.med.umich.edu/FASPR and https://github.com/tommyhuangthu/FASPR.
Supplementary information: Supplementary data are available at Bioinformatics online.
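The accuracy figure quoted above, the share of side-chain dihedral angles predicted within a 20° tolerance, can be illustrated with a short sketch. It is not the FASPR evaluation code: the function names and example angles are hypothetical, and it treats every angle independently, ignoring refinements such as symmetric side chains that a real benchmark may handle.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def fraction_within_tolerance(predicted, native, tolerance=20.0):
    """Fraction of predicted dihedral angles within `tolerance` of the native."""
    if len(predicted) != len(native):
        raise ValueError("angle lists must have the same length")
    if not predicted:
        return 0.0
    hits = sum(
        angular_difference(p, n) <= tolerance
        for p, n in zip(predicted, native)
    )
    return hits / len(predicted)


# Made-up example angles in degrees; 175 and -175 differ by only 10 degrees
# once the periodicity of angles is taken into account.
predicted = [60.0, 175.0, -60.0, 90.0]
native = [65.0, -175.0, 180.0, 50.0]
print(fraction_within_tolerance(predicted, native))
```

The wraparound in `angular_difference` matters because a naive subtraction would count 175° versus -175° as a 350° error.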


Energy landscapes and solved protein–folding problems

Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences ◽

10.1098/rsta.2004.1502 ◽

2004 ◽

Vol 363 (1827) ◽

pp. 453-467 ◽

Cited By ~ 126

Author(s):

Peter G. Wolynes

Keyword(s):

Protein Folding ◽

Protein Design ◽

Structure Prediction ◽

Energy Landscape ◽

Protein Structures ◽

Energy Landscapes ◽

Folding Kinetics ◽

Energy Functions ◽

Energy Landscape Theory ◽

Novel Proteins

Energy–landscape theory has led to much progress in protein folding kinetics, protein structure prediction and protein design. Funnel landscapes describe protein folding and binding and explain how protein topology determines kinetics. Landscape–optimized energy functions based on bioinformatic input have been used to correctly predict low–resolution protein structures and also to design novel proteins automatically.


Guidelines for the assembly of novel coiled-coil structures: α-sheets and α-cylinders

Biochemical Society Symposium ◽

10.1042/bss0680111 ◽

2001 ◽

Vol 68 ◽

pp. 111-123 ◽

Cited By ~ 6

Author(s):

John Walshaw ◽

Jennifer M. Shipway ◽

Derek N. Woolfson

Keyword(s):

Protein Design ◽

Protein Interactions ◽

Protein Structures ◽

Coiled Coil ◽

Data Bank ◽

Coiled Coils ◽

Heptad Repeat ◽

Protein Protein Interactions ◽

Helical Bundles ◽

Heptad Repeats

The coiled coil is a ubiquitous motif that guides many different protein-protein interactions. The accepted hallmark of coiled coils is a seven-residue (heptad) sequence repeat. The positions of this repeat are labelled a-b-c-d-e-f-g, with residues at a and d tending to be hydrophobic. Such sequences form amphipathic α-helices, which assemble into helical bundles via knobs-into-holes interdigitation of residues from neighbouring helices. We wrote an algorithm, SOCKET, to identify this packing in protein structures, and used this to gather a database of coiled-coil structures from the Protein Data Bank. Surprisingly, in addition to commonly accepted structures with a single, contiguous heptad repeat, we identified sequences with multiple, offset heptad repeats. These 'new' sequence patterns help to explain oligomer-state specification in coiled coils. Here we focus on the structural consequences for sequences with two heptad repeats offset by two residues, i.e. a/f′-b/g′-c/a′-d/b′-e/c′-f/d′-g/e′. This sets up two hydrophobic seams on opposite sides of the helix formed. We describe how such helices may combine to bury these hydrophobic surfaces in two different ways and form two distinct structures: open 'α-sheets' and closed 'α-cylinders'. We highlight these with descriptions of natural structures and outline possibilities for protein design.
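The two-register labelling described above (a/f′-b/g′-c/a′-d/b′-e/c′-f/d′-g/e′) can be reproduced with a small sketch. The function names are hypothetical and this is not part of SOCKET; it only shows how a second heptad register offset by two residues places hydrophobic positions (a, d, a′, d′) along the sequence.

```python
HEPTAD = "abcdefg"


def dual_register(n_residues, offset=2, start=0):
    """Label each residue with its position in two offset heptad registers."""
    labels = []
    for k in range(n_residues):
        first = HEPTAD[(start + k) % 7]
        second = HEPTAD[(start + k - offset) % 7]  # second register lags by `offset`
        labels.append(f"{first}/{second}'")
    return labels


def hydrophobic_indices(n_residues, offset=2, start=0):
    """Indices where either register is at an a or d position."""
    hits = []
    for k, label in enumerate(dual_register(n_residues, offset, start)):
        first, second = label[0], label[2]
        if first in "ad" or second in "ad":
            hits.append(k)
    return hits


print("-".join(dual_register(7)))
print(hydrophobic_indices(7))
```

For one heptad this marks indices 0, 2, 3 and 5 as hydrophobic, four of every seven residues, which is how the two seams on opposite sides of the helix arise.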


RamaNet: Computational de novo helical protein backbone design using a long short-term memory generative neural network

10.1101/671552 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sari Sabban ◽

Mikhail Markovsky

Keyword(s):

Neural Network ◽

Protein Data Bank ◽

Protein Design ◽

Short Term Memory ◽

De Novo ◽

Protein Structures ◽

Data Bank ◽

Protein Backbone ◽

Helical Protein ◽

Long Short Term Memory

The ability to perform de novo protein design will allow researchers to expand the variety of available proteins. By designing synthetic structures computationally, they can utilise more structures than those available in the Protein Data Bank, design structures that are not found in nature, or direct the design of proteins to acquire a specific desired structure. While some researchers attempt to design proteins from first physical and thermodynamic principles, we decided to test whether it is possible to perform de novo helical protein design of just the backbone statistically, using machine learning, by building a model that uses a long short-term memory (LSTM) architecture. The LSTM model used only the ϕ and ψ angles of each residue from an augmented dataset of only helical protein structures. Though the network’s generated backbone structures were not perfect, they were idealised and evaluated post generation, where the non-ideal structures were filtered out and the adequate structures kept. The results were successful in developing a logical, rigid, compact, helical protein backbone topology. This paper is a proof of concept that shows it is possible to generate a novel helical backbone topology using an LSTM neural network architecture with only the ϕ and ψ angles as features. The next step is to attempt to use these backbone topologies and sequence design them to form complete protein structures.
Author summary: This research project stemmed from the desire to expand the pool of protein structures that can be used as scaffolds in computational vaccine development, since the number of structures available from the Protein Data Bank was not sufficient to allow for great diversity and increase the probability of grafting a target motif onto a protein scaffold. Since a protein structure’s backbone can be defined by the ϕ and ψ angles of each amino acid in the polypeptide and can effectively translate a protein’s 3D structure into a table of numbers, and since protein structures are not random, this numerical representation of protein structures can be used to train a neural network to mathematically generalise what a protein structure is, and therefore generate a new protein backbone. Instead of using all proteins in the Protein Data Bank, a curated dataset was used, encompassing protein structures with specific characteristics that will, theoretically, allow them to be evaluated computationally. This paper details how a trained neural network was able to successfully generate helical protein backbones.
