Automated evaluation of quaternary structures from protein crystals

Mapping Intimacies ◽

10.1101/224717 ◽

2017 ◽

Cited By ~ 1

Author(s):

Spencer Bliven ◽

Aleix Lafita ◽

Althea Parker ◽

Guido Capitani ◽

Jose M Duarte

Keyword(s):

Protein Data Bank ◽

Quaternary Structure ◽

Protein Structures ◽

Data Bank ◽

New Method ◽

Protein Assemblies ◽

Native Solution ◽

Biologically Relevant ◽

Crystal Contacts ◽

Structure Of Proteins

Abstract: A correct assessment of the quaternary structure of proteins is a fundamental prerequisite to understanding their function, physico-chemical properties and mode of interaction with other proteins. Currently about 90% of structures in the Protein Data Bank are crystal structures, in which the correct quaternary structure is embedded in the crystal lattice among a number of crystal contacts. Computational methods are required to (1) classify all protein-protein contacts in crystal lattices as biologically relevant or crystal contacts and (2) provide an assessment of how the biologically relevant interfaces combine into a biological assembly. In our previous work we addressed the first problem with our EPPIC (Evolutionary Protein Protein Interface Classifier) method. Here, we present our solution to the second problem with a new method that combines the interface classification results with symmetry and topology considerations. The new algorithm enumerates all possible valid assemblies within the crystal using a graph representation of the lattice and predicts the most probable biological unit based on the pairwise interface scoring. Our method achieves 85% precision on a new dataset of 1,481 biological assemblies with consensus of PDB annotations. Although almost the same precision is achieved by PISA, currently the most popular quaternary structure assignment method, we show that, due to the fundamentally different approach to the problem, the two methods are complementary and could be combined to improve biological assembly assignments. The software for the automatic assessment of protein assemblies (EPPIC version 3) has been made available through a web server at http://www.eppic-web.org.

Author summary: X-ray diffraction experiments are the main experimental technique to reveal the detailed atomic 3-dimensional structure of proteins. In these experiments, proteins are packed into crystals, an environment that is far removed from their native solution environment. Determining which parts of the structure reflect the protein’s state in the cell rather than being artifacts of the crystal environment can be a difficult task. How the different protein subunits assemble together in solution is known as the quaternary structure. Finding the correct quaternary structure is important both for understanding protein oligomerization and for understanding protein-protein interactions at large. Here we present a new method to automatically determine the quaternary structure of proteins given their crystal structure. We provide a theoretical basis for properties that correct protein assemblies should possess, and provide a systematic evaluation of all possible assemblies according to these properties. The method provides guidance to experimental structural biologists as well as to structural bioinformaticians analyzing protein structures in bulk. Assemblies are provided for all proteins in the Protein Data Bank through a public website and database that is updated weekly as new structures are released.

Real time structural search of the Protein Data Bank

10.1101/845123 ◽

2019 ◽

Cited By ~ 1

Author(s):

Dmytro Guzenko ◽

Stephen K. Burley ◽

Jose M. Duarte

Keyword(s):

Real Time ◽

Protein Data Bank ◽

Electron Density ◽

Polypeptide Chain ◽

Protein Structures ◽

Data Bank ◽

Zernike Moment ◽

Search Problem ◽

Mathematical Tool ◽

Protein Assemblies

Abstract: Detection of protein structure similarity is a central challenge in structural bioinformatics. Comparisons are usually performed at the polypeptide chain level; however, the functional form of a protein within the cell is often an oligomer. This fact, together with recent growth of oligomeric structures in the Protein Data Bank (PDB), demands more efficient approaches to oligomeric assembly alignment/retrieval. Traditional methods use atom-level information, which can be complicated by the presence of topological permutations within a polypeptide chain and/or subunit rearrangements. These challenges can be overcome by comparing electron density volumes directly. However, brute-force alignment of 3D data is a compute-intensive search problem. We developed a 3D Zernike moment normalization procedure to orient electron density volumes and assess similarity with unprecedented speed. Similarity searching with this approach enables real-time retrieval of proteins/protein assemblies resembling a target, from PDB or user input, together with resulting alignments (http://shape.rcsb.org).

Author summary: Protein structures possess wildly varied shapes, but patterns at different levels are frequently reused by nature. Finding and classifying these similarities is fundamental to understanding evolution. Given the continued growth in the number of known protein structures in the Protein Data Bank, the task of comparing them to find the common patterns is becoming increasingly complicated. This is especially true when considering complete protein assemblies with several polypeptide chains, where the large sizes further complicate the issue. Here we present a novel method that can detect similarity between protein shapes and that works equally fast for any size of proteins or assemblies. The method looks at proteins as volumes of density distribution, departing from what is more usual in the field: similarity assessment based on atomic coordinates and chain connectivity. A volumetric function can be decomposed with a mathematical tool known as 3D Zernike polynomials, resulting in a compact description as vectors of Zernike moments. The tool was introduced in the 1990s, when it was suggested that the moments could be normalized to be invariant to rotations without losing information. Here we demonstrate that this normalization is in fact possible and that it offers a much more accurate method for assessing similarity between shapes than previous attempts.

Automatic inference of protein quaternary structure from crystals

Journal of Applied Crystallography ◽

10.1107/s0021889803012421 ◽

2003 ◽

Vol 36 (5) ◽

pp. 1116-1122 ◽

Cited By ~ 76

Author(s):

Hannes Ponstingl ◽

Thomas Kabir ◽

Janet M. Thornton

Keyword(s):

Quaternary Structure ◽

Protein Structures ◽

Data Bank ◽

Classification Error ◽

Oligomeric State ◽

Classification Error Rate ◽

Oligomeric Proteins ◽

Oligomeric Protein ◽

Crystal Contacts ◽

Contact Size

The arrangement of the subunits in an oligomeric protein often cannot be inferred without ambiguity from crystallographic studies. The annotation of the functional assembly of protein structures in the Protein Data Bank (PDB) is incomplete and frequently inconsistent. Instructions for the reconstruction, by symmetry, of the functional assembly from the deposited coordinates are often absent. An automatic procedure is proposed for the inference of assembly structures that are likely to be physiologically relevant. The method scores crystal contacts by their contact size and chemical complementarity. The subunit assembly is then inferred from these scored contacts by a clustering procedure involving a single adjustable parameter. When predicting the oligomeric state for a non-redundant set of 55 monomeric and 163 oligomeric proteins from dimers up to hexamers, a classification error rate of 16% was observed.
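
The abstract above describes inferring the subunit assembly from scored contacts by clustering with a single adjustable parameter. A minimal sketch of that idea follows, assuming the parameter acts as a score cutoff and that subunits joined by any contact at or above it belong to the same assembly. The function name `infer_assemblies`, the score scale, and the example contacts are hypothetical illustrations, not the published procedure or its scoring function.

```python
from collections import defaultdict


def infer_assemblies(subunits, scored_contacts, cutoff):
    """Group subunits joined by contacts scoring at or above `cutoff`.

    `scored_contacts` holds (subunit_a, subunit_b, score) tuples; the
    score scale and the cutoff are illustrative assumptions.
    """
    parent = {s: s for s in subunits}

    def find(s):
        # Union-find lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b, score in scored_contacts:
        if score >= cutoff:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = defaultdict(list)
    for s in subunits:
        groups[find(s)].append(s)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical contacts between four subunits in a lattice.
contacts = [("A", "B", 0.9), ("B", "C", 0.2), ("C", "D", 0.8)]
assemblies = infer_assemblies(["A", "B", "C", "D"], contacts, cutoff=0.5)
loose = infer_assemblies(["A", "B", "C", "D"], contacts, cutoff=0.1)
```

Lowering the cutoff merges more subunits into one cluster, which is why a single threshold is enough to move a prediction between, say, two dimers and one tetramer in this toy setting.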

Enriched Conformational Sampling of DNA and Proteins with a Hybrid Hamiltonian Derived from the Protein Data Bank

International Journal of Molecular Sciences ◽

10.3390/ijms19113405 ◽

2018 ◽

Vol 19 (11) ◽

pp. 3405 ◽

Cited By ~ 3

Author(s):

Emanuel Peter ◽

Jiří Černý

Keyword(s):

Partition Function ◽

Protein Data Bank ◽

Protein Structures ◽

Data Bank ◽

Weighting Factor ◽

Potential Of Mean Force ◽

Conformational Space ◽

Dynamics Simulation ◽

Conformational Sampling ◽

Speed Increase

In this article, we present a method for the enhanced molecular dynamics simulation of protein and DNA systems called potential of mean force (PMF)-enriched sampling. The method uses partitions derived from the potentials of mean force, which we determined from DNA and protein structures in the Protein Data Bank (PDB). We define a partition function from a set of PDB-derived PMFs, which efficiently compensates for the error introduced by the assumption of a homogeneous partition function from the PDB datasets. The bias based on the PDB-derived partitions is added in the form of a hybrid Hamiltonian using a renormalization method, which adds the PMF-enriched gradient to the system depending on a linear weighting factor and the underlying force field. We validated the method using simulations of dialanine, the folding of TrpCage, and the conformational sampling of the Dickerson–Drew DNA dodecamer. Our results show the potential for the PMF-enriched simulation technique to enrich the conformational space of biomolecules along their order parameters, while we also observe a considerable speed increase in the sampling by factors ranging from 13.1 to 82. The novel method can effectively be combined with enhanced sampling or coarse-graining methods to enrich conformational sampling with a partition derived from the PDB.

Conformational variability in proteins bound to single-stranded DNA: a new benchmark for new docking perspectives

10.22541/au.162040366.69255354/v1 ◽

2021 ◽

Author(s):

Dominique MIAS-LUCQUIN ◽

Isaure Chauvot de Beauchêne

Keyword(s):

Protein Data Bank ◽

Conformational Changes ◽

Molecular Interactions ◽

Protein Structures ◽

Data Bank ◽

Computational Docking ◽

Ssdna Binding ◽

Conformational Variability ◽

High Flexibility ◽

Docking Benchmark

We explored the Protein Data Bank (PDB) to collect protein-ssDNA structures and create a multi-conformational docking benchmark including both bound and unbound protein structures. Because ssDNA is highly flexible when not bound, no unbound ssDNA structure is included. For the 143 groups identified as bound-unbound structures of the same protein, we studied the conformational changes in the protein induced by ssDNA binding. Moreover, based on several bound or unbound protein structures in some groups, we also assessed the intrinsic conformational variability in either bound or unbound conditions, and compared it to the supposedly binding-induced modifications. This benchmark is, to our knowledge, the first attempt to peruse available structures of protein-ssDNA interactions to such an extent, aiming to improve computational docking tools dedicated to this kind of molecular interaction.

RepeatsDB in 2021: improved data and extended classification for protein tandem repeat structures

Nucleic Acids Research ◽

10.1093/nar/gkaa1097 ◽

2020 ◽

Vol 49 (D1) ◽

pp. D452-D457

Author(s):

Lisanna Paladin ◽

Martina Bevilacqua ◽

Sara Errigo ◽

Damiano Piovesan ◽

Ivan Mičetić ◽

...

Keyword(s):

Protein Data Bank ◽

Tandem Repeat ◽

Tandem Repeats ◽

Classification Scheme ◽

Sequence Similarity ◽

Protein Structures ◽

Hierarchical Classification ◽

Structural Similarity ◽

Data Bank ◽

Similarity Class

Abstract: The RepeatsDB database (URL: https://repeatsdb.org/) provides annotations and classification for protein tandem repeat structures from the Protein Data Bank (PDB). Protein tandem repeats are ubiquitous in all branches of the tree of life. The accumulation of solved repeat structures provides new possibilities for classification and detection, but also increases the need for annotation. Here we present RepeatsDB 3.0, which addresses these challenges and presents an extended classification scheme. The major conceptual change compared to the previous version is the hierarchical classification combining top levels based solely on structural similarity (Class > Topology > Fold) with two new levels (Clan > Family) requiring sequence similarity and describing repeat motifs in collaboration with Pfam. Data growth has been addressed with improved mechanisms for browsing the classification hierarchy. A new UniProt-centric view unifies the increasingly frequent annotation of structures from identical or similar sequences. This update of RepeatsDB aligns with our commitment to develop a resource that extracts, organizes and distributes specialized information on tandem repeat protein structures.

Accurate Representation of Protein-Ligand Structural Diversity in the Protein Data Bank (PDB)

International Journal of Molecular Sciences ◽

10.3390/ijms21062243 ◽

2020 ◽

Vol 21 (6) ◽

pp. 2243

Author(s):

Nicolas K. Shinada ◽

Peter Schmidtke ◽

Alexandre G. de Brevern

Keyword(s):

Protein Data Bank ◽

Protein Sequence ◽

Large Scale ◽

Protein Structures ◽

Structural Diversity ◽

Data Bank ◽

Protein Distribution ◽

Research Areas ◽

Identity Threshold ◽

Protein Sequence Identity

The number of available protein structures in the Protein Data Bank (PDB) has considerably increased in recent years. Thanks to the growth of structures and complexes, numerous large-scale studies have been done in various research areas, e.g., protein–protein, protein–DNA, or in drug discovery. While protein redundancy has typically been managed with a simple protein sequence identity threshold, the similarity of protein-ligand complexes should also be considered from a structural perspective. The protein-ligand duplicates in the PDB are widely known, but were never quantitatively assessed, as they are quite complex to analyze and compare. Here, we present a specific clustering of protein-ligand structures to avoid bias found in different studies. The methodology is based on binding site superposition, and a combination of weighted Root Mean Square Deviation (RMSD) assessment and hierarchical clustering. Repeated structures of proteins of interest are highlighted and only representative conformations were conserved for a non-biased view of protein distribution. Three types of cases are described based on the number of distinct conformations identified for each complex. Defining these categories decreases the number of complexes by 3.84-fold, and offers more refined results compared to a protein sequence-based method. Widely distinct conformations were analyzed using normalized B-factors. Furthermore, a non-redundant dataset was generated for future molecular interactions analysis or virtual screening studies.
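
The methodology above relies on a weighted RMSD between superposed binding sites. The following is a minimal sketch of a weighted RMSD, assuming the two sites are already superposed and list equivalent atoms in the same order. The abstract does not state how the weights are chosen, so the per-atom weights here are hypothetical placeholders, and `weighted_rmsd` is an illustrative name rather than the authors' code.

```python
import math


def weighted_rmsd(coords_a, coords_b, weights):
    """Weighted RMSD between two pre-superposed, equally ordered atom lists.

    Computes sqrt(sum_i w_i * d_i^2 / sum_i w_i), where d_i is the distance
    between equivalent atoms. The weighting scheme is an assumption.
    """
    if not (len(coords_a) == len(coords_b) == len(weights)):
        raise ValueError("inputs must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    acc = 0.0
    for (xa, ya, za), (xb, yb, zb), w in zip(coords_a, coords_b, weights):
        acc += w * ((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2)
    return math.sqrt(acc / total_weight)


# Two toy binding sites with two atoms each; only the second atom moves.
site_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
site_b = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
uniform = weighted_rmsd(site_a, site_b, [1.0, 1.0])
emphasized = weighted_rmsd(site_a, site_b, [1.0, 3.0])
```

A matrix of such pairwise values could then serve as the distance input to a hierarchical clustering step, with one representative kept per cluster.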

Detection of trans–cis flips and peptide-plane flips in protein structures

Acta Crystallographica Section D Biological Crystallography ◽

10.1107/s1399004715008263 ◽

2015 ◽

Vol 71 (8) ◽

pp. 1604-1614 ◽

Cited By ~ 16

Author(s):

Wouter G. Touw ◽

Robbie P. Joosten ◽

Gert Vriend

Keyword(s):

Structure Function ◽

Web Service ◽

Protein Data Bank ◽

Protein Structures ◽

Data Bank ◽

Peptide Bond ◽

Peptide Bonds ◽

Unknown Peptide ◽

Peptide Plane ◽

Service Interface

A coordinate-based method is presented to detect peptide bonds that need correction either by a peptide-plane flip or by a trans–cis inversion of the peptide bond. When applied to the whole Protein Data Bank, the method predicts 4617 trans–cis flips and many thousands of hitherto unknown peptide-plane flips. A few examples are highlighted for which a correction of the peptide-plane geometry leads to a correction of the understanding of the structure–function relation. All data, including 1088 manually validated cases, are freely available and the method is available from a web server, a web-service interface and through WHAT_CHECK.
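
For readers unfamiliar with the geometry involved, the cis or trans state of a peptide bond is read from the omega torsion angle defined by the atoms Cα(i), C(i), N(i+1) and Cα(i+1): values near 0° are cis and values near ±180° are trans. The sketch below only computes and labels that angle from coordinates. It does not reproduce the paper's method for predicting which bonds need correction, and the 30° tolerance and the function names are assumptions chosen for illustration.

```python
import math


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    def sub(u, v):
        return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

    def dot(u, v):
        return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

    def cross(u, v):
        return (u[1] * v[2] - u[2] * v[1],
                u[2] * v[0] - u[0] * v[2],
                u[0] * v[1] - u[1] * v[0])

    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)  # normals of the two planes
    y = math.sqrt(dot(b1, b1)) * dot(b0, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def classify_omega(omega_deg, tol=30.0):
    """Label an omega angle; the tolerance is an illustrative assumption."""
    magnitude = abs(omega_deg)
    if magnitude <= tol:
        return "cis"
    if magnitude >= 180.0 - tol:
        return "trans"
    return "distorted"


# Idealized planar toy coordinates standing in for CA(i), C(i), N(i+1), CA(i+1).
trans_like = dihedral_deg((0.0, 1.0, 0.0), (0.0, 0.0, 0.0),
                          (1.0, 0.0, 0.0), (1.0, -1.0, 0.0))
cis_like = dihedral_deg((0.0, 1.0, 0.0), (0.0, 0.0, 0.0),
                        (1.0, 0.0, 0.0), (1.0, 1.0, 0.0))
```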

Evaluation of variability in high resolution protein structures by global distance scoring

10.1101/202028 ◽

2017 ◽

Author(s):

Risa Anzai ◽

Yoshiki Asami ◽

Waka Inoue ◽

Hina Ueno ◽

Koya Yamada ◽

...

Keyword(s):

High Resolution ◽

Global Analysis ◽

Protein Structures ◽

Data Bank ◽

Relevant Information ◽

Systematic Analysis ◽

Structure Variation ◽

Model Calculations ◽

Biologically Relevant ◽

Global Comparison

Abstract: Systematic analysis of statistical and dynamical properties of proteins is critical to understanding cellular events. Extraction of biologically relevant information from a set of high-resolution structures is important because it can provide mechanistic details behind the functional properties of protein families, enabling rational comparison between families. Most of the current structure comparisons are pairwise-based, which hampers the global analysis of increasing contents in the Protein Data Bank. Additionally, pairing of protein structures introduces uncertainty with respect to reproducibility because it frequently accompanies other settings for superimposition. This study introduces intramolecular distance scoring for the analysis of human proteins, for each of which at least several high-resolution structures are available. We show that the results can be used comprehensively to overview advances in the atomic-level exploration of each protein and protein family. This method, and the interpretation based on model calculations, provide new criteria for understanding specific and non-specific structure variation in a protein, enabling global comparison of the dynamics among a vast variety of proteins from different species.
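
One reason intramolecular distances avoid the superimposition settings mentioned above is that distances between atoms of the same molecule do not change under rotation or translation. The abstract does not give the scoring formula, so the sketch below is an assumption: it takes several structures of the same protein with identical atom ordering and reports, for each atom pair, the population standard deviation of their distance across the structures. The name `distance_variability` and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations
from statistics import pstdev


def distance_variability(structures):
    """Per-pair spread of intramolecular distances across structures.

    `structures` is a list of coordinate lists with identical atom ordering.
    Returns {(i, j): standard deviation of the i-j distance}. No
    superposition is needed because the distances are internal.
    """
    n_atoms = len(structures[0])
    if any(len(s) != n_atoms for s in structures):
        raise ValueError("all structures must list the same atoms")
    result = {}
    for i, j in combinations(range(n_atoms), 2):
        dists = [math.dist(s[i], s[j]) for s in structures]
        result[(i, j)] = pstdev(dists)
    return result


# Two toy "structures" of a three-atom molecule; only atom 2 moves.
model_1 = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
model_2 = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 8.0, 0.0)]
spread = distance_variability([model_1, model_2])
```

Pairs with a spread near zero behave rigidly across the set, while large values flag regions that differ between structures.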

Disordered Residues and Patterns in the Protein Data Bank

Molecules ◽

10.3390/molecules25071522 ◽

2020 ◽

Vol 25 (7) ◽

pp. 1522 ◽

Cited By ~ 2

Author(s):

Mikhail Yu. Lobanov ◽

Ilya V. Likhachev ◽

Oxana V. Galzitskaya

Keyword(s):

Amino Acids ◽

Statistical Analysis ◽

Protein Data Bank ◽

Protein Structures ◽

3D Structure ◽

Data Bank ◽

Disordered Regions

We created a new library of disordered patterns and disordered residues in the Protein Data Bank (PDB). To obtain such datasets, we clustered the PDB and obtained the groups of chains with different identities and marked disordered residues. We elaborated a new procedure for finding disordered patterns and created a new version of the library. This library includes three sets of patterns: unique patterns, patterns consisting of two kinds of amino acids, and homo-repeats. Using this database, the user can: (1) find homologues in the entire Protein Data Bank; (2) perform a statistical analysis of disordered residues in protein structures; (3) search for disordered patterns and homo-repeats; (4) search for disordered regions in different chains of the same protein; (5) download clusters of protein chains with different identity from our database and library of disordered patterns; and (6) observe 3D structure interactively using MView. A new library of disordered patterns will help improve the accuracy of predictions for residues that will be structured or unstructured in a given region.

Revisiting Chameleon Sequences in the Protein Data Bank

Algorithms ◽

10.3390/a11080114 ◽

2018 ◽

Vol 11 (8) ◽

pp. 114 ◽

Cited By ~ 3

Author(s):

Mihaly Mezei

Keyword(s):

Protein Data Bank ◽

Protein Structures ◽

Data Bank ◽

Secondary Structures ◽

Steady Growth ◽

Periodic Repetition

The steady growth of the Protein Data Bank (PDB) suggests the periodic repetition of searches for sequences that form different secondary structures in different protein structures; these are called chameleon sequences. This paper presents a fast, O(n log n) algorithm for such searches and presents the results on all protein structures in the PDB. The longest such sequence found consists of 20 residues.
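
One way to reach n log n behaviour for this kind of search is to sort all fixed-length sequence windows so that identical windows become adjacent, then compare their secondary-structure assignments within each run. The sketch below illustrates that sort-and-group idea only; it is not claimed to be the paper's algorithm. It also simplifies the definition by treating any two differing secondary-structure strings as "different", and the function name, the one-letter H/E codes and the example chains are hypothetical.

```python
from itertools import groupby


def find_chameleons(chains, k):
    """Return k-mers observed with more than one secondary-structure string.

    `chains` is a list of (sequence, secondary_structure) string pairs of
    equal length. Sorting the windows dominates the cost, giving
    O(n log n) comparisons for n windows.
    """
    windows = []
    for seq, ss in chains:
        if len(seq) != len(ss):
            raise ValueError("sequence and structure strings must match")
        for i in range(len(seq) - k + 1):
            windows.append((seq[i:i + k], ss[i:i + k]))
    windows.sort()  # identical k-mers become adjacent

    hits = {}
    for kmer, group in groupby(windows, key=lambda w: w[0]):
        states = sorted({ss for _, ss in group})
        if len(states) > 1:
            hits[kmer] = states
    return hits


# Toy chains sharing the window KLVE, helical in one and strand in the other.
chains = [("AKLVEA", "HHHHHH"), ("GKLVEG", "EEEEEE")]
hits = find_chameleons(chains, k=4)
```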
