Lemon: a framework for rapidly mining structural information from the Protein Data Bank

bioRxiv ◽

10.1101/379891 ◽

2018 ◽

Author(s):

Jonathan Fine ◽

Gaurav Chopra

Keyword(s):

Protein Data Bank ◽

Structural Information ◽

Computational Cost ◽

Data Bank ◽

Structural Features ◽

Develop Software ◽

Reading Text ◽

One Stop ◽

Essential Resource ◽

3D Descriptors

Motivation: The Protein Data Bank (PDB) currently holds over 140,000 biomolecular structures and continues to release new structures on a weekly basis. The PDB is an essential resource for the structural bioinformatics community in developing software that mines, uses, categorizes, and analyzes such data. New computational biology methods are evaluated using custom benchmarking sets derived as subsets of 3D experimentally determined structures and structural features from the PDB. Currently, such benchmarking features are manually curated with custom scripts in a non-standardized manner, which slows both their distribution and their updating with new experimental structures. Finally, there is a scarcity of standardized tools to rapidly query 3D descriptors of the entire PDB. Approach: Our solution is the Lemon framework, a C++11 library with Python bindings, which provides a consistent workflow methodology for selecting biomolecular interactions based on user criteria and computing desired 3D structural features. This framework can parse and characterize the entire PDB in less than ten minutes on modern, multithreaded hardware. The parsing speed is obtained by using the recently developed MacroMolecule Transmission Format (MMTF), which avoids the computational cost of reading text-based PDB files. The use of C++ lambda functions and Python bindings provides extensive flexibility for analysis and categorization of the PDB by allowing users to write custom functions to suit their objectives. We think Lemon will become a one-stop shop to quickly mine the entire PDB to generate desired structural biology features. The Lemon software is available as a C++ header library along with example functions at https://github.com/chopralab/lemon.

Lemon: a framework for rapidly mining structural information from the Protein Data Bank

Bioinformatics ◽

10.1093/bioinformatics/btz178 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4165-4167 ◽

Cited By ~ 1

Author(s):

Jonathan Fine ◽

Gaurav Chopra

Keyword(s):

Protein Data Bank ◽

Structural Information ◽

Computational Cost ◽

Data Bank ◽

Structural Features ◽

Supplementary Information ◽

Develop Software ◽

Reading Text ◽

Essential Resource ◽

3D Descriptors

Motivation: The Protein Data Bank (PDB) currently holds over 140 000 biomolecular structures and continues to release new structures on a weekly basis. The PDB is an essential resource for the structural bioinformatics community in developing software that mines, uses, categorizes and analyzes such data. New computational biology methods are evaluated using custom benchmarking sets derived as subsets of 3D experimentally determined structures and structural features from the PDB. Currently, such benchmarking features are manually curated with custom scripts in a non-standardized manner, which slows both their distribution and their updating with new experimental structures. Finally, there is a scarcity of standardized tools to rapidly query 3D descriptors of the entire PDB. Results: Our solution is the Lemon framework, a C++11 library with Python bindings, which provides a consistent workflow methodology for selecting biomolecular interactions based on user criteria and computing desired 3D structural features. This framework can parse and characterize the entire PDB in <10 min on modern, multithreaded hardware. The parsing speed is obtained by using the recently developed MacroMolecule Transmission Format, which avoids the computational cost of reading text-based PDB files. The use of C++ lambda functions and Python bindings provides extensive flexibility for analysis and categorization of the PDB by allowing users to write custom functions to suit their objectives. We think Lemon will become a one-stop shop to quickly mine the entire PDB to generate desired structural biology features. Availability and implementation: The Lemon software is available as a C++ header library along with a PyPI package and example functions at https://github.com/chopralab/lemon. Supplementary information: Supplementary data are available at Bioinformatics online.
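
The workflow this abstract describes (apply a user-written function to every entry, in parallel, and combine the results) can be illustrated with a minimal Python sketch. This is not Lemon's API: the `mine` helper, the `count_residues` function, and the toy entries with made-up identifiers are hypothetical stand-ins, and a real run would read MMTF-encoded structures rather than an in-memory dictionary.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for parsed entries (made-up IDs): entry ID -> residue names.
ENTRIES = {
    "1ABC": ["ALA", "GLY", "HEM", "HOH"],
    "2XYZ": ["GLY", "GLY", "ATP"],
    "3DEF": ["HEM", "SER"],
}


def mine(entries, worker, threads=4):
    """Run worker(entry_id, residues) on every entry and merge the Counters."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map yields the per-entry results in input order.
        for partial in pool.map(lambda item: worker(*item), entries.items()):
            total.update(partial)
    return total


def count_residues(entry_id, residues):
    """User-defined criterion (hypothetical): tally every residue name."""
    return Counter(residues)


result = mine(ENTRIES, count_residues)
print(result.most_common(2))
```

Swapping `count_residues` for another function changes what is mined without touching the traversal, which is the flexibility the abstract attributes to user-supplied lambda functions.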

THE RAMACHANDRAN MAP OF MORE THAN 6,500 PERFECT POLYPEPTIDE CHAINS

Biophysical Reviews and Letters ◽

10.1142/s1793048007000519 ◽

2007 ◽

Vol 02 (03n04) ◽

pp. 267-271

Author(s):

ZOLTÁN SZABADKA ◽

RAFAEL ÖRDÖG ◽

VINCE GROLMUSZ

Keyword(s):

Protein Data Bank ◽

Structural Information ◽

Data Bank ◽

Ramachandran Map ◽

Tenfold Increase ◽

Automated Processing ◽

Data Points ◽

Α Helix ◽

The Right ◽

Polypeptide Chains

The Protein Data Bank (PDB) is the most important repository of protein structural information, containing more than 45,000 deposited entries today. Because of its inhomogeneous structure, fully automated processing of the PDB is almost impossible. In a previous work, we cleaned and restructured the entries in the Protein Data Bank and built the RS-PDB database from the result. Using the RS-PDB database, we draw a Ramachandran plot from 6,593 "perfect" polypeptide chains found in the PDB, containing 1,192,689 residues. This is a more than tenfold increase in the size of the data analyzed before this work. The density of the data points makes it possible to draw a Ramachandran map enhanced with a logarithmic heat map, showing the fine inner structure of the right-handed α-helix region.
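
A Ramachandran map plots the backbone dihedral angles φ (atoms C(i-1), N(i), Cα(i), C(i)) against ψ (atoms N(i), Cα(i), C(i), N(i+1)) for each residue. The sketch below is not the authors' code: it shows one standard way to compute a signed dihedral angle and to bin (φ, ψ) pairs on a logarithmic scale. The function names, the 10° bin width, and the log10(1 + count) scaling are assumptions made for illustration.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)  # normals of the two planes
    x = dot(n1, n2)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    return math.degrees(math.atan2(y, x))


def log_density(angles, bin_width=10):
    """Bin (phi, psi) pairs in degrees; return log10(1 + count) per bin.

    Rows are indexed by psi and columns by phi, both starting at -180.
    """
    n = 360 // bin_width
    counts = [[0] * n for _ in range(n)]
    for phi, psi in angles:
        col = min(int((phi + 180) // bin_width), n - 1)
        row = min(int((psi + 180) // bin_width), n - 1)
        counts[row][col] += 1
    return [[math.log10(1 + c) for c in row] for row in counts]


# Nine identical points in the right-handed alpha-helix region.
grid = log_density([(-60.0, -45.0)] * 9)
```

The logarithmic scale keeps sparsely populated regions visible next to the heavily populated α-helix peak, which is what makes fine structure inside a dense region distinguishable.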

High throughput processing of the structural information in the protein data bank

Journal of Molecular Graphics and Modelling ◽

10.1016/j.jmgm.2006.08.004 ◽

2007 ◽

Vol 25 (6) ◽

pp. 831-836 ◽

Cited By ~ 10

Author(s):

Zoltan Szabadka ◽

Vince Grolmusz

Keyword(s):

Protein Data Bank ◽

High Throughput ◽

Structural Information ◽

Data Bank

Protein Data Bank (PDB): Database of Three-Dimensional Structural Information of Biological Macromolecules

Acta Crystallographica Section D Biological Crystallography ◽

10.1107/s0907444998009378 ◽

1998 ◽

Vol 54 (6) ◽

pp. 1078-1084 ◽

Cited By ~ 250

Author(s):

Joel L. Sussman ◽

Dawei Lin ◽

Jiansheng Jiang ◽

Nancy O. Manning ◽

Jaime Prilusky ◽

...

Keyword(s):

Nucleic Acids ◽

Protein Data Bank ◽

Structural Information ◽

National Laboratory ◽

Three Dimensional ◽

Data Bank ◽

Brookhaven National Laboratory ◽

Biological Macromolecules

The Protein Data Bank (PDB) at Brookhaven National Laboratory is a database containing experimentally determined three-dimensional structures of proteins, nucleic acids and other biological macromolecules, with approximately 8000 entries. Data are easily submitted via PDB's WWW-based tool AutoDep, in either mmCIF or PDB format, and are most conveniently examined via PDB's WWW-based tool 3DB Browser.

Is the growth rate of Protein Data Bank sufficient to solve the protein structure prediction problem using template-based modeling?

Bio-Algorithms and Med-Systems ◽

10.1515/bams-2014-0024 ◽

2015 ◽

Vol 11 (1) ◽

pp. 1-7 ◽

Cited By ~ 4

Author(s):

Michal Brylinski

Keyword(s):

Protein Structure ◽

Protein Data Bank ◽

Protein Structure Prediction ◽

Structure Prediction ◽

Structural Information ◽

Three Dimensional ◽

Data Bank ◽

Prediction Problem ◽

Three Dimensional Models ◽

Protein Structure Prediction Problem

The Protein Data Bank (PDB) undergoes an exponential expansion in terms of the number of macromolecular structures deposited every year. A pivotal question is how this rapid growth of structural information improves the quality of three-dimensional models constructed by contemporary bioinformatics approaches. To address this problem, we performed a retrospective analysis of the structural coverage of a representative set of proteins using remote homology detected by COMPASS and HHpred. We show that the number of proteins whose structures can be confidently predicted increased during a 9-year period between 2005 and 2014 on account of the PDB growth alone. Nevertheless, this encouraging trend slowed down noticeably around the year 2008 and has yielded insignificant improvements ever since. At the current pace, it is unlikely that the protein structure prediction problem will be solved in the near future using existing template-based modeling techniques. Therefore, further advances in experimental structure determination, qualitatively better approaches in fold recognition, and more accurate template-free structure prediction methods are desperately needed.

Analysis of the Torsion Angles between Helical Axes in Pairs of Helices in Protein Molecules

Математическая биология и биоинформатика ◽

10.17537/2017.12.398 ◽

2017 ◽

Vol 12 (2) ◽

pp. 398-410 ◽

Cited By ~ 6

Author(s):

Д.А. Тихонов ◽

D.A. Tikhonov

Keyword(s):

Protein Data Bank ◽

Data Bank ◽

Structural Features ◽

Torsion Angles ◽

Protein Molecules ◽

Helix Packing

In this study, the distribution of the torsion angles Ω between helical axes in pairs of connected helices found in known proteins was analyzed. The database of helical pairs was compiled from the Protein Data Bank according to rules suggested earlier. The database was analyzed in order to refine its classification and to identify novel structural features of helix packing. It was subdivided into three subsets according to whether the helix projections cross on the parallel planes passing through the axes of the helices. Helical pairs without crossing projections are distributed over the whole range of angles Ω, although there are two maxima at Ω = 0° and Ω = 180°. Most helical pairs in this subset are formed by α-helices and 3₁₀-helices. The distribution of all helical pairs with crossing helix projections has a maximum at 20° < Ω < 25°. In this subset, most helical pairs are formed by α-helices. The distribution of only the α-helical pairs with crossing axis projections has three maxima, at –50° < Ω < –25°, 20° < Ω < 25°, and 70° < Ω < 110°.
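
The abstract does not spell out how Ω is computed, so the following is only a sketch under stated assumptions: each helix axis is given as a straight segment (start and end point), and Ω is taken as the signed angle between the two axis directions measured around their common perpendicular, the line of closest approach. The function name `axis_torsion`, the sign convention, and the handling of intersecting axes are illustrative choices, not the paper's definition.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    n = math.sqrt(dot(v, v))
    return (v[0] / n, v[1] / n, v[2] / n)


def axis_torsion(a_start, a_end, b_start, b_end, eps=1e-9):
    """Signed angle in degrees between two axes around their common perpendicular."""
    d1, d2 = unit(sub(a_end, a_start)), unit(sub(b_end, b_start))
    w = sub(a_start, b_start)
    b, d, e = dot(d1, d2), dot(d1, w), dot(d2, w)
    denom = 1.0 - b * b  # approaches zero when the axes are parallel
    if denom > eps:
        s = (b * e - d) / denom
        t = (e - b * d) / denom
    else:
        s, t = 0.0, e
    # Closest points on the (infinite) axis lines and the vector joining them.
    p = tuple(a_start[i] + s * d1[i] for i in range(3))
    q = tuple(b_start[i] + t * d2[i] for i in range(3))
    u = sub(q, p)
    dist = math.sqrt(dot(u, u))
    cos_omega = max(-1.0, min(1.0, b))
    if dist < eps:
        # Intersecting axes: no perpendicular to fix the sign, return unsigned angle.
        return math.degrees(math.acos(cos_omega))
    sin_omega = dot(cross(d1, d2), u) / dist
    return math.degrees(math.atan2(sin_omega, cos_omega))


omega = axis_torsion((0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                     (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
```

With this convention, parallel axes give Ω = 0° and antiparallel axes give |Ω| = 180°, matching the two maxima reported for pairs without crossing projections.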

Protein Interaction Domains: structural features and drug discovery applications (part 2)

Current Medicinal Chemistry ◽

10.2174/0929867327666200114114142 ◽

2020 ◽

Vol 27 ◽

Author(s):

Marian Vincenzi ◽

Flavia Anna Mercurio ◽

Marilisa Leone

Keyword(s):

Protein Interaction ◽

Viral Infections ◽

Structural Information ◽

Data Bank ◽

Structural Features ◽

Modular Organization ◽

Modular Architecture ◽

Consensus Sequences ◽

Interaction Domains ◽

Catalytic Functions

Background: Proteins present a modular organization made up of several domains. Apart from domains with catalytic functions, many others are crucial for recruiting interactors. The latter domains can be defined as "PIDs" (Protein Interaction Domains) and are responsible for pivotal outcomes in signal transduction and in an array of normal physiological and disease-related pathways. Targeting such PIDs with small molecules and peptides able to modulate their interaction networks may represent a valuable route to discovering novel therapeutics. Objective: This work is a continuation of a very recent review describing PIDs able to recognize post-translationally modified peptide segments. In contrast, this second part concerns PIDs that interact with simple peptide sequences made up of standard amino acids. Method: Crucial structural information on different domain subfamilies and their interactomes was gained by a wide search of different online databases, including the PDB (Protein Data Bank), Pfam (Protein family), and SMART (Simple Modular Architecture Research Tool). PubMed was searched as well to explore the most recent literature related to the topic. Results and Conclusion: PIDs are multifaceted: they all have diverse structural features and can recognize several consensus sequences. PIDs can be linked to the onset and progression of different diseases, such as cancer or viral infections, and find applications in the field of personalized medicine. Many efforts have centered on peptide/peptidomimetic inhibitors of PID-mediated interactions, but much more work is needed to improve the drug-likeness and interaction affinities of the identified compounds.

Protein Data Bank (PDB): A Database of 3D Structural Information of Biological Macromolecules

Encyclopedia of Computational Chemistry ◽

10.1002/0470845015.cpa022f ◽

2002 ◽

Author(s):

Joel L. Sussman ◽

Frances C. Bernstein ◽

Jiansheng Jiang ◽

Michael Libeson ◽

Dawei Lin ◽

...

Keyword(s):

Protein Data Bank ◽

Structural Information ◽

Data Bank ◽

Biological Macromolecules

Lemon: a framework for rapidly mining structural information from the protein data bank for the development of virtual screening benchmarking sets

10.1021/scimeetings.0c06740 ◽

2020 ◽

Author(s):

Chopra Gaurav ◽

Matthew Muhoberac ◽

Jonathan Fine

Keyword(s):

Virtual Screening ◽

Protein Data Bank ◽

Structural Information ◽

Data Bank

A library of coiled-coil domains: from regular bundles to peculiar twists

Bioinformatics ◽

10.1093/bioinformatics/btaa1041 ◽

2020 ◽

Author(s):

Krzysztof Szczepaniak ◽

Adriana Bukala ◽

Antonio Marinho da Silva Neto ◽

Jan Ludwiczak ◽

Stanislaw Dunin-Horkawicz

Keyword(s):

Protein Data Bank ◽

Conformational Changes ◽

Coiled Coil ◽

Data Bank ◽

Structural Features ◽

Coiled Coils ◽

Supplementary Information ◽

Numerical Representation ◽

Data Set ◽

Potential Applications

Motivation: Coiled coils are widespread protein domains involved in diverse processes ranging from providing structural rigidity to the transduction of conformational changes. They comprise two or more α-helices that are wound around each other to form a regular supercoiled bundle. Owing to this regularity, coiled-coil structures can be described with parametric equations, thus enabling the numerical representation of their properties, such as the degree and handedness of supercoiling, the rotational state of the helices, and the offset between them. These descriptors are invaluable in understanding the function of coiled coils and designing new structures of this type. The existing tools for such calculations require manual preparation of input and are therefore not suitable for high-throughput analyses. Results: To address this problem, we developed SamCC-Turbo, software for fully automated, per-residue measurement of coiled coils. By surveying the Protein Data Bank with SamCC-Turbo, we generated a comprehensive atlas of ∼50,000 coiled-coil regions. This machine learning-ready data set features precise measurements and decomposes coiled-coil structures into fragments characterized by various degrees of supercoiling. The potential applications of SamCC-Turbo are exemplified by analyses in which we reveal general structural features of coiled coils involved in functions requiring conformational plasticity. Finally, we discuss further directions in the prediction and modeling of coiled coils. Availability: SamCC-Turbo is available as a web server (https://lbs.cent.uw.edu.pl/samcc_turbo) and as a Python library (https://github.com/labstructbioinf/samcc_turbo), whereas the results of the Protein Data Bank scan can be browsed and downloaded at https://lbs.cent.uw.edu.pl/ccdb. Supplementary information: Supplementary data are available at Bioinformatics online.
