Lemon: a framework for rapidly mining structural information from the Protein Data Bank

2018 ◽  
Author(s):  
Jonathan Fine ◽  
Gaurav Chopra

Abstract Motivation: The Protein Data Bank (PDB) currently holds over 140,000 biomolecular structures and continues to release new structures on a weekly basis. The PDB is an essential resource for the structural bioinformatics community to develop software that mines, uses, categorizes, and analyzes such data. New computational biology methods are evaluated using custom benchmarking sets derived as subsets of 3D experimentally determined structures and structural features from the PDB. Currently, such benchmarking features are manually curated with custom scripts in a non-standardized manner, which results in slow distribution and slow updates as new experimental structures appear. Finally, there is a scarcity of standardized tools to rapidly query 3D descriptors of the entire PDB. Approach: Our solution is the Lemon framework, a C++11 library with Python bindings, which provides a consistent workflow methodology for selecting biomolecular interactions based on user criteria and computing desired 3D structural features. This framework can parse and characterize the entire PDB in less than ten minutes on modern, multithreaded hardware. The speed in parsing is obtained by using the recently developed MacroMolecule Transmission Format (MMTF) to reduce the computational cost of reading text-based PDB files. The use of C++ lambda functions and Python bindings provides extensive flexibility for analysis and categorization of the PDB by allowing the user to write custom functions to suit their objective. We think Lemon will become a one-stop shop to quickly mine the entire PDB to generate desired structural biology features. The Lemon software is available as a C++ header library along with example functions at https://github.com/chopralab/lemon.

2019 ◽  
Vol 35 (20) ◽  
pp. 4165-4167 ◽  
Author(s):  
Jonathan Fine ◽  
Gaurav Chopra

Abstract Motivation: The Protein Data Bank (PDB) currently holds over 140 000 biomolecular structures and continues to release new structures on a weekly basis. The PDB is an essential resource for the structural bioinformatics community to develop software that mines, uses, categorizes and analyzes such data. New computational biology methods are evaluated using custom benchmarking sets derived as subsets of 3D experimentally determined structures and structural features from the PDB. Currently, such benchmarking features are manually curated with custom scripts in a non-standardized manner, which results in slow distribution and slow updates as new experimental structures appear. Finally, there is a scarcity of standardized tools to rapidly query 3D descriptors of the entire PDB. Results: Our solution is the Lemon framework, a C++11 library with Python bindings, which provides a consistent workflow methodology for selecting biomolecular interactions based on user criteria and computing desired 3D structural features. This framework can parse and characterize the entire PDB in <10 min on modern, multithreaded hardware. The speed in parsing is obtained by using the recently developed MacroMolecule Transmission Format to reduce the computational cost of reading text-based PDB files. The use of C++ lambda functions and Python bindings provides extensive flexibility for analysis and categorization of the PDB by allowing the user to write custom functions to suit their objective. We think Lemon will become a one-stop shop to quickly mine the entire PDB to generate desired structural biology features. Availability and implementation: The Lemon software is available as a C++ header library along with a PyPI package and example functions at https://github.com/chopralab/lemon. Supplementary information: Supplementary data are available at Bioinformatics online.
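The workflow described in this abstract, where a user-written function is applied to every entry on multithreaded hardware and the per-entry results are merged, can be illustrated with a minimal sketch. This is not Lemon's API: `run_over_entries`, `count_residue_names`, and the toy archive are hypothetical names and data made up for illustration, and the example uses only the Python standard library.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_over_entries(entries, worker, max_workers=4):
    """Apply worker(entry_id, residues) to every entry and merge the Counters.

    `entries` maps an entry id to its parsed content. Here the content is
    just a list of residue names; a real framework would hand the worker a
    full parsed structure instead (assumption for illustration).
    """
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the user function on each entry across the threads.
        for partial in pool.map(lambda item: worker(*item), entries.items()):
            total.update(partial)
    return total


def count_residue_names(entry_id, residues):
    """User-supplied function: tally residue names in one entry."""
    return Counter(residues)


# Toy stand-in for an archive; ids and contents are invented.
toy_archive = {
    "toy1": ["ALA", "GLY", "HOH"],
    "toy2": ["ALA", "HEM", "HOH", "HOH"],
    "toy3": ["ALA", "GLY"],
}

residue_totals = run_over_entries(toy_archive, count_residue_names)
```

Swapping `count_residue_names` for another function changes the analysis without touching the traversal code, which is the kind of flexibility the abstract attributes to user-written lambda functions.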


2007 ◽  
Vol 02 (03n04) ◽  
pp. 267-271
Author(s):  
ZOLTÁN SZABADKA ◽  
RAFAEL ÖRDÖG ◽  
VINCE GROLMUSZ

The Protein Data Bank (PDB) is the most important repository of protein structural information, containing more than 45,000 deposited entries today. Because of its inhomogeneous structure, fully automated processing of the PDB is almost impossible. In a previous work, we cleaned and restructured the entries in the Protein Data Bank, and from the result we built the RS-PDB database. Using the RS-PDB database, we draw a Ramachandran plot from 6,593 "perfect" polypeptide chains found in the PDB, containing 1,192,689 residues. This is a more than tenfold increase in the size of the data analyzed before this work. The density of the data points makes it possible to draw a Ramachandran map enhanced with a logarithmic heat map, showing the fine inner structure of the right-handed α-helix region.
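A Ramachandran map is built from backbone dihedral angles (φ, ψ), and a logarithmic heat map keeps sparsely populated regions visible next to very dense ones. The sketch below shows one way to compute a signed dihedral from four points and to bin angle pairs into log-scaled counts. The function names, the 10° bin width, and the `log10(1 + count)` scaling are assumptions for illustration, not the authors' procedure.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    # atan2 form: stable near 0 and 180 degrees, result in [-180, 180].
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def log_density(angle_pairs, bin_width=10):
    """Bin (phi, psi) pairs and return log10(1 + count) per bin.

    Rows are indexed by psi and columns by phi, both starting at -180.
    """
    n = 360 // bin_width
    counts = [[0] * n for _ in range(n)]
    for phi, psi in angle_pairs:
        col = int((phi + 180) // bin_width) % n  # +180 wraps to the -180 bin
        row = int((psi + 180) // bin_width) % n
        counts[row][col] += 1
    return [[math.log10(1 + c) for c in r] for r in counts]


# Toy input: nine points in one bin and a single point in another.
grid = log_density([(-60, -45)] * 9 + [(-120, 130)])
```

For backbone atoms, φ would be taken over C(i-1), N(i), CA(i), C(i) and ψ over N(i), CA(i), C(i), N(i+1); the resulting grid can be passed to any plotting routine as a heat map.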


1998 ◽  
Vol 54 (6) ◽  
pp. 1078-1084 ◽  
Author(s):  
Joel L. Sussman ◽  
Dawei Lin ◽  
Jiansheng Jiang ◽  
Nancy O. Manning ◽  
Jaime Prilusky ◽  
...  

The Protein Data Bank (PDB) at Brookhaven National Laboratory is a database containing experimentally determined three-dimensional structures of proteins, nucleic acids and other biological macromolecules, with approximately 8000 entries. Data are easily submitted via PDB's WWW-based tool AutoDep, in either mmCIF or PDB format, and are most conveniently examined via PDB's WWW-based tool 3DB Browser.


2015 ◽  
Vol 11 (1) ◽  
pp. 1-7 ◽  
Author(s):  
Michal Brylinski

Abstract The Protein Data Bank (PDB) undergoes an exponential expansion in terms of the number of macromolecular structures deposited every year. A pivotal question is how this rapid growth of structural information improves the quality of three-dimensional models constructed by contemporary bioinformatics approaches. To address this problem, we performed a retrospective analysis of the structural coverage of a representative set of proteins using remote homology detected by COMPASS and HHpred. We show that the number of proteins whose structures can be confidently predicted increased during a 9-year period between 2005 and 2014 on account of the PDB growth alone. Nevertheless, this encouraging trend slowed down noticeably around the year 2008 and has yielded insignificant improvements ever since. At the current pace, it is unlikely that the protein structure prediction problem will be solved in the near future using existing template-based modeling techniques. Therefore, further advances in experimental structure determination, qualitatively better approaches in fold recognition, and more accurate template-free structure prediction methods are desperately needed.


Author(s):  
Д.А. Тихонов ◽  
D.A. Tikhonov

In this study, the distribution of the torsion angles Ω between helical axes in pairs of connected helices found in known proteins was analyzed. The database of helical pairs was compiled from the Protein Data Bank according to rules suggested earlier. The database was analyzed in order to elaborate its classification and to find novel structural features in helix packing. It was subdivided into three subsets according to whether the projections of the helices cross on the parallel planes passing through the helical axes. It was shown that helical pairs without crossing projections are distributed along the whole range of angles Ω, although there are two maxima at Ω = 0° and Ω = 180°. Most helical pairs of this subset are formed by α-helices and 3₁₀-helices. The distribution of all helical pairs with crossing helix projections has a maximum at 20° < Ω < 25°. In this subset, most helical pairs are formed by α-helices. The distribution of only α-helical pairs with crossing axis projections has three maxima, at –50° < Ω < –25°, 20° < Ω < 25°, and 70° < Ω < 110°.


2020 ◽  
Vol 27 ◽  
Author(s):  
Marian Vincenzi ◽  
Flavia Anna Mercurio ◽  
Marilisa Leone

Background: Proteins present a modular organization made up of several domains. Apart from domains with catalytic functions, many others are crucial for recruiting interactors. The latter domains can be defined as "PIDs" (Protein Interaction Domains) and are responsible for pivotal outcomes in signal transduction and in an array of normal physiological and disease-related pathways. Targeting such PIDs with small molecules and peptides able to modulate their interaction networks may represent a valuable route to discover novel therapeutics. Objective: This work is a continuation of a very recent review describing PIDs able to recognize post-translationally modified peptide segments. In contrast, this second part is concerned with PIDs that interact with simple peptide sequences made up of standard amino acids. Method: Crucial structural information on different domain subfamilies and their interactomes was gained by a wide search in different databases available online (including the PDB (Protein Data Bank), Pfam (Protein family), and SMART (Simple Modular Architecture Research Tool)). PubMed was searched as well to explore the most recent literature related to the topic. Results and Conclusion: PIDs are multifaceted: they all have diverse structural features and can recognize several consensus sequences. PIDs can be linked to the onset and progression of different diseases, such as cancer or viral infections, and find applications in the field of personalized medicine. Many efforts have centered on peptide/peptidomimetic inhibitors of PID-mediated interactions, but much more work is needed to improve the drug-likeness and interaction affinities of the identified compounds.


Author(s):  
Joel L. Sussman ◽  
Frances C. Bernstein ◽  
Jiansheng Jiang ◽  
Michael Libeson ◽  
Dawei Lin ◽  
...  

Author(s):  
Krzysztof Szczepaniak ◽  
Adriana Bukala ◽  
Antonio Marinho da Silva Neto ◽  
Jan Ludwiczak ◽  
Stanislaw Dunin-Horkawicz

Abstract Motivation: Coiled coils are widespread protein domains involved in diverse processes ranging from providing structural rigidity to the transduction of conformational changes. They comprise two or more α-helices that are wound around each other to form a regular supercoiled bundle. Owing to this regularity, coiled-coil structures can be described with parametric equations, thus enabling the numerical representation of their properties, such as the degree and handedness of supercoiling, the rotational state of the helices, and the offset between them. These descriptors are invaluable in understanding the function of coiled coils and designing new structures of this type. The existing tools for such calculations require manual preparation of input and are therefore not suitable for high-throughput analyses. Results: To address this problem, we developed SamCC-Turbo, software for fully automated, per-residue measurement of coiled coils. By surveying the Protein Data Bank with SamCC-Turbo, we generated a comprehensive atlas of ∼50,000 coiled-coil regions. This machine learning-ready data set features precise measurements and decomposes coiled-coil structures into fragments characterized by various degrees of supercoiling. The potential applications of SamCC-Turbo are exemplified by analyses in which we reveal general structural features of coiled coils involved in functions requiring conformational plasticity. Finally, we discuss further directions in the prediction and modeling of coiled coils. Availability: SamCC-Turbo is available as a web server (https://lbs.cent.uw.edu.pl/samcc_turbo) and as a Python library (https://github.com/labstructbioinf/samcc_turbo), whereas the results of the Protein Data Bank scan can be browsed and downloaded at https://lbs.cent.uw.edu.pl/ccdb. Supplementary information: Supplementary data are available at Bioinformatics online.

