Protein Sequence Database: Latest Research Papers

HMD-AMP: Protein Language-Powered Hierarchical Multi-label Deep Forest for Annotating Antimicrobial Peptides

10.1101/2021.11.10.468157 ◽

2021 ◽

Author(s):

Qinze Yu ◽

Zhihang Dong ◽

Xingyu Fan ◽

Licheng Zong ◽

Yu Li

Keyword(s):

Deep Learning ◽

Antimicrobial Peptides ◽

Binary Classification ◽

Classification Task ◽

Protein Sequence Database ◽

Learning Methods ◽

Small Perturbations ◽

Label Protein ◽

Deep Forest ◽

Wet Lab

Identifying the targets of an antimicrobial peptide is a fundamental step in studying the innate immune response and combating antibiotic resistance, and more broadly, in precision medicine and public health. There have been extensive studies on statistical and computational approaches to identify (i) whether a peptide is an antimicrobial peptide (AMP) or a non-AMP and (ii) which targets these sequences are effective against (Gram-positive, Gram-negative, etc.). Despite the existing deep learning methods for this problem, most of them are unable to handle the small AMP classes (anti-insect, anti-parasite, etc.). More importantly, some AMPs can have multiple targets, which the previous methods fail to consider. In this study, we build a diverse and comprehensive multi-label protein sequence database by collecting and cleaning amino acid sequences from various AMP databases. To generate efficient representations and features for the small-class dataset, we take advantage of a protein language model trained on 250 million protein sequences. Based on that, we develop an end-to-end hierarchical multi-label deep forest framework, HMD-AMP, to annotate AMPs comprehensively. After identifying an AMP, it further predicts which targets the AMP can effectively kill from eleven available classes. Extensive experiments suggest that our framework outperforms state-of-the-art models in both the binary classification task and the multi-label classification task, especially on the minor classes. Compared with the previous deep learning methods, our method improves the performance on macro-AUROC by 11%. The model is robust against reduced features and small perturbations and produces promising results. We believe HMD-AMP contributes to future wet-lab investigations of the innate structural properties of different antimicrobial peptides and builds promising empirical underpinnings for precision medicine with antibiotics.
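
The headline result above is reported as macro-AUROC, a metric that averages the per-class AUROC without weighting, so a small class (for example anti-insect) counts as much as a large one. The following is a minimal sketch of that metric in plain Python. The function names and the toy label and score matrices are assumptions made for illustration; they are not the authors' code or data.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def macro_auroc(label_matrix, score_matrix):
    """Unweighted mean of per-class AUROC (rows: peptides, columns: classes)."""
    n_classes = len(label_matrix[0])
    per_class = [
        auroc(
            [row[c] for row in label_matrix],
            [row[c] for row in score_matrix],
        )
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes


# Hypothetical toy data: 4 peptides, 2 target classes (multi-label 0/1 matrix).
Y = [[1, 0], [0, 1], [1, 0], [0, 0]]
S = [[0.9, 0.2], [0.3, 0.8], [0.4, 0.1], [0.6, 0.3]]

if __name__ == "__main__":
    print(macro_auroc(Y, S))
```

Because each class contributes equally to the mean, a model that ignores rare classes is penalized more under macro-AUROC than under a sample-weighted average, which is presumably why the abstract singles it out for the minor classes.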

Assessing Protein Sequence Database Suitability Using De Novo Sequencing

Molecular & Cellular Proteomics ◽

10.1074/mcp.tir119.001752 ◽

2019 ◽

Vol 19 (1) ◽

pp. 198-208 ◽

Cited By ~ 6

Author(s):

Richard S. Johnson ◽

Brian C. Searle ◽

Brook L. Nunn ◽

Jason M. Gilmore ◽

Molly Phillips ◽

...

Keyword(s):

Protein Sequence ◽

De Novo ◽

De Novo Sequencing ◽

Sequence Database ◽

Protein Sequence Database

A sectioning and database enrichment approach for improved peptide spectrum matching in large, genome-guided protein sequence databases

10.1101/843078 ◽

2019 ◽

Author(s):

Praveen Kumar ◽

James E. Johnson ◽

Caleb Easterly ◽

Subina Mehta ◽

Ray Sajulga ◽

...

Keyword(s):

Mass Spectrometry ◽

Protein Sequence ◽

Sequence Database ◽

Sequencing Data ◽

Proteomics Data ◽

Step Method ◽

Protein Sequence Database ◽

Sectioning Method ◽

Wide Range ◽

Sequence Databases

Multi-omics approaches focused on mass-spectrometry (MS)-based data, such as metaproteomics, utilize genomic and/or transcriptomic sequencing data to generate a comprehensive protein sequence database. These databases can be very large, containing millions of sequences, which reduces the sensitivity of matching tandem mass spectrometry (MS/MS) data to sequences to generate peptide spectrum matches (PSMs). Here, we describe a sectioning method for generating a database enriched for those protein sequences that are most likely present in the sample. Our evaluation demonstrates how this method helps to increase the sensitivity of PSMs while maintaining acceptable false discovery rate (FDR) statistics. We demonstrate increased true positive PSM identifications using the sectioning method when compared to the traditional large-database searching method, and fewer false PSM identifications when compared to a previously described two-step method for reducing database size. The sectioning method for large sequence databases enables generation of an enriched protein sequence database and promotes increased sensitivity in identifying PSMs, while maintaining an acceptable and manageable FDR. Furthermore, implementation in the Galaxy platform provides access to a usable and automated workflow for carrying out the method. Our results show the utility of this methodology for a wide range of applications where genome-guided, large sequence databases are required for MS-based proteomics data analysis.
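
The abstract above does not spell out the sectioning procedure. One plausible reading is that the large database is split into smaller sections, each section is searched separately, and the proteins that receive matches are pooled into a smaller, enriched database. The sketch below illustrates only that general idea. Everything in it is an assumption: `toy_search` is a hypothetical substring matcher standing in for a real MS/MS search engine, and the protein IDs, sequences, and peptides are invented toy data.

```python
from typing import Dict, List, Set


def split_into_sections(db: Dict[str, str], n_sections: int) -> List[Dict[str, str]]:
    """Split a {protein_id: sequence} database into roughly equal sections."""
    items = list(db.items())
    return [dict(items[i::n_sections]) for i in range(n_sections)]


def toy_search(peptides: List[str], section: Dict[str, str]) -> Set[str]:
    """Hypothetical stand-in for a search engine.

    A protein counts as matched if it contains any observed peptide.
    A real workflow would score MS/MS spectra and apply FDR control here.
    """
    return {
        pid
        for pid, seq in section.items()
        if any(pep in seq for pep in peptides)
    }


def enriched_database(
    db: Dict[str, str], peptides: List[str], n_sections: int
) -> Dict[str, str]:
    """Search each section separately and pool the matched proteins."""
    hits: Set[str] = set()
    for section in split_into_sections(db, n_sections):
        hits |= toy_search(peptides, section)
    return {pid: seq for pid, seq in db.items() if pid in hits}


# Invented toy data for illustration only.
db = {
    "P1": "MKTAYIAKQR",
    "P2": "GAVLIMCFYW",
    "P3": "QRQISFVKSH",
    "P4": "TTTTTTTT",
}
peptides = ["AYIAK", "SFVK"]

if __name__ == "__main__":
    print(sorted(enriched_database(db, peptides, n_sections=2)))
```

In this toy run the enriched database keeps only the proteins that contain an observed peptide, so a final search would run against two sequences instead of four.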

Mining cytochrome p450 genes through next generation sequencing and metagenomic analysis from Binh Chau hot spring

TAP CHI SINH HOC ◽

10.15625/0866-7160/v41n3.10866 ◽

2019 ◽

Vol 41 (3) ◽

Author(s):

Nguyen Van Tung ◽

Nguyen Huy Hoang ◽

Nguyen Kim Thoa

Keyword(s):

Cytochrome P450 ◽

Hot Spring ◽

Putative Orfs ◽

Metagenomic Dna ◽

Protein Sequence Database ◽

Food Ingredients ◽

Bulk Chemicals ◽

Cytochrome P450 Genes ◽

Genes Encoding ◽

Generation Sequencing

Cytochrome P450s (CYPs) are one of the largest and most widely distributed enzyme families, catalyzing more than 20 different reactions. There is increasing recognition of the power of P450 biocatalysts for the industrial synthesis of pharmaceuticals, agrochemicals, bulk chemicals, food ingredients, etc. However, industrial processes run at high temperature, at high pressure, or in chemical solvents, so the enzymes that catalyze the bioconversion must have specific properties such as thermostability, chemical tolerance, or barophilicity. To date, the number of known thermostable P450s is limited. Metagenomic DNA techniques now make it possible to capture novel genes and unique enzymes from the microbial community of a particular ecological niche. In this paper, metagenomic DNA extracted from water samples from the Binh Chau hot spring was sequenced using Illumina's HiSeq platform and analysed to mine putative genes encoding cytochrome P450. The sequencing generated 9.4 Gb of reads containing 156,093 putative ORFs, of which 106,903 genes were annotated against the NCBI non-redundant protein sequence database. Among the annotated ORFs, 68 putative ORFs encoding cytochrome P450 were found, belonging to 36 specific groups of the cytochrome P450 protein family. Of these, the melting temperature (Tm) of thirty-six complete ORFs was predicted for a better understanding of their thermodynamic stability.

Mining cytochrome p450 genes through next generation sequencing and metagenomic analysis from Binh Chau hot spring

ACADEMIA JOURNAL OF BIOLOGY ◽

10.15625/2615-9023/v41n3.10866 ◽

2019 ◽

Vol 41 (3) ◽

Author(s):

Nguyen Van Tung ◽

Nguyen Huy Hoang ◽

Nguyen Kim Thoa

Keyword(s):

Cytochrome P450 ◽

Hot Spring ◽

Putative Orfs ◽

Metagenomic Dna ◽

Protein Sequence Database ◽

Food Ingredients ◽

Bulk Chemicals ◽

Cytochrome P450 Genes ◽

Genes Encoding ◽

Generation Sequencing

Cytochrome P450s (CYPs) are one of the largest and most widely distributed enzyme families, catalyzing more than 20 different reactions. There is increasing recognition of the power of P450 biocatalysts for the industrial synthesis of pharmaceuticals, agrochemicals, bulk chemicals, food ingredients, etc. However, industrial processes run at high temperature, at high pressure, or in chemical solvents, so the enzymes that catalyze the bioconversion must have specific properties such as thermostability, chemical tolerance, or barophilicity. To date, the number of known thermostable P450s is limited. Metagenomic DNA techniques now make it possible to capture novel genes and unique enzymes from the microbial community of a particular ecological niche. In this paper, metagenomic DNA extracted from water samples from the Binh Chau hot spring was sequenced using Illumina's HiSeq platform and analysed to mine putative genes encoding cytochrome P450. The sequencing generated 9.4 Gb of reads containing 156,093 putative ORFs, of which 106,903 genes were annotated against the NCBI non-redundant protein sequence database. Among the annotated ORFs, 68 putative ORFs encoding cytochrome P450 were found, belonging to 36 specific groups of the cytochrome P450 protein family. Of these, the melting temperature (Tm) of thirty-six complete ORFs was predicted for a better understanding of their thermodynamic stability.

The PSIPRED Protein Analysis Workbench: 20 years on

Nucleic Acids Research ◽

10.1093/nar/gkz297 ◽

2019 ◽

Vol 47 (W1) ◽

pp. W402-W407 ◽

Cited By ~ 177

Author(s):

Daniel W A Buchan ◽

David T Jones

Keyword(s):

Protein Sequence ◽

Web Site ◽

Web Server ◽

Protein Analysis ◽

Sequence Database ◽

Protein Sequence Database ◽

Predictive Algorithms ◽

Predictive Methods ◽

The Face ◽

Database Size

The PSIPRED Workbench is a web server that has offered a range of predictive methods to the bioscience community for 20 years. Here, we present the work we have completed to update the PSIPRED Protein Analysis Workbench and make it ready for the next 20 years. The main focus of our recent website upgrade work has been the acceleration of analyses in the face of increasing protein sequence database size. We additionally discuss new software, the new hardware infrastructure, our web services, and the web site. Lastly, we survey updates to some of the key predictive algorithms available through our website.

A curated gluten protein sequence database to support development of proteomics methods for determination of gluten in gluten-free foods

Journal of Proteomics ◽

10.1016/j.jprot.2017.03.026 ◽

2017 ◽

Vol 163 ◽

pp. 67-75 ◽

Cited By ~ 32

Author(s):

Sophie Bromilow ◽

Lee A. Gethings ◽

Mike Buckley ◽

Mike Bromley ◽

Peter R. Shewry ◽

...

Keyword(s):

Protein Sequence ◽

Gluten Protein ◽

Gluten Free ◽

Sequence Database ◽

Protein Sequence Database

SWhybrid: A Hybrid-Parallel Framework for Large-Scale Protein Sequence Database Search

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) ◽

10.1109/ipdps.2017.42 ◽

2017 ◽

Cited By ~ 2

Author(s):

Haidong Lan ◽

Weiguo Liu ◽

Yongchao Liu ◽

Bertil Schmidt

Keyword(s):

Protein Sequence ◽

Large Scale ◽

Database Search ◽

Sequence Database ◽

Protein Sequence Database ◽

Sequence Database Search

Enhanced sequence identification technique for protein sequence database mining with hybrid frequent pattern mining algorithm

International Journal of Data Mining and Bioinformatics ◽

10.1504/ijdmb.2016.10001625 ◽

2016 ◽

Vol 16 (3) ◽

pp. 205

Author(s):

J. Jeyabharathi ◽

D. Shanthi

Keyword(s):

Protein Sequence ◽

Pattern Mining ◽

Frequent Pattern Mining ◽

Frequent Pattern ◽

Database Mining ◽

Sequence Database ◽

Protein Sequence Database ◽

Mining Algorithm ◽

Sequence Identification ◽

Identification Technique

Enhanced sequence identification technique for protein sequence database mining with hybrid frequent pattern mining algorithm

International Journal of Data Mining and Bioinformatics ◽

10.1504/ijdmb.2016.080673 ◽

2016 ◽

Vol 16 (3) ◽

pp. 205 ◽

Cited By ~ 1

Author(s):

J. Jeyabharathi ◽

D. Shanthi

Keyword(s):

Protein Sequence ◽

Pattern Mining ◽

Frequent Pattern Mining ◽

Frequent Pattern ◽

Database Mining ◽

Sequence Database ◽

Protein Sequence Database ◽

Mining Algorithm ◽

Sequence Identification ◽

Identification Technique
