Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)

2018 ◽  
Author(s):  
Ehsaneddin Asgari ◽  
Alice McHardy ◽  
Mohammad R.K. Mofrad

ABSTRACT In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be used as the input to any downstream machine learning task in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct motif discovery problems which are experimentally verified, (2) a classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. In general, DiMotif obtained high recall scores while achieving F1 scores comparable to those of other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short-lists for further experimental investigation of motifs. In the classification-based evaluation, the extracted motifs could reliably detect the integrins, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores.
(ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new method of embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw k-mer features. Availability: Implementations of our method will be available under the Apache 2 licence at http://llp.berkeley.edu/dimotif and http://llp.berkeley.edu/protvecx.
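
The BPE idea that PPE builds on can be illustrated with a minimal sketch. It assumes plain deterministic merging (the most frequent adjacent pair is merged at each step) and does not implement the sampling framework the abstract describes; the function names `learn_merges`, `apply_merge` and `segment` are hypothetical and not taken from the authors' code.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn an ordered list of BPE-style merge operations from training sequences."""
    # Start from single amino-acid residues.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by the pair itself so runs are repeatable.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def segment(sequence, merges):
    """Segment an unseen sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens
```

Merges learned on one corpus can then be replayed on unseen sequences with `segment`, which mirrors the learn-then-apply workflow described above; a probabilistic variant would need an additional sampling step over alternative segmentations.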

PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e7055 ◽  
Author(s):  
Jason E. McDermott ◽  
John R. Cort ◽  
Ernesto S. Nakayasu ◽  
Jonathan N. Pruneda ◽  
Christopher Overall ◽  
...  

Background: Although pathogenic Gram-negative bacteria lack their own ubiquitination machinery, they have evolved or acquired virulence effectors that can manipulate the host ubiquitination process through structural and/or functional mimicry of host machinery. Many such effectors have been identified in a wide variety of bacterial pathogens that share little sequence similarity amongst themselves or with eukaryotic ubiquitin E3 ligases. Methods: To allow identification of novel bacterial E3 ubiquitin ligase effectors from protein sequences we have developed a machine learning approach, the SVM-based Identification and Evaluation of Virulence Effector Ubiquitin ligases (SIEVE-Ub). We extend the string kernel approach used previously to sequence classification by introducing reduced amino acid (RED) alphabet encoding for protein sequences. Results: We found that 14mer peptides with amino acids represented as simply either hydrophobic or hydrophilic provided the best models for discrimination of E3 ligases from other effector proteins with a receiver-operator characteristic area under the curve (AUC) of 0.90. When considering a subset of E3 ubiquitin ligase effectors that do not fall into known sequence-based families we found that the AUC was 0.82, demonstrating the effectiveness of our method at identifying novel functional family members. Feature selection was used to identify a parsimonious set of 10 RED peptides that provided good discrimination, and these peptides were found to be located in functionally important regions of the proteins involved in E2 and host target protein binding. Our general approach enables construction of models based on other effector functions. We used SIEVE-Ub to predict nine potential novel E3 ligases from a large set of bacterial genomes. SIEVE-Ub is available for download at https://doi.org/10.6084/m9.figshare.7766984.v1 or https://github.com/biodataganache/SIEVE-Ub for the most current version.
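
The reduced-alphabet k-mer features described above can be sketched as follows. This is an illustration only: the hydrophobic residue set `HYDROPHOBIC` is an assumed grouping (the abstract does not list the residues in each class), and `reduce_alphabet` and `kmer_counts` are hypothetical names, not part of SIEVE-Ub.

```python
from collections import Counter

# Assumed two-class grouping; the grouping actually used by SIEVE-Ub may differ.
HYDROPHOBIC = frozenset("AVLIMFWC")


def reduce_alphabet(sequence):
    """Map each residue to 'H' (hydrophobic) or 'P' (everything else)."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in sequence.upper())


def kmer_counts(sequence, k=14):
    """Count overlapping k-mers of the reduced-alphabet sequence."""
    reduced = reduce_alphabet(sequence)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))
```

Count vectors of this kind could serve as input features for a classifier such as an SVM; sequences shorter than `k` simply yield an empty count table.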


2020 ◽  
Vol 26 (24) ◽  
pp. 2807-2816 ◽  
Author(s):  
Yun Su Jang ◽  
Tímea Mosolygó

Bacteria within biofilms are more resistant to antibiotics and chemical agents than planktonic bacteria in suspension. Treatment of biofilm-associated infections inevitably involves high dosages and prolonged courses of antimicrobial agents; therefore, there is a potential risk of the development of antimicrobial resistance (AMR). Due to the high prevalence of AMR and its association with biofilm formation, investigation of more effective anti-biofilm agents is required. From ancient times, herbs and spices have been used to preserve foods, and their antimicrobial, anti-biofilm and anti-quorum sensing properties are well known. Moreover, phytochemicals exert their anti-biofilm properties at sub-inhibitory concentrations without providing the opportunity for the emergence of resistant bacteria or harming the host microbiota. With increasing scientific attention to natural phytotherapeutic agents, numerous experimental investigations have been conducted in recent years. The present paper aims to review the articles published in the last decade in order to summarize a) our current understanding of AMR in correlation with biofilm formation and b) the evidence of phytotherapeutic agents against bacterial biofilms and their mechanisms of action. The main focus has been put on herbal anti-biofilm compounds tested to date in association with Staphylococcus aureus, Pseudomonas aeruginosa and food-borne pathogens (Salmonella spp., Campylobacter spp., Listeria monocytogenes and Escherichia coli).


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Giovanni Scala ◽  
Antonio Federico ◽  
Dario Greco

Abstract Background: The investigation of molecular alterations associated with the conservation and variation of DNA methylation in eukaryotes is gaining interest in the biomedical research community. Among the different determinants of methylation stability, the DNA composition of the regions surrounding CpG sites has been shown to have a crucial role in the maintenance and establishment of methylation statuses. This aspect has been previously characterized in a quantitative manner by inspecting the nucleotide composition in the region. Research in this field still lacks a qualitative perspective, linked to the identification of certain sequences (or DNA motifs) related to particular DNA methylation phenomena. Results: Here we present a novel computational strategy based on short DNA motif discovery in order to characterize sequence patterns related to aberrant CpG methylation events. We provide our framework as a user-friendly, Shiny-based application, CpGmotifs, to easily retrieve and characterize DNA patterns related to CpG methylation in the human genome. Our tool supports the functional interpretation of deregulated methylation events by predicting transcription factor binding sites (TFBS) encompassing the identified motifs. Conclusions: CpGmotifs is open-source software. Its source code is available on GitHub at https://github.com/Greco-Lab/CpGmotifs and a ready-to-use Docker image is provided on DockerHub at https://hub.docker.com/r/grecolab/cpgmotifs.


2012 ◽  
Vol 56 (7) ◽  
pp. 3481-3491 ◽  
Author(s):  
Michael Widmann ◽  
Jürgen Pleiss ◽  
Peter Oelschlaeger

ABSTRACT Metallo-β-lactamases (MBLs) are enzymes that hydrolyze β-lactam antibiotics, resulting in bacterial resistance to these drugs. These proteins have caused concerns due to their facile transference, broad substrate spectra, and the absence of clinically useful inhibitors. To facilitate the classification, nomenclature, and analysis of MBLs, an automated database system was developed, the Metallo-β-Lactamase Engineering Database (MBLED) (http://www.mbled.uni-stuttgart.de). It contains information on MBLs retrieved from the NCBI peptide database while strictly following the nomenclature by Jacoby and Bush (http://www.lahey.org/Studies/) and the generally accepted class B β-lactamase (BBL) standard numbering scheme for MBLs. The database comprises 597 MBL protein sequences and enables systematic analyses of these sequences. A systematic analysis employing the database resulted in the generation of mutation profiles of assigned IMP- and VIM-type MBLs, the identification of five MBL protein entries from the NCBI peptide database that were inconsistent with the Jacoby and Bush nomenclature, and the identification of 15 new IMP candidates and 9 new VIM candidates. Furthermore, the database was used to identify residues with high mutation frequencies and variability (mutation hot spots) that were unexpectedly distant from the active site located in the ββ sandwich: positions 208 and 266 in the IMP family and positions 215 and 258 in the VIM family. We expect that the MBLED will be a valuable tool for systematically cataloguing and analyzing the increasing number of MBLs being reported.


Microbiology ◽  
2021 ◽  
Vol 167 (10) ◽  
Author(s):  
Mengting Shi ◽  
Yue Zheng ◽  
Xianghong Wang ◽  
Zhengjia Wang ◽  
Menghua Yang

Vibrio cholerae, the causative agent of cholera, uses a large number of coordinated transcriptional regulatory events to transition from its environmental reservoir to the host intestine, which is its preferred colonization site. Transcription of the mannose-sensitive haemagglutinin pilus (MSHA), which aids the persistence of V. cholerae in aquatic environments but causes its clearance by host immune defenses, was found to be regulated by an as yet unknown mechanism during the infection cycle of V. cholerae. In this study, genomic expression library screening revealed that two regulators, VC1371 and VcRfaH, are able to activate transcription of the MSHA operon. VC1371 is localized and active in the cell membrane. Deletion of the vc1371 or VcrfaH gene in V. cholerae resulted in less MshA protein production and less efficient biofilm formation compared to the wild-type strain. An adult mouse model showed that the mutants with vc1371 or VcrfaH deletion colonized less efficiently than the wild-type; the VcrfaH deletion mutant also showed lower colonization efficiency in the infant mouse model. The findings strongly suggest that the two regulators, VC1371 and VcRfaH, which are involved in the regulation of MSHA expression, play an important role in V. cholerae biofilm formation and colonization in mice.


Microbiology ◽  
2021 ◽  
Vol 167 (3) ◽  
Author(s):  
Sathi Mallick ◽  
Shanti Kiran ◽  
Tapas Kumar Maiti ◽  
Anindya S. Ghosh

Escherichia coli low-molecular-mass (LMM) penicillin-binding proteins (PBPs) help in hydrolysing the peptidoglycan fragments from their cell wall and recycling them back into the growing peptidoglycan matrix, in addition to their reported involvement in biofilm formation. Biofilms are external slime layers of extra-polymeric substances that sessile bacterial cells secrete to form a habitable niche for themselves. Here, we hypothesize the involvement of Escherichia coli LMM PBPs in regulating the nature of exopolysaccharides (EPS) prevailing in its extra-polymeric substances during biofilm formation. Therefore, this study includes the assessment of physiological characteristics of E. coli CS109 LMM PBP deletion mutants to address biofilm formation abilities, viability and surface adhesion. Finally, EPS from parent CS109 and its ΔPBP4 and ΔPBP5 mutants were purified and analysed for sugars present. Deletions of LMM PBPs reduced biofilm formation, bacterial adhesion and viability in biofilms. Deletions also diminished EPS production by ΔPBP4 and ΔPBP5 mutants, purification of which suggested an increased overall negative charge compared with their parent. Also, EPS analyses from both mutants revealed the appearance of an unusual sugar, xylose, that was absent in CS109. Accordingly, the reason for reduced biofilm formation in LMM PBP mutants may be speculated as the subsequent production of xylitol and a hindrance in the standard flow of the pentose phosphate pathway.


2017 ◽  
Author(s):  
Morgan N. Price ◽  
Adam P. Arkin

Abstract Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources that link protein sequences to scientific articles (Swiss-Prot, GeneRIF, and EcoCyc). PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/.


2021 ◽  
Author(s):  
Vasily V. Grinev ◽  
Mikalai M. Yatskou ◽  
Victor V. Skakun ◽  
Maryna K. Chepeleva ◽  
Petr V. Nazarov

Abstract Motivation: Modern methods of whole transcriptome sequencing accurately recover nucleotide sequences of RNA molecules present in cells and allow for determining their quantitative abundances. The coding potential of such molecules can be estimated using open reading frame (ORF) finding algorithms, implemented in a number of software packages. However, these algorithms show somewhat limited accuracy, are intended for single-molecule analysis and do not allow selecting proper ORFs in the case of long mRNAs containing multiple ORF candidates. Results: We developed a computational approach, a corresponding machine learning model and a package dedicated to automatic identification of the ORFs in large sets of human mRNA molecules. It is based on vectorization of nucleotide sequences into features, followed by classification using a random forest. The predictive model was validated on sets of human mRNA molecules from the NCBI RefSeq and Ensembl databases and demonstrated almost 95% accuracy in detecting true ORFs. The developed methods and pre-trained classification model were implemented in a powerful ORFhunteR computational tool that performs an automatic identification of true ORFs among a large set of human mRNA molecules. Availability and implementation: The developed open-source R package ORFhunteR is available for the community at GitHub repository (https://github.com/rfctbio-bsu/ORFhunteR), from Bioconductor (https://bioconductor.org/packages/devel/bioc/html/ORFhunteR.html) and as a web application (http://orfhunter.bsu.by).
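
To make the notion of multiple ORF candidates concrete, the sketch below enumerates candidates in the three forward reading frames of an mRNA. It is a generic start-to-stop scan written in Python for illustration, not the ORFhunteR implementation (which is an R package and additionally classifies candidates with a random forest); `find_orf_candidates` and `min_codons` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orf_candidates(mrna, min_codons=1):
    """Return (start, end) index pairs of ATG-to-stop candidates, end exclusive.

    For each in-frame stop codon only the first upstream ATG is reported,
    so nested candidates sharing the same stop are not listed separately.
    """
    seq = mrna.upper().replace("U", "T")  # accept RNA or DNA alphabets
    candidates = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Number of codons before the stop, including the start codon.
                if (i - start) // 3 >= min_codons:
                    candidates.append((start, i + 3))
                start = None
    return candidates
```

A long transcript typically yields several such candidates across frames, which is the situation where a downstream classifier is needed to choose the true ORF.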

