Motif Discovery in Protein Sequences

We are in the midst of an explosive increase in the number of DNA and protein sequences available for study, as various genome projects come online. This wealth of information offers important opportunities for understanding many biological processes and for developing new plant and animal models, and ultimately drugs, for human diseases, in addition to other applications of modern biotechnology. Unfortunately, sequences are accumulating at a pace that strains present methods for extracting significant biological information from them. As a consequence, there is much interest and effort in developing tools that can efficiently and automatically extract the relevant biological information from sequence data and make it available for use in biology and medicine. In this chapter, we describe one such method that we have developed based on algorithms from artificial intelligence research. We call this software tool MEME (Multiple Expectation-maximization for Motif Elicitation). It has the attractive property that it is an “unsupervised” discovery tool: it can identify motifs, such as regulatory sites in DNA and functional domains in proteins, from large or small groups of unaligned sequences. As we show below, motifs are a rich source of information about a dataset; they can be used to discover other homologs in a database, to identify protein subsets that contain one or more motifs, and to provide information for mutagenesis studies to elucidate structure and function in the protein family as well as its evolution. Learning tools are used to extract higher level biological patterns from lower level DNA and protein sequence data. In contrast, search tools such as BLAST (Basic Local Alignment Search Tool) take a given higher level pattern and find all items in a database that possess the pattern. Searching for items that have a certain pattern is intrinsically an easier problem than discovering what the pattern is from items that possess it. The patterns considered here are motifs, which for DNA data can be subsequences that interact with transcription factors, polymerases, and other proteins.
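The search side of this contrast can be illustrated with a minimal sketch. It assumes the motif is already known and is represented as a log-odds position weight matrix built from a few aligned sites; the matrix representation, the uniform background, the pseudocount, and the names `build_pwm` and `best_match` are illustrative assumptions, not details taken from the chapter. Discovery, the harder problem, would have to find both the matrix and the site positions without the aligned sites being given.

```python
import math

def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds position weight matrix from aligned, equal-length sites.

    Assumes a uniform background distribution over the alphabet.
    """
    width = len(sites[0])
    background = 1.0 / len(alphabet)
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pwm.append({
            letter: math.log2(
                ((column.count(letter) + pseudocount) / total) / background
            )
            for letter in alphabet
        })
    return pwm

def best_match(sequence, pwm):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(pwm)
    scored = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    score, start = max(scored)
    return start, score

# Hypothetical aligned sites, used only to illustrate the scan.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm = build_pwm(sites)
start, score = best_match("GGCGTATAATGCC", pwm)
print(start, round(score, 2))
```

Scanning is a single pass over the sequence once the matrix exists, which is why search is the easier of the two problems.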


A Graph-Theoretical Approach for Motif Discovery in Protein Sequences

IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017, Vol 14 (1), pp. 121-130. DOI: 10.1109/tcbb.2015.2511750

Cited By ~ 2

Author(s): Elena Czeizler, Tommi Hirvola, Kalle Karhu

Keyword(s): Theoretical Approach, Motif Discovery, Protein Sequences, Graph Theoretical Approach


Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)

2018. DOI: 10.1101/345843

Cited By ~ 1

Author(s): Ehsaneddin Asgari, Alice McHardy, Mohammad R.K. Mofrad

Keyword(s): Biofilm Formation, Motif Discovery, Protein Sequences, Variable Length, Experimental Investigations, Text Compression, Large Set, High Recall, Protein Motifs, Link Type

Abstract: In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework that allows for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning task in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct motif discovery problems which are experimentally verified, (2) a classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. In general, DiMotif obtained high recall scores while having an F1 score comparable to other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short lists for further experimental investigations on motifs. In the classification-based evaluation, the extracted motifs could reliably detect the integrins, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores.
(ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new method of embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw k-mer features. Availability: Implementations of our method will be available under the Apache 2 licence at http://llp.berkeley.edu/dimotif and http://llp.berkeley.edu/protvecx.
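The BPE-style merge learning that PPE is described as building on can be sketched roughly as follows. This is a plain deterministic BPE over amino-acid letters, not the authors' implementation: it omits the sampling framework that makes PPE probabilistic, and the function names, the toy sequences, and the lexicographic tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_merges(sequences, num_merges):
    """Learn up to `num_merges` merge operations from training sequences."""
    # Start with every sequence split into single residues.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically (an assumption).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges

def segment(sequence, merges):
    """Segment an unseen sequence by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

# Toy training set (hypothetical), then segmentation of an unseen sequence.
merges = learn_merges(["RGDSRGD", "ARGDK"], num_merges=2)
print(merges)
print(segment("KRGDA", merges))
```

Learning the merges once and replaying them on new input mirrors the abstract's point that segmentation steps learned on a large set can be applied to unseen sequences.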
