EDITORIAL Beyond sequence similarity – the curious case of GW/WG protein domain

Protein domain identification and improved sequence similarity searching using PSI-BLAST

Proteins Structure Function and Bioinformatics ◽

10.1002/prot.10175 ◽

2002 ◽

Vol 48 (4) ◽

pp. 672-681 ◽

Cited By ~ 37

Author(s):

Richard A. George ◽

Jaap Heringa

Keyword(s):

Sequence Similarity ◽

Protein Domain ◽

Similarity Searching ◽

Domain Identification

HACRE1, a recently inserted copia-like retrotransposon of sunflower (Helianthus annuus L.)

Genome ◽

10.1139/g09-064 ◽

2009 ◽

Vol 52 (11) ◽

pp. 904-911 ◽

Cited By ~ 12

Author(s):

M. Buti ◽

T. Giordani ◽

M. Vukich ◽

L. Gentzbittel ◽

L. Pistelli ◽

...

Keyword(s):

Helianthus Annuus ◽

Sequence Similarity ◽

Protein Domain ◽

Long Terminal Repeats ◽

High Sequence Identity ◽

Nonsense Mutations ◽

Reading Frame ◽

Sequence Identity ◽

Isolation And Characterization ◽

Copia Retrotransposon

In this paper we report on the isolation and characterization, for the first time, of a complete 6511 bp retrotransposon of sunflower. Considering its protein domain order and sequence similarity to other copia elements of dicotyledons, this retrotransposon was assigned to the copia retrotransposon superfamily and named HACRE1 ( Helianthus annuus copia-like retroelement 1). HACRE1 carries 5′ and 3′ long terminal repeats (LTRs) flanking an internal region of 4661 bp. The LTRs are identical in their sequence except for two deletions of 7 and 5 nucleotides in the 5′ LTR. Based on the sequence identity of the LTRs, HACRE1 was estimated to have inserted within the last ∼84 000 years. The isolated sequence contains a complete open reading frame with only one complete reading frame. The absence of nonsense mutations agrees with the very high sequence identity between LTRs, confirming that HACRE1 insertion is recent. The haploid genome of sunflower (inbred line HCM) contains about 160 copies of HACRE1. This retrotransposon is expressed in leaflets from 7-day-old plantlets under different light conditions, probably in relation to the occurrence of many putative light-related regulatory cis-elements in the LTRs. However, sequenced cDNAs show less variability than HACRE1 genomic sequences, indicating that only a subset of this family is expressed under these conditions.

Protein domain architectures provide a fast, efficient and scalable alternative to sequence-based methods for comparative functional genomics

F1000Research ◽

10.12688/f1000research.9416.1 ◽

2016 ◽

Vol 5 ◽

pp. 1987 ◽

Cited By ~ 7

Author(s):

Jasper J. Koehorst ◽

Edoardo Saccenti ◽

Peter J. Schaap ◽

Vitor A. P. Martins dos Santos ◽

Maria Suarez-Diez

Keyword(s):

Comparative Analysis ◽

Large Scale ◽

Sequence Similarity ◽

Computational Cost ◽

Protein Domain ◽

Gene Acquisition ◽

Bacterial Fitness ◽

Efficient Alternative ◽

Comparative Functional Genomics ◽

High Computational Cost

A functional comparative genome analysis is essential to understand the mechanisms underlying bacterial evolution and adaptation. Detection of functional orthologs using standard global sequence similarity methods faces several problems: the need for defining arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition, and the high computational cost of finding bi-directional best matches at a large scale. We investigated the use of protein domain architectures for large scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 bacterial genomes sampled at different taxonomic levels. We show that protein domain architectures provide a fast and efficient alternative to methods based on sequence similarity to identify groups of functionally equivalent proteins within and across taxonomic boundaries. As the computational cost scales linearly, and not quadratically, with the number of genomes, it is suitable for large scale comparative analysis. Running both methods in parallel pinpoints potential functional adaptations that may add to bacterial fitness.
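The scaling argument in this abstract can be illustrated with a small sketch. It is not the authors' pipeline: the function names and the toy domain identifiers are hypothetical, and it only assumes that an all-vs-all best-match search needs one comparison per unordered genome pair, whereas domain annotation is run once per genome and proteins are then grouped by their ordered list of domains.

```python
from collections import defaultdict


def pairwise_comparisons(n_genomes: int) -> int:
    """Unordered genome pairs needed for an all-vs-all best-match search."""
    return n_genomes * (n_genomes - 1) // 2


def per_genome_annotations(n_genomes: int) -> int:
    """Domain annotation is run once per genome, so the count grows linearly."""
    return n_genomes


def group_by_architecture(proteins: dict) -> dict:
    """Group protein IDs that share the same ordered tuple of domains.

    `proteins` maps a protein ID to its list of domain identifiers
    (hypothetical labels here, not real database accessions).
    """
    groups = defaultdict(list)
    for protein_id, domains in proteins.items():
        groups[tuple(domains)].append(protein_id)
    return dict(groups)


if __name__ == "__main__":
    for n in (10, 100, 446):
        print(n, pairwise_comparisons(n), per_genome_annotations(n))

    toy = {"protA": ["D1", "D2"], "protB": ["D1", "D2"], "protC": ["D2"]}
    print(group_by_architecture(toy))
```

For the 446 genomes mentioned in the abstract, this counts 99,235 genome pairs against 446 annotation runs, which is the gap the linear-versus-quadratic statement refers to.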

Protein domain architectures provide a fast, efficient and scalable alternative to sequence-based methods for comparative functional genomics

F1000Research ◽

10.12688/f1000research.9416.3 ◽

2017 ◽

Vol 5 ◽

pp. 1987 ◽

Cited By ~ 2

Author(s):

Jasper J. Koehorst ◽

Edoardo Saccenti ◽

Peter J. Schaap ◽

Vitor A. P. Martins dos Santos ◽

Maria Suarez-Diez

Keyword(s):

Comparative Analysis ◽

Large Scale ◽

Sequence Similarity ◽

Computational Cost ◽

Protein Domain ◽

Gene Acquisition ◽

Bacterial Fitness ◽

Efficient Alternative ◽

Comparative Functional Genomics ◽

High Computational Cost

A functional comparative genome analysis is essential to understand the mechanisms underlying bacterial evolution and adaptation. Detection of functional orthologs using standard global sequence similarity methods faces several problems: the need for defining arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition, and the high computational cost of finding bi-directional best matches at a large scale. We investigated the use of protein domain architectures for large scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 bacterial genomes sampled at different taxonomic levels. We show that protein domain architectures provide a fast and efficient alternative to methods based on sequence similarity to identify groups of functionally equivalent proteins within and across taxonomic boundaries, and that the approach is suitable for large scale comparative analysis. Running both methods in parallel pinpoints potential functional adaptations that may add to bacterial fitness.

Large-scale sequence similarity analysis reveals the scope of sequence and function divergence in PilZ domain proteins

10.1101/2020.02.11.943704 ◽

2020 ◽

Author(s):

Qing Wei Cheang ◽

Shuo Sheng ◽

Linghui Xu ◽

Zhao-Xun Liang

Keyword(s):

Large Scale ◽

Sequence Similarity ◽

Protein Domain ◽

Divergent Evolution ◽

Cellular Functions ◽

Vast Number ◽

Future Studies ◽

Function Relationship ◽

And Function ◽

Scale Sequence

Abstract: PilZ domain-containing proteins constitute a superfamily of widely distributed bacterial signalling proteins. Although studies have established the canonical PilZ domain as an adaptor protein domain evolved to specifically bind the second messenger c-di-GMP, mounting evidence suggests that the PilZ domain has undergone enormous divergent evolution to generate a superfamily of proteins that are characterized by a wide range of c-di-GMP-binding affinity, binding partners and cellular functions. The divergent evolution has even generated families of non-canonical PilZ domains that completely lack c-di-GMP binding ability. In this study, we performed a large-scale sequence analysis on more than 28,000 single- and di-domain PilZ proteins using the sequence similarity networking tool created originally to analyse functionally diverse enzyme superfamilies. The sequence similarity networks (SSN) generated by the analysis feature a large number of putative isofunctional protein clusters, and thus provide an unprecedented panoramic view of the sequence-function relationship and function diversification in PilZ proteins. Some of the protein clusters in the networks are considered unexplored clusters that contain proteins with completely unknown biological function, whereas others contain one, two or a few functionally known proteins, thereby enabling us to infer the cellular function of uncharacterized homologs or orthologs. With the ultimate goal of elucidating the diverse roles played by PilZ proteins in bacterial signal transduction, the work described here will facilitate the annotation of the vast number of PilZ proteins encoded by bacterial genomes and help to prioritize functionally unknown PilZ proteins for future studies.

Importance: Although the PilZ domain is best known as the protein domain evolved specifically for the binding of the second messenger c-di-GMP, divergent evolution has generated a superfamily of PilZ proteins with a diversity of ligand or protein-binding properties and cellular functions. We analysed the sequences of more than 28,000 PilZ proteins using the sequence similarity networking (SSN) tool to yield a global view of the sequence-function relationship and function diversification in PilZ proteins. The results will facilitate the annotation of the vast number of PilZ proteins encoded by bacterial genomes and help us prioritize PilZ proteins for future studies.
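As a rough illustration of how a sequence similarity network yields clusters, the sketch below treats proteins as nodes, keeps an edge only when a pairwise similarity score meets a chosen threshold, and reports connected components as candidate clusters. This is a generic simplification, not the tool used in the study: the function name, the score values and the threshold are all assumptions made for the example.

```python
from collections import defaultdict


def ssn_clusters(nodes, scored_pairs, threshold):
    """Return connected components of a thresholded similarity graph.

    nodes        : iterable of protein IDs
    scored_pairs : iterable of (id_a, id_b, score) tuples (hypothetical scores)
    threshold    : minimum score for an edge to be kept
    """
    # Build an undirected adjacency map from edges that pass the threshold.
    adjacency = defaultdict(set)
    for a, b, score in scored_pairs:
        if score >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Depth-first traversal to collect each connected component once.
    seen = set()
    clusters = []
    for node in nodes:
        if node in seen:
            continue
        seen.add(node)
        stack = [node]
        component = []
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


if __name__ == "__main__":
    proteins = ["p1", "p2", "p3", "p4"]
    pairs = [("p1", "p2", 80), ("p2", "p3", 30), ("p3", "p4", 75)]
    print(ssn_clusters(proteins, pairs, threshold=50))
    print(ssn_clusters(proteins, pairs, threshold=20))
```

Raising the threshold splits the network into smaller groups and lowering it merges them, so the choice of threshold largely determines how finely clusters are resolved.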

Faculty Opinions recommendation of Protein domain identification and improved sequence similarity searching using PSI-BLAST.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1009405.125959 ◽

2002 ◽

Author(s):

Burkhard Rost

Keyword(s):

Sequence Similarity ◽

Protein Domain ◽

Similarity Searching ◽

Domain Identification

Maize GO Annotation - Methods, Evaluation, and Review (maize-GAMER)

10.1101/222836 ◽

2017 ◽

Author(s):

Kokulapalan Wimalanathan ◽

Iddo Friedberg ◽

Carson M. Andorf ◽

Carolyn J. Lawrence-Dill

Keyword(s):

Gold Standard ◽

Functional Annotation ◽

Sequence Similarity ◽

Protein Domain ◽

High Coverage ◽

Protein Coding ◽

Go Annotation ◽

Protein Coding Genes ◽

Standard Set ◽

Maize Protein

Summary: We created a new high-coverage, robust, and reproducible functional annotation of maize protein coding genes based on Gene Ontology (GO) term assignments. Whereas the existing Phytozome and Gramene maize GO annotation sets only cover 41% and 56% of maize protein coding genes, respectively, this study provides annotations for 100% of the genes. We also compared the quality of our newly-derived annotations with the existing Gramene and Phytozome functional annotation sets by comparing all three to a manually annotated gold standard set of 1,619 genes where annotations were primarily inferred from direct assay or mutant phenotype. Evaluations based on the gold standard indicate that our new annotation set is measurably more accurate than those from Phytozome and Gramene. To derive this new high-coverage, high-confidence annotation set we used sequence-similarity and protein-domain-presence methods as well as mixed-method pipelines that were developed for the Critical Assessment of Function Annotation (CAFA) challenge. Our project to improve maize annotations is called maize-GAMER (GO Annotation Method, Evaluation, and Review) and the newly-derived annotations are accessible via MaizeGDB (http://download.maizegdb.org/maize-GAMER) and CyVerse (B73 RefGen_v3 5b+ at doi: doi.org/10.7946/P2S62P and B73 RefGen_v4 Zm00001d.2 at doi: doi.org/10.7946/P2M925).

Protein domain architectures provide a fast, efficient and scalable alternative to sequence-based methods for comparative functional genomics

F1000Research ◽

10.12688/f1000research.9416.2 ◽

2016 ◽

Vol 5 ◽

pp. 1987 ◽

Cited By ~ 6

Author(s):

Jasper J. Koehorst ◽

Edoardo Saccenti ◽

Peter J. Schaap ◽

Vitor A. P. Martins dos Santos ◽

Maria Suarez-Diez

Keyword(s):

Comparative Analysis ◽

Large Scale ◽

Sequence Similarity ◽

Computational Cost ◽

Protein Domain ◽

Gene Acquisition ◽

Bacterial Fitness ◽

Efficient Alternative ◽

Comparative Functional Genomics ◽

High Computational Cost

A functional comparative genome analysis is essential to understand the mechanisms underlying bacterial evolution and adaptation. Detection of functional orthologs using standard global sequence similarity methods faces several problems: the need for defining arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition, and the high computational cost of finding bi-directional best matches at a large scale. We investigated the use of protein domain architectures for large scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 bacterial genomes sampled at different taxonomic levels. We show that protein domain architectures provide a fast and efficient alternative to methods based on sequence similarity to identify groups of functionally equivalent proteins within and across taxonomic boundaries, and that the approach is suitable for large scale comparative analysis. Running both methods in parallel pinpoints potential functional adaptations that may add to bacterial fitness.

Protein Structure-Guided Hidden Markov Models (HMMs) as A Powerful Method in the Detection of Ancestral Endogenous Viral Elements

Viruses ◽

10.3390/v11040320 ◽

2019 ◽

Vol 11 (4) ◽

pp. 320 ◽

Cited By ~ 1

Author(s):

Heleri Kirsip ◽

Aare Abroi

Keyword(s):

Protein Structure ◽

Hidden Markov Models ◽

Markov Models ◽

Rna Viruses ◽

Hidden Markov ◽

Sequence Similarity ◽

Genetic Material ◽

Protein Domain ◽

Transfer Event ◽

Endogenous Viral Elements

It has long been believed that the transfer and fixation of genetic material from RNA viruses to eukaryote genomes is very unlikely. However, during the last decade, several cases of “virus-to-host” gene transfer from various viral families into various eukaryotic phyla have been described. These transfers have been identified by sequence similarity, which may disappear very quickly, especially in the case of RNA viruses. However, compared to sequences, protein structure is known to be more conserved. Applying protein structure-guided protein domain-specific Hidden Markov Models, we detected homologues of the Virgaviridae capsid protein in Schizophora flies. Further data analysis supported “virus-to-host” transfer into Schizophora ancestors as a single transfer event. This transfer was not identifiable by BLAST or by other methods we applied. Our data show that structure-guided Hidden Markov Models should be used to detect ancestral virus-to-host transfers.

RRE-Finder: A Genome-Mining Tool for Class-Independent RiPP Discovery

10.1101/2020.03.14.992123 ◽

2020 ◽

Cited By ~ 5

Author(s):

Alexander M. Kloosterman ◽

Kyle E. Shelton ◽

Gilles P. van Wezel ◽

Marnix H. Medema ◽

Douglas A. Mitchell

Keyword(s):

Sequence Similarity ◽

Genome Mining ◽

Sequence Divergence ◽

High Sensitivity ◽

Gene Clusters ◽

Protein Domain ◽

Post Translational Modification ◽

Protein Database ◽

Recognition Element ◽

A Genome

Abstract: Nearly half of the classes of natural products known as ribosomally synthesized and post-translationally modified peptides (RiPPs) are reliant on a protein domain called the RiPP recognition element (RRE) for peptide maturation. The RRE binds specifically to a linear precursor peptide and directs the post-translational modification enzymes to their substrate. Given its prevalence across various types of RiPP biosynthetic gene clusters (BGCs), the RRE could theoretically be used as a bioinformatic handle to identify novel classes of RiPPs. In addition, due to the high affinity and specificity of most RRE:precursor peptide complexes, a thorough understanding of the RRE domain could be exploited for biotechnological applications. However, sequence divergence of the RRE domain across RiPP classes has precluded automated identification of RREs based solely on sequence similarity. Here, we introduce RRE-Finder, a novel tool for identifying RRE domains with high sensitivity. RRE-Finder can be used in “precision” mode to confidently identify RREs in a class-specific manner or in “exploratory” mode, which was designed to assist in the discovery of novel RiPP classes. RRE-Finder operating in precision mode on the UniProtKB protein database retrieved over 30,000 high-confidence RREs spanning all characterized RRE-dependent RiPP classes, as well as several yet-uncharacterized, putatively novel RiPP gene cluster architectures that will require future experimental work. Finally, RRE-Finder was used in precision mode to explore a possible evolutionary origin of the RRE domain. Altogether, RRE-Finder provides a powerful new method to probe RiPP biosynthetic diversity and delivers a rich dataset of RRE sequences that will provide a foundation for deeper biochemical studies into this intriguing and versatile protein domain.
