Mantis: flexible and consensus-driven genome annotation

Pedro Queirós; Francesco Delogu; Oskar Hickl; Patrick May; Paul Wilmes

doi:10.1093/gigascience/giab042

GigaScience ◽

2021 ◽

Vol 10 (6) ◽

Keyword(s):

Protein Function ◽

Reference Data ◽

Search Algorithm ◽

Rapid Development ◽

Data Sources ◽

Annotation Tool ◽

High Quality ◽

High Coverage ◽

Function Annotation ◽

Protein Function Annotation

Abstract

Background: The rapid development of the (meta-)omics fields has produced an unprecedented amount of high-resolution and high-fidelity data. Through the use of these datasets we can infer the role of previously functionally unannotated proteins from single organisms and consortia. In this context, protein function annotation can be described as the identification of regions of interest (i.e., domains) in protein sequences and the assignment of biological functions. Despite the existence of numerous tools, challenges remain in terms of speed, flexibility, and reproducibility. In the big data era, it is also increasingly important to stop limiting our findings to a single reference and instead to coalesce knowledge from different data sources, thus overcoming some of the limitations of relying too heavily on computationally generated data from single sources.

Results: We implemented a protein annotation tool, Mantis, which uses the intersection of database identifiers and text mining to integrate knowledge from multiple reference data sources into a single consensus-driven output. Mantis is flexible, allowing for the customization of reference data and execution parameters, and is reproducible across different research goals and user environments. We implemented a depth-first search algorithm for domain-specific annotation, which significantly improved annotation performance compared to sequence-wide annotation. The parallelized implementation of Mantis results in short runtimes while also outputting high-coverage and high-quality protein function annotations.

Conclusions: Mantis is a protein function annotation tool that produces high-quality consensus-driven protein annotations. It is easy to set up, customize, and use, scaling from single genomes to large metagenomes. Mantis is available under the MIT license at https://github.com/PedroMTQ/mantis.
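The abstract above mentions integrating hits from several reference sources through shared database identifiers. As a rough illustration only, and not Mantis's actual implementation, the sketch below groups sources whose hits share at least one identifier, using a small union-find. The function name `consensus_groups`, the source names, and the identifiers are all hypothetical.

```python
from itertools import combinations


def consensus_groups(hits_by_source):
    """Group reference sources whose hits share at least one identifier.

    hits_by_source maps a (hypothetical) source name to the set of
    database identifiers attached to that source's hit for one protein.
    """
    sources = sorted(hits_by_source)
    parent = {s: s for s in sources}

    def find(s):
        # Follow parent links to the group representative (with path halving).
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Merge any two sources whose identifier sets intersect.
    for a, b in combinations(sources, 2):
        if hits_by_source[a] & hits_by_source[b]:
            parent[find(a)] = find(b)

    groups = {}
    for s in sources:
        groups.setdefault(find(s), []).append(s)
    # Largest group first: the sources that agree with each other most.
    return sorted(groups.values(), key=lambda g: (-len(g), g))


# Hypothetical example: two sources agree through a shared identifier.
example = {
    "source_a": {"ID:001", "ID:002"},
    "source_b": {"ID:002"},
    "source_c": {"ID:999"},
}
print(consensus_groups(example))  # [['source_a', 'source_b'], ['source_c']]
```

In this sketch agreement is transitive, so two sources can end up in the same group through a third one. Whether Mantis treats agreement this way, and how it combines it with text mining, is not stated in the abstract.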

Mantis: flexible and consensus-driven genome annotation

10.1101/2020.11.02.360933 ◽

2020 ◽

Author(s):

Pedro Queirós ◽

Francesco Delogu ◽

Oskar Hickl ◽

Patrick May ◽

Paul Wilmes

Keyword(s):

Protein Function ◽

Reference Data ◽

Search Algorithm ◽

Rapid Development ◽

Data Sources ◽

Annotation Tool ◽

High Quality ◽

Function Annotation ◽

Protein Function Annotation ◽

Set Up

Abstract

Background: The past decades have seen a rapid development of the (meta-)omics fields, producing an unprecedented amount of data. Through the use of well-characterized datasets we can infer the role of previously functionally unannotated proteins from single organisms and consortia. In this context, protein function annotation allows the identification of regions of interest (i.e., domains) in protein sequences and the assignment of biological functions. Despite the existence of numerous tools, some challenges remain, specifically in terms of speed, flexibility, and reproducibility. In the era of big data it also becomes increasingly important to stop limiting our findings to a single reference and instead to coalesce knowledge from different data sources, thus overcoming some of the limitations of relying too heavily on computationally generated data.

Results: We implemented a protein annotation tool, Mantis, which uses text mining to integrate knowledge from multiple reference data sources into a single consensus-driven output. Mantis is flexible, allowing for total customization of the reference data used, adaptable, and reproducible across different research goals and user environments. We implemented a depth-first search algorithm for domain-specific annotation, which led to an average 0.038 increase in precision when compared to sequence-wide annotation. Mantis is fast, annotating an average genome in 25-40 minutes, whilst also outputting high-quality annotations (average coverage 81.4%, average precision 0.892).

Conclusions: Mantis is a protein function annotation tool that produces high-quality consensus-driven protein annotations. It is easy to set up, customize, and use, scaling from single genomes to large metagenomes. Mantis is available under the MIT license at https://github.com/PedroMTQ/mantis.
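The depth-first search for domain-specific annotation mentioned above can be pictured as exploring combinations of non-overlapping hits along a protein sequence. The sketch below is a minimal illustration under stated assumptions, not the authors' algorithm: hits are hypothetical `(name, start, end, score)` tuples with 1-based inclusive coordinates, and the "best" combination is assumed to be the one with the highest summed score, which the abstract does not specify.

```python
def best_non_overlapping(hits):
    """Depth-first search over combinations of non-overlapping hits.

    Each hit is (name, start, end, score) with 1-based inclusive
    coordinates. Returns the combination with the highest total score
    (an assumed scoring rule, chosen for illustration).
    """
    ordered = sorted(hits, key=lambda h: (h[1], h[2]))
    best = {"score": 0.0, "combo": []}

    def dfs(index, last_end, combo, score):
        # Record the current path if it beats the best one seen so far.
        if score > best["score"]:
            best["score"] = score
            best["combo"] = list(combo)
        # Try extending the path with every later hit that does not overlap.
        for i in range(index, len(ordered)):
            _, start, end, hit_score = ordered[i]
            if start > last_end:
                combo.append(ordered[i])
                dfs(i + 1, end, combo, score + hit_score)
                combo.pop()

    dfs(0, 0, [], 0.0)
    return best["combo"]


# Hypothetical hits: domA and domC fit side by side, domB overlaps both.
example_hits = [
    ("domA", 1, 100, 50.0),
    ("domB", 80, 200, 60.0),
    ("domC", 101, 200, 40.0),
]
print([h[0] for h in best_non_overlapping(example_hits)])  # ['domA', 'domC']
```

A sequence-wide approach would keep only the single best hit (here `domB`), while the combination search recovers two compatible domains. The exhaustive search is exponential in the worst case, so a real tool would need pruning or limits that this sketch omits.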

A combined approach for genome wide protein function annotation/prediction

Proteome Science ◽

10.1186/1477-5956-11-s1-s1 ◽

2013 ◽

Vol 11 (Suppl 1) ◽

pp. S1 ◽

Cited By ~ 17

Author(s):

Alfredo Benso ◽

Stefano Di Carlo ◽

Hafeez ur Rehman ◽

Gianfranco Politano ◽

Alessandro Savino ◽

...

Keyword(s):

Protein Function ◽

Combined Approach ◽

Function Annotation ◽

Protein Function Annotation ◽

Genome Wide

Protein function annotation using protein domain family resources

Methods ◽

10.1016/j.ymeth.2015.09.029 ◽

2016 ◽

Vol 93 ◽

pp. 24-34 ◽

Cited By ~ 17

Author(s):

Sayoni Das ◽

Christine A. Orengo

Keyword(s):

Protein Function ◽

Protein Domain ◽

Domain Family ◽

Family Resources ◽

Function Annotation ◽

Protein Function Annotation ◽

Protein Domain Family

The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications

BMC Genomics ◽

10.1186/1471-2164-9-s2-s2 ◽

2008 ◽

Vol 9 (Suppl 2) ◽

pp. S2 ◽

Cited By ~ 24

Author(s):

Inbal Halperin ◽

Dariya S Glazer ◽

Shirley Wu ◽

Russ B Altman

Keyword(s):

Protein Function ◽

Function Annotation ◽

Protein Function Annotation ◽

Novel Applications

The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation

BMC Bioinformatics ◽

10.1186/1471-2105-9-52 ◽

2008 ◽

Vol 9 (1) ◽

pp. 52 ◽

Cited By ~ 23

Author(s):

Chenggang Yu ◽

Nela Zavaljevski ◽

Valmik Desai ◽

Seth Johnson ◽

Fred J Stevens ◽

...

Keyword(s):

Protein Function ◽

Function Annotation ◽

Protein Function Annotation ◽

Genome Wide ◽

Automated Pipeline

Improving Protein Function Annotation via Unsupervised Pre-training: Robustness, Efficiency, and Insights

Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining ◽

10.1145/3447548.3467163 ◽

2021 ◽

Author(s):

David Dohan ◽

Andreea Gane ◽

Maxwell L. Bileschi ◽

David Belanger ◽

Lucy Colwell

Keyword(s):

Protein Function ◽

Function Annotation ◽

Protein Function Annotation

Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning

Briefings in Bioinformatics ◽

10.1093/bib/bbz081 ◽

2019 ◽

Vol 21 (4) ◽

pp. 1437-1447 ◽

Cited By ~ 19

Author(s):

Jiajun Hong ◽

Yongchao Luo ◽

Yang Zhang ◽

Junbiao Ying ◽

Weiwei Xue ◽

...

Keyword(s):

Deep Learning ◽

False Discovery Rate ◽

Protein Function ◽

Functional Annotation ◽

De Novo ◽

Learning Algorithm ◽

Function Annotation ◽

False Discovery ◽

Protein Function Annotation ◽

Annotation Accuracy

Abstract

Functional annotation of protein sequences with high accuracy has become one of the most important issues in modern biomedical studies, and computational approaches that significantly accelerate the analysis and improve accuracy are greatly desired. Although a variety of methods have been developed to improve protein annotation accuracy, their ability to control false annotation rates remains either limited or not systematically evaluated. In this study, a protein encoding strategy, together with a deep learning algorithm, was proposed to control the false discovery rate in protein function annotation, and its performance was systematically compared with that of traditional similarity-based and de novo approaches. Based on a comprehensive assessment from multiple perspectives, the proposed strategy and algorithm were found to perform better in both prediction stability and annotation accuracy than other de novo methods. Moreover, an in-depth assessment revealed an improved capacity to control the false discovery rate compared with traditional methods. Overall, this study provides both a comprehensive analysis of the performance of the newly proposed strategy and a tool for researchers in the field of protein function annotation.

Functional classification of CATH superfamilies: a domain-based approach for protein function annotation

Bioinformatics ◽

10.1093/bioinformatics/btv398 ◽

2015 ◽

Vol 31 (21) ◽

pp. 3460-3467 ◽

Cited By ~ 43

Author(s):

Sayoni Das ◽

David Lee ◽

Ian Sillitoe ◽

Natalie L. Dawson ◽

Jonathan G. Lees ◽

...

Keyword(s):

Protein Function ◽

Functional Classification ◽

Function Annotation ◽

Protein Function Annotation

Electrostatics-based computational methods for understanding polymerase mechanism and for protein function annotation

10.17760/d20005099 ◽

2014 ◽

Author(s):

Ramya Parasuram

Keyword(s):

Computational Methods ◽

Protein Function ◽

Function Annotation ◽

Protein Function Annotation

SwiftOrtho: a Fast, Memory-Efficient, Multiple Genome Orthology Classifier

10.1101/543223 ◽

2019 ◽

Author(s):

Xiao Hu ◽

Iddo Friedberg

Keyword(s):

Protein Function ◽

Large Scale ◽

Comparative Genomic ◽

Analysis Tool ◽

Bacterial Genomes ◽

Function Annotation ◽

Large Scale Data ◽

Protein Function Annotation ◽

Genome Analyses ◽

Memory Efficient

Abstract

Introduction: Gene homology type classification is a requisite for many types of genome analyses, including comparative genomics, phylogenetics, and protein function annotation. A large variety of tools have been developed to perform homology classification across genomes of different species. However, when applied to large genomic datasets, these tools require high memory and CPU usage, typically available only in costly computational clusters. To address this problem, we developed a new graph-based orthology analysis tool, SwiftOrtho, which is optimized for speed and memory usage when applied to large-scale data.

Results: In our tests, SwiftOrtho is the only tool that completed orthology analysis of 1,760 bacterial genomes on a computer with only 4 GB RAM. Using various standard orthology datasets, we also show that SwiftOrtho has a high accuracy. SwiftOrtho enables the accurate comparative genomic analyses of thousands of genomes using low-memory computers.

Availability: https://github.com/Rinoahu/SwiftOrtho
