Double triage to identify poorly annotated genes in maize: The missing link in community curation

Mapping Intimacies ◽

10.1101/654848 ◽

2019 ◽

Author(s):

Marcela K. Tello-Ruiz ◽

Cristina F. Marco ◽

Fei-Man Hsu ◽

Rajdeep S. Khangura ◽

Pengfei Qiao ◽

...

Keyword(s):

Gene Annotation ◽

Gene Prediction ◽

Gene Tree ◽

Quality Metrics ◽

Maize Genome ◽

Protein Coding ◽

Transcript Model ◽

Manual Curation ◽

P Gene ◽

Community Curation

AbstractThe sophistication of gene prediction algorithms and the abundance of RNA-based evidence for the maize genome may suggest that manual curation of gene models is no longer necessary. However, quality metrics generated by the MAKER-P gene annotation pipeline identified 17,225 of 130,330 (13%) protein-coding transcripts in the B73 Reference Genome V4 gene set with models of low concordance to available biological evidence. Working with eight graduate students, we used the Apollo annotation editor to curate 86 transcript models flagged by quality metrics and a complimentary method using the Gramene gene tree visualizer. All of the triaged models had significant errors – including missing or extra exons, non-canonical splice sites, and incorrect UTRs. A correct transcript model existed for about 60% of genes (or transcripts) flagged by quality metrics; we attribute this to the convention of elevating the transcript with the longest coding sequence (CDS) to the canonical, or first, position. The remaining 40% of flagged genes resulted in novel annotations and represent a manual curation space of about 10% of the maize genome (~4,000 protein-coding genes). MAKER-P metrics have a specificity of 100%, and a sensitivity of 85%; the gene tree visualizer has a specificity of 100%. Together with the Apollo graphical editor, our double triage provides an infrastructure to support the community curation of eukaryotic genomes by scientists, students, and potentially even citizen scientists.

Download Full-text

Loss of critical developmental and human disease-causing genes in 58 mammals

10.1101/819169 ◽

2019 ◽

Author(s):

Yatish Turakhia ◽

Heidi I. Chen ◽

Amir Marcovitz ◽

Gill Bejerano

Keyword(s):

Evolutionary Biology ◽

Large Scale ◽

Gene Annotation ◽

Synonymous Substitution ◽

Specific Gene ◽

High Confidence ◽

Protein Coding ◽

Congenital Diseases ◽

Manual Curation ◽

Human Genes

Gene losses provide an insightful route for studying the morphological and physiological adaptations of species, but their discovery is challenging. Existing genome annotation tools and protein databases focus on annotating intact genes and do not attempt to distinguish nonfunctional genes from genes missing annotation due to sequencing and assembly artifacts. Previous attempts to annotate gene losses have required significant manual curation, which hampers their scalability for the ever-increasing deluge of newly sequenced genomes. Using extreme sequence erosion (deletion and non-synonymous substitution) as an unambiguous signature of loss, we developed an automated approach for detecting high-confidence protein-coding gene loss events across a species tree. Our approach relies solely on gene annotation in a single reference genome, raw assemblies for the remaining species to analyze, and the associated phylogenetic tree for all organisms involved. Using the hg38 human assembly as a reference, we discovered over 500 unique human genes affected by such high-confidence erosion events in different clades across 58 mammals. While most of these events likely have benign consequences, we also found dozens of clade-specific gene losses that result in early lethality in outgroup mammals or are associated with severe congenital diseases in humans. Our discoveries yield intriguing potential for translational medical genetics and for evolutionary biology, and our approach is readily applicable to large-scale genome sequencing efforts across the tree of life.

Download Full-text

GeneMark-HM: improving gene prediction in DNA sequences of human microbiome

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab047 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Alexandre Lomsadze ◽

Christophe Bonny ◽

Francesco Strozzi ◽

Mark Borodovsky

Keyword(s):

Dna Sequences ◽

De Novo ◽

Gene Annotation ◽

Selection Process ◽

Gene Prediction ◽

Bacterial Species ◽

Human Microbiome ◽

Genome Database ◽

Protein Coding ◽

Pan Genome

Abstract Computational reconstruction of nearly complete genomes from metagenomic reads may identify thousands of new uncultured candidate bacterial species. We have shown that reconstructed prokaryotic genomes along with genomes of sequenced microbial isolates can be used to support more accurate gene prediction in novel metagenomic sequences. We have proposed an approach that used three types of gene prediction algorithms and found for all contigs in a metagenome nearly optimal models of protein-coding regions either in libraries of pre-computed models or constructed de novo. The model selection process and gene annotation were done by the new GeneMark-HM pipeline. We have created a database of the species level pan-genomes for the human microbiome. To create a library of models representing each pan-genome we used a self-training algorithm GeneMarkS-2. Genes initially predicted in each contig served as queries for a fast similarity search through the pan-genome database. The best matches led to selection of the model for gene prediction. Contigs not assigned to pan-genomes were analyzed by crude, but still accurate models designed for sequences with particular GC compositions. Tests of GeneMark-HM on simulated metagenomes demonstrated improvement in gene annotation of human metagenomic sequences in comparison with the current state-of-the-art gene prediction tools.

Download Full-text

Gene prediction by multiple syntenic alignment

Journal of Integrative Bioinformatics ◽

10.1515/jib-2005-13 ◽

2005 ◽

Vol 2 (1) ◽

pp. 38-47

Author(s):

Said S. Adi ◽

Carlos E. Ferreira

Keyword(s):

Computer Program ◽

Genomic Dna ◽

Gene Prediction ◽

Genomic Sequences ◽

Prediction Problem ◽

Huge Amount ◽

Protein Coding ◽

Coding Regions ◽

Conserved Regions ◽

Similarity Information

Summary Given the increasing number of available genomic sequences, one now faces the task of identifying their functional parts, like the protein coding regions. The gene prediction problem can be addressed in several ways. One of the most promising methods makes use of similarity information between the genomic DNA and previously annotated sequences (proteins, cDNAs and ESTs). Recently, given the huge amount of newly sequenced genomes, new similarity-based methods are being successfully applied in the task of gene prediction. The so-called comparative-based methods lie in the similarities shared by regions of two evolutionary related genomic sequences. Despite the number of different gene prediction approaches in the literature, this problem remains challenging. In this paper we present a new comparative-based approach to the gene prediction problem. It is based on a syntenic alignment of three or more genomic sequences. With syntenic alignment we mean an alignment that is constructed taking into account the fact that the involved sequences include conserved regions intervened by unconserved ones. We have implemented the proposed algorithm in a computer program and confirm the validity of the approach on a benchmark including triples of human, mouse and rat genomic sequences.

Download Full-text

ASPic-GeneID: A Lightweight Pipeline for Gene Prediction and Alternative Isoforms Detection

BioMed Research International ◽

10.1155/2013/502827 ◽

2013 ◽

Vol 2013 ◽

pp. 1-11 ◽

Cited By ~ 4

Author(s):

Tyler Alioto ◽

Ernesto Picardi ◽

Roderic Guigó ◽

Graziano Pesole

Keyword(s):

Ab Initio ◽

Gene Annotation ◽

Gene Prediction ◽

Gene Prediction Program ◽

C Elegans ◽

Prediction Program ◽

First Pass ◽

Main Components ◽

Genome Projects ◽

Alternative Isoforms

New genomes are being sequenced at an increasingly rapid rate, far outpacing the rate at which manual gene annotation can be performed. Automated genome annotation is thus necessitated by this growth in genome projects; however, full-fledged annotation systems are usually home-grown and customized to a particular genome. There is thus a renewed need for accurateab initiogene prediction methods. However, it is apparent that fullyab initiomethods fall short of the required level of sensitivity and specificity for a quality annotation. Evidence in the form of expressed sequences gives the single biggest improvement in accuracy when used to inform gene predictions. Here, we present a lightweight pipeline for first-pass gene prediction on newly sequenced genomes. The two main components are ASPic, a program that derives highly accurate, albeit not necessarily complete, EST-based transcript annotations from EST alignments, and GeneID, a standard gene prediction program, which we have modified to take as evidence intron annotations. The introns output by ASPic CDS predictions is given to GeneID to constrain the exon-chaining process and produce predictions consistent with the underlying EST alignments. The pipeline was successfully tested on the entireC. elegansgenome and the 44 ENCODE human pilot regions.

Download Full-text

EnTAP: Bringing Faster and Smarter Functional Annotation to Non-Model Eukaryotic Transcriptomes

10.1101/307868 ◽

2018 ◽

Cited By ~ 5

Author(s):

Alexander J. Hart ◽

Samuel Ginzburg ◽

Muyang (Sam) Xu ◽

Cera R. Fisher ◽

Nasim Rahmatpour ◽

...

Keyword(s):

Similarity Search ◽

De Novo ◽

Gene Annotation ◽

Enrichment Analysis ◽

Orthologous Gene ◽

Protein Domain ◽

Family Assessment ◽

Ontology Term ◽

Protein Coding ◽

Functional Gene Annotation

ABSTRACTEnTAP (Eukaryotic Non-Model Transcriptome Annotation Pipeline) was designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. This software package addresses the fragmentation and related assembly issues that result in inflated transcript estimates and poor annotation rates, while focusing primarily on protein-coding transcripts. Following filters applied through assessment of true expression and frame selection, open-source tools are leveraged to functionally annotate the translated proteins. Downstream features include fast similarity search across three repositories, protein domain assignment, orthologous gene family assessment, and Gene Ontology term assignment. The final annotation integrates across multiple databases and selects an optimal assignment from a combination of weighted metrics describing similarity search score, taxonomic relationship, and informativeness. Researchers have the option to include additional filters to identify and remove contaminants, identify associated pathways, and prepare the transcripts for enrichment analysis. This fully featured pipeline is easy to install, configure, and runs significantly faster than comparable annotation packages. EnTAP is optimized to generate extensive functional information for the gene space of organisms with limited or poorly characterized genomic resources.

Download Full-text

The genome sequence of the European peacock butterfly, Aglais io (Linnaeus, 1758)

Wellcome Open Research ◽

10.12688/wellcomeopenres.17204.1 ◽

2021 ◽

Vol 6 ◽

pp. 258

Author(s):

Konrad Lohse ◽

Alexander Mackintosh ◽

Roger Vila ◽

◽

...

Keyword(s):

Genome Sequence ◽

Genome Assembly ◽

Sex Chromosome ◽

Gene Annotation ◽

Protein Coding ◽

Individual Male ◽

Protein Coding Genes ◽

A Genome ◽

Inachis Io

We present a genome assembly from an individual male Aglais io (also known as Inachis io and Nymphalis io) (the European peacock; Arthropoda; Insecta; Lepidoptera; Nymphalidae). The genome sequence is 384 megabases in span. The majority (99.91%) of the assembly is scaffolded into 31 chromosomal pseudomolecules, with the Z sex chromosome assembled. Gene annotation of this assembly on Ensembl has identified 11,420 protein coding genes.

Download Full-text

PHANOTATE: a novel approach to gene identification in phage genomes

Bioinformatics ◽

10.1093/bioinformatics/btz265 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4537-4542 ◽

Cited By ~ 24

Author(s):

Katelyn McNair ◽

Carol Zhou ◽

Elizabeth A Dinsdale ◽

Brian Souza ◽

Robert A Edwards

Keyword(s):

Gene Prediction ◽

Optimal Path ◽

Genome Structure ◽

Weighted Graph ◽

Open Reading Frames ◽

Supplementary Information ◽

Functional Protein ◽

Protein Database ◽

Protein Coding ◽

Novel Approach

Abstract Motivation Currently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design, they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, having adjacent genes that overlap and genes completely inside of other longer genes. This non-delineated genome structure makes it difficult for gene prediction using the currently available gene annotators. Here we present PHANOTATE, a novel method for gene calling specifically designed for phage genomes. Although the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths: where open reading frames are favorable, and overlaps and gaps are less favorable, but still possible. We represent this network of connections as a weighted graph, and use dynamic programing to find the optimal path. Results We compare PHANOTATE to other gene callers by annotating a set of 2133 complete phage genomes from GenBank, using PHANOTATE and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with PHANOTATE predicting more genes than the other three. We searched for these extra genes in both GenBank’s non-redundant protein database and all of the metagenomes in the sequence read archive, and found that they are present at levels that suggest that these are functional protein-coding genes. Availability and implementation https://github.com/deprekate/PHANOTATE Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

GPRED-GC: a Gene PREDiction model accounting for 5 ′- 3′ GC gradient

BMC Bioinformatics ◽

10.1186/s12859-019-3047-3 ◽

2019 ◽

Vol 20 (S15) ◽

Cited By ~ 1

Author(s):

Prapaporn Techa-Angkoon ◽

Kevin L. Childs ◽

Yanni Sun

Keyword(s):

Ab Initio ◽

Gene Annotation ◽

Gene Prediction ◽

Source Code ◽

Gc Content ◽

Prediction Tools ◽

Homologous Sequences ◽

Manual Intervention ◽

Grass Genomes ◽

Gc Contents

Abstract Background Gene is a key step in genome annotation. Ab initio gene prediction enables gene annotation of new genomes regardless of availability of homologous sequences. There exist a number of ab initio gene prediction tools and they have been widely used for gene annotation for various species. However, existing tools are not optimized for identifying genes with highly variable GC content. In addition, some genes in grass genomes exhibit a sharp 5 ′- 3′ decreasing GC content gradient, which is not carefully modeled by available gene prediction tools. Thus, there is still room to improve the sensitivity and accuracy for predicting genes with GC gradients. Results In this work, we designed and implemented a new hidden Markov model (HMM)-based ab initio gene prediction tool, which is optimized for finding genes with highly variable GC contents, such as the genes with negative GC gradients in grass genomes. We tested the tool on three datasets from Arabidopsis thaliana and Oryza sativa. The results showed that our tool can identify genes missed by existing tools due to the highly variable GC contents. Conclusions GPRED-GC can effectively predict genes with highly variable GC contents without manual intervention. It provides a useful complementary tool to existing ones such as Augustus for more sensitive gene discovery. The source code is freely available at https://sourceforge.net/projects/gpred-gc/.

Download Full-text

A fully-automated method discovers loss of mouse-lethal and human-monogenic disease genes in 58 mammals

Nucleic Acids Research ◽

10.1093/nar/gkaa550 ◽

2020 ◽

Vol 48 (16) ◽

pp. e91-e91

Author(s):

Yatish Turakhia ◽

Heidi I Chen ◽

Amir Marcovitz ◽

Gill Bejerano

Keyword(s):

Evolutionary Biology ◽

Large Scale ◽

Gene Annotation ◽

Monogenic Disease ◽

Disease Genes ◽

Congenital Diseases ◽

Manual Curation ◽

Automated Method ◽

Human Ortholog ◽

Early Mouse

Abstract Gene losses provide an insightful route for studying the morphological and physiological adaptations of species, but their discovery is challenging. Existing genome annotation tools focus on annotating intact genes and do not attempt to distinguish nonfunctional genes from genes missing annotation due to sequencing and assembly artifacts. Previous attempts to annotate gene losses have required significant manual curation, which hampers their scalability for the ever-increasing deluge of newly sequenced genomes. Using extreme sequence erosion (amino acid deletions and substitutions) and sister species support as an unambiguous signature of loss, we developed an automated approach for detecting high-confidence gene loss events across a species tree. Our approach relies solely on gene annotation in a single reference genome, raw assemblies for the remaining species to analyze, and the associated phylogenetic tree for all organisms involved. Using human as reference, we discovered over 400 unique human ortholog erosion events across 58 mammals. This includes dozens of clade-specific losses of genes that result in early mouse lethality or are associated with severe human congenital diseases. Our discoveries yield intriguing potential for translational medical genetics and evolutionary biology, and our approach is readily applicable to large-scale genome sequencing efforts across the tree of life.

Download Full-text

Genome Improvement and Core Gene Set Refinement of Fugacium kawagutii

Microorganisms ◽

10.3390/microorganisms8010102 ◽

2020 ◽

Vol 8 (1) ◽

pp. 102 ◽

Cited By ~ 3

Author(s):

Tangcheng Li ◽

Liying Yu ◽

Bo Song ◽

Yue Song ◽

Ling Li ◽

...

Keyword(s):

Gene Expression ◽

Gene Annotation ◽

Gene Prediction ◽

Acid Synthesis ◽

Core Gene ◽

Chromosome Conformation ◽

Gene Set ◽

Kegg Pathways ◽

Expression Studies ◽

Gene Expression Studies

Cataloging an accurate functional gene set for the Symbiodiniaceae species is crucial for addressing biological questions of dinoflagellate symbiosis with corals and other invertebrates. To improve the gene models of Fugacium kawagutii, we conducted high-throughput chromosome conformation capture (Hi-C) for the genome and Illumina combined with PacBio sequencing for the transcriptome to achieve a new genome assembly and gene prediction. A 0.937-Gbp assembly of F. kawagutii were obtained, with a N50 > 13 Mbp and the longest scaffold of 121 Mbp capped with telomere motif at both ends. Gene annotation produced 45,192 protein-coding genes, among which, 11,984 are new compared to previous versions of the genome. The newly identified genes are mainly enriched in 38 KEGG pathways including N-Glycan biosynthesis, mRNA surveillance pathway, cell cycle, autophagy, mitophagy, and fatty acid synthesis, which are important for symbiosis, nutrition, and reproduction. The newly identified genes also included those encoding O-methyltransferase (O-MT), 3-dehydroquinate synthase, homologous-pairing protein 2-like (HOP2) and meiosis protein 2 (MEI2), which function in mycosporine-like amino acids (MAAs) biosynthesis and sexual reproduction, respectively. The improved version of the gene set (Fugka_Geneset _V3) raised transcriptomic read mapping rate from 33% to 54% and BUSCO match from 29% to 55%. Further differential gene expression analysis yielded a set of stably expressed genes under variable trace metal conditions, of which 115 with annotated functions have recently been found to be stably expressed under three other conditions, thus further developing the “core gene set” of F. kawagutii. This improved genome will prove useful for future Symbiodiniaceae transcriptomic, gene structure, and gene expression studies, and the refined “core gene set” will be a valuable resource from which to develop reference genes for gene expression studies.

Download Full-text