EAnnot: A genome annotation tool using experimental evidence

L. Ding

doi:10.1101/gr.3152604

Comparative Genome Annotation Systems

Advanced Data Mining Technologies in Bioinformatics ◽ 10.4018/978-1-59140-863-5.ch016 ◽ 2011 ◽ pp. 296-313

Author(s): Kwangmin Choi ◽ Sun Kim

Keyword(s): Data Mining ◽ Sequence Analysis ◽ Genome Annotation ◽ Genome Comparison ◽ Comparative Genome ◽ A Genome ◽ Multiple Genomes

Understanding the genetic content of a genome is an important but challenging task. One of the most effective ways to annotate a genome is to compare it with genomes that have already been sequenced and annotated. This chapter surveys systems that annotate genomes by comparing multiple genomes and discusses important issues in designing genome comparison systems, such as extensibility, scalability, reconfigurability, flexibility, usability, and data mining functionality. We also briefly discuss further issues in developing genome comparison systems in which users can flexibly perform genome comparison at the sequence analysis level.

SigmoID: a user-friendly tool for improving bacterial genome annotation through analysis of transcription control signals

PeerJ ◽ 10.7717/peerj.2056 ◽ 2016 ◽ Vol 4 ◽ pp. e2056 ◽ Cited By ~ 4

Author(s): Yevgeny Nikolaichik ◽ Aliaksandr U. Damienikan

Keyword(s): Transcription Factor ◽ Binding Sites ◽ Genome Annotation ◽ Bacterial Genome ◽ Transcription Factor Binding Sites ◽ Transcription Factor Binding ◽ Factor Binding ◽ A Genome ◽ Regulatory Information ◽ User Friendly

The majority of bacterial genome annotations are currently automated and based on a ‘gene by gene’ approach. Regulatory signals and operon structures are rarely taken into account, which often results in incomplete and even incorrect gene function assignments. Here we present SigmoID, a cross-platform (OS X, Linux and Windows) open-source application aiming at simplifying the identification of transcription regulatory sites (promoters, transcription factor binding sites and terminators) in bacterial genomes and providing assistance in correcting annotations in accordance with regulatory information. SigmoID combines a user-friendly graphical interface to well known command line tools with a genome browser for visualising regulatory elements in genomic context. Integrated access to online databases with regulatory information (RegPrecise and RegulonDB) and web-based search engines speeds up genome analysis and simplifies correction of genome annotation. We demonstrate some features of SigmoID by constructing a series of regulatory protein binding site profiles for two groups of bacteria: Soft Rot Enterobacteriaceae (Pectobacterium and Dickeya spp.) and Pseudomonas spp. Furthermore, we inferred over 900 transcription factor binding sites and alternative sigma factor promoters in the annotated genome of Pectobacterium atrosepticum. These regulatory signals control putative transcription units covering about 40% of the P. atrosepticum chromosome. Reviewing the annotation in cases where it didn’t fit with regulatory information allowed us to correct product and gene names for over 300 loci.

Interoperable genome annotation with GBOL, an extendable infrastructure for functional data mining

10.1101/184747 ◽ 2017 ◽ Cited By ~ 5

Author(s): Jesse C.J. van Dam ◽ Jasper J. Koehorst ◽ Jon Olav Vik ◽ Peter J. Schaap ◽ Maria Suarez-Diez

Keyword(s): Data Mining ◽ Genome Annotation ◽ Large Scale ◽ Data Provenance ◽ Formal Representation ◽ Plain Text ◽ Ontology Language ◽ Information Models ◽ A Genome ◽ Genome Annotations

Background: A standard structured format is used by the public sequence databases to present genome annotations. A prerequisite for a direct functional comparison is consistent annotation of the genetic elements with evidence statements. However, the current format provides limited support for data mining, hampering comparative analyses at large scale. Results: The provenance of a genome annotation describes the contextual details and derivation history of the process that resulted in the annotation. To enable interoperability of genome annotations, we have developed the Genome Biology Ontology Language (GBOL) and associated infrastructure (GBOL stack). GBOL is provenance aware and thus provides a consistent representation of functional genome annotations linked to the provenance. GBOL is modular in design, extendible and linked to existing ontologies. The GBOL stack of supporting tools enforces consistency within and between the GBOL definitions in the ontology (OWL) and the Shape Expressions (ShEx) language describing the graph structure. Modules have been developed to serialize the linked data (RDF) and to generate plain text format files. Conclusion: The main rationale for applying formalized information models is to improve the exchange of information. GBOL uses and extends current ontologies to provide a formal representation of genomic entities, along with their properties and relations. The deliberate integration of data provenance in the ontology enables review of automatically obtained genome annotations at a large scale. The GBOL stack facilitates consistent usage of the ontology.

First annotated genome of a mandibulate moth, Neomicropteryx cornuta, generated using PacBio HiFi sequencing

Genome Biology and Evolution ◽ 10.1093/gbe/evab229 ◽ 2021

Author(s): Xuankun Li ◽ Emily Ellis ◽ David Plotkin ◽ Yume Imada ◽ Masaya Yago ◽ ...

Keyword(s): Genome Assembly ◽ Genome Annotation ◽ Single Copy ◽ Early Evolution ◽ High Recovery ◽ High Quality ◽ Evolutionary Transitions ◽ A Genome ◽ High Quality Genome ◽ Papilio Polytes

We provide a new, annotated genome assembly of Neomicropteryx cornuta, a species of the so-called “mandibulate archaic moths” (Lepidoptera: Micropterigidae). These moths belong to a lineage that is thought to have split from all other Lepidoptera more than 300 million years ago and are consequently vital to understanding the early evolution of superorder Amphiesmenoptera, which contains the order Lepidoptera (butterflies and moths) and its sister order Trichoptera (caddisflies). Using PacBio HiFi sequencing reads, we assembled a highly-contiguous genome with a contig N50 of nearly 17 Mbp. The assembled genome length of 541,115,538 bp is about half the length of the largest published Amphiesmenoptera genome (Limnephilus lunatus, Trichoptera) and double the length of the smallest (Papilio polytes, Lepidoptera). We find high recovery of universal single copy orthologs with 98.1% of BUSCO genes present and provide a genome annotation of 15,643 genes aided by resolved isoforms from PacBio IsoSeq data. This high-quality genome assembly provides an important resource for studying ecological and evolutionary transitions in the early evolution of Amphiesmenoptera.
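The contig N50 quoted in this abstract is a standard contiguity statistic: with contigs ordered from longest to shortest, it is the length of the contig at which the running total first reaches half of the total assembly length. A minimal sketch of that computation follows; the function name `contig_n50` and the toy contig lengths are illustrative assumptions, not code or data from the paper.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)  # longest contig first
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig where the cumulative length
        # covers at least half of the whole assembly.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example (hypothetical lengths): total = 290, half = 145,
# and 80 + 70 = 150 is the first running total to reach it.
example_n50 = contig_n50([80, 70, 50, 40, 30, 20])
print(example_n50)
```

A larger N50 means that more of the assembly sits in long contigs, which is why it is commonly reported alongside completeness measures such as BUSCO recovery.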

PuMA: a papillomavirus genome annotation tool

10.1101/736991 ◽ 2019

Author(s): J. Pace ◽ K. Youens-Clark ◽ C. Freeman ◽ B. Hurwitz ◽ K. Van Doorslaer

Keyword(s): Genome Annotation ◽ High Throughput Sequencing ◽ Viral Metagenomics ◽ Annotation Tool ◽ General Applicability ◽ Viral Genomes ◽ Link Type ◽ Sequencing Technologies ◽ Reproducible Method ◽ Analytical Approaches

High-throughput sequencing technologies provide unprecedented power to identify novel viruses from a wide variety of (environmental) samples. The field of ‘viral metagenomics’ has dramatically expanded our understanding of viral diversity. Viral metagenomic approaches imply that many novel viruses will not be described by researchers who are experts on the genomic organization of that virus. There is a need to develop analytical approaches to reconstruct, annotate, and classify viral genomes. We have developed the papillomavirus annotation tool (PuMA) to provide researchers with a convenient and reproducible method to annotate novel papillomaviruses. PuMA provides an accessible method for automated papillomavirus genome annotation. PuMA currently has a 98% accuracy when benchmarked against the 481 reference genomes in the papillomavirus episteme (PaVE). Finally, PuMA was used to annotate 168 newly isolated papillomaviruses, and successfully annotated 1424 viral features. To demonstrate its general applicability, we developed a version of PuMA that can annotate polyomaviruses. PuMA is available on GitHub (https://github.com/KVD-lab/puma) and through the iMicrobe online environment (https://www.imicrobe.us/#/apps/puma).

A Method for Improving the Accuracy and Efficiency of Bacteriophage Genome Annotation

International Journal of Molecular Sciences ◽ 10.3390/ijms20143391 ◽ 2019 ◽ Vol 20 (14) ◽ pp. 3391 ◽ Cited By ~ 7

Author(s): Alicia Salisbury ◽ Philippos K. Tsourkas

Keyword(s): Decision Making ◽ Genome Annotation ◽ Phage Genome ◽ Labor Cost ◽ Additional Benefit ◽ Student Recruitment ◽ Manual Curation ◽ A Genome ◽ Bacteriophage Genome ◽ Genome Annotations

Bacteriophages are the most numerous entities on Earth. The number of sequenced phage genomes is approximately 8000 and increasing rapidly. Sequencing of a genome is followed by annotation, where genes, start codons, and functions are putatively identified. The mainstays of phage genome annotation are auto-annotation programs such as Glimmer and GeneMark. Due to the relatively small size of phage genomes, many groups choose to manually curate auto-annotation results to increase accuracy. An additional benefit of manual curation of auto-annotated phage genomes is that the process can be performed by students and has been shown to improve student recruitment to the sciences. However, despite its greater accuracy and pedagogical value, manual curation suffers from high labor cost, lack of standardization, a degree of subjectivity in decision making, and susceptibility to mistakes. Here, we present a method developed in our lab that is designed to produce accurate annotations while reducing subjectivity and providing a degree of standardization in decision-making. We show that our method produces genome annotations more accurate than auto-annotation programs while retaining the pedagogical benefits of manual genome curation.

Data mining parasite genomes

Parasitology ◽ 10.1017/s0031182004006857 ◽ 2004 ◽ Vol 128 (S1) ◽ pp. S23-S31 ◽ Cited By ~ 3

Author(s): M. Berriman

Keyword(s): Data Mining ◽ Genome Annotation ◽ Sequence Data ◽ Genome Project ◽ Coding Regions ◽ A Genome ◽ Term Data ◽ Lines Of Evidence ◽ Large Background

The term ‘data mining’ can be used to describe any process where useful information is extracted from data with a large background of ‘noise’. In the context of a genome project, several stages involve data mining. Amongst the sequence data, ‘signals’ need to be detected that indicate the presence of interesting features. Often this involves differentiating between transcribed and non-transcribed bases to predict coding regions. After detection, defining the roles of these sequences involves sifting through multiple lines of evidence. If these roles are accurately reflected in genome annotation, they can be used by researchers to frame queries and interrogate the data further.

AnnotationSketch: a genome annotation drawing library

Bioinformatics ◽ 10.1093/bioinformatics/btn657 ◽ 2008 ◽ Vol 25 (4) ◽ pp. 533-534 ◽ Cited By ~ 21

Author(s): Sascha Steinbiss ◽ Gordon Gremme ◽ Christin Schärfer ◽ Malte Mader ◽ Stefan Kurtz

Keyword(s): Genome Annotation ◽ A Genome

Joint annotation of chromatin state and chromatin conformation reveals relationships among domain types and identifies domains of cell type-specific expression

10.1101/009209 ◽ 2014 ◽ Cited By ~ 1

Author(s): Maxwell W Libbrecht ◽ Ferhat Ay ◽ Michael M Hoffman ◽ David M Gilbert ◽ Jeffrey A Bilmes ◽ ...

Keyword(s): Genome Annotation ◽ Cell Types ◽ Computational Method ◽ Data Sets ◽ Chromatin Conformation ◽ Specific Expression ◽ Regulatory Domains ◽ A Genome ◽ Cell Type Specific Expression ◽ Cell Type Specific

The genomic neighborhood of a gene influences its activity, a behavior that is attributable in part to domain-scale regulation, in which regions of hundreds or thousands of kilobases known as domains are regulated as a unit. Previous studies using genomics assays such as chromatin immunoprecipitation (ChIP)-seq and chromatin conformation capture (3C)-based assays have identified many types of regulatory domains. However, due to the difficulty of integrating genomics data sets, the relationships among these domain types are poorly understood. Semi-automated genome annotation (SAGA) algorithms facilitate human interpretation of heterogeneous collections of genomics data by simultaneously partitioning the human genome and assigning labels to the resulting genomic segments. However, existing SAGA methods can incorporate only data sets that can be expressed as a one-dimensional vector over the genome and therefore cannot integrate inherently pairwise chromatin conformation data. We developed a new computational method, called graph-based regularization (GBR), for expressing a pairwise prior that encourages certain pairs of genomic loci to receive the same label in a genome annotation. We used GBR to exploit chromatin conformation information during genome annotation by encouraging positions that are close in 3D to occupy the same type of domain. Using this approach, we produced a comprehensive model of chromatin domains in eight human cell types, thereby revealing the relationships among known domain types. Through this model, we identified clusters of tightly-regulated genes expressed in only a small number of cell types, which we term "specific expression domains." We additionally found that a subset of domain boundaries marked by promoters and CTCF motifs are consistent between cell types even when domain activity changes. 
Finally, we showed that GBR can be used for the seemingly unrelated task of transferring information from well-studied cell types to less well characterized cell types during genome annotation, making it possible to produce high-quality annotations of the hundreds of cell types with limited available data.
