A Method for Improving the Accuracy and Efficiency of Bacteriophage Genome Annotation

Alicia Salisbury; Philippos K. Tsourkas

doi:10.3390/ijms20143391

International Journal of Molecular Sciences ◽

10.3390/ijms20143391 ◽

2019 ◽

Vol 20 (14) ◽

pp. 3391 ◽

Cited By ~ 7

Author(s):

Alicia Salisbury ◽

Philippos K. Tsourkas

Keyword(s):

Decision Making ◽

Genome Annotation ◽

Phage Genome ◽

Labor Cost ◽

Additional Benefit ◽

Student Recruitment ◽

Manual Curation ◽

A Genome ◽

Bacteriophage Genome ◽

Genome Annotations

Bacteriophages are the most numerous entities on Earth. The number of sequenced phage genomes is approximately 8000 and increasing rapidly. Sequencing of a genome is followed by annotation, where genes, start codons, and functions are putatively identified. The mainstays of phage genome annotation are auto-annotation programs such as Glimmer and GeneMark. Due to the relatively small size of phage genomes, many groups choose to manually curate auto-annotation results to increase accuracy. An additional benefit of manual curation of auto-annotated phage genomes is that the process can be performed by students, and has been shown to improve student recruitment to the sciences. However, despite its greater accuracy and pedagogical value, manual curation suffers from high labor cost, a lack of standardization, a degree of subjectivity in decision-making, and susceptibility to mistakes. Here, we present a method developed in our lab that is designed to produce accurate annotations while reducing subjectivity and providing a degree of standardization in decision-making. We show that our method produces genome annotations more accurate than those of auto-annotation programs while retaining the pedagogical benefits of manual genome curation.

Interoperable genome annotation with GBOL, an extendable infrastructure for functional data mining

10.1101/184747 ◽

2017 ◽

Cited By ~ 5

Author(s):

Jesse C.J. van Dam ◽

Jasper J. Koehorst ◽

Jon Olav Vik ◽

Peter J. Schaap ◽

Maria Suarez-Diez

Keyword(s):

Data Mining ◽

Genome Annotation ◽

Large Scale ◽

Data Provenance ◽

Formal Representation ◽

Plain Text ◽

Ontology Language ◽

Information Models ◽

A Genome ◽

Genome Annotations

Background: A standard structured format is used by the public sequence databases to present genome annotations. A prerequisite for a direct functional comparison is consistent annotation of the genetic elements with evidence statements. However, the current format provides limited support for data mining, hampering comparative analyses at large scale. Results: The provenance of a genome annotation describes the contextual details and derivation history of the process that resulted in the annotation. To enable interoperability of genome annotations, we have developed the Genome Biology Ontology Language (GBOL) and associated infrastructure (GBOL stack). GBOL is provenance aware and thus provides a consistent representation of functional genome annotations linked to the provenance. GBOL is modular in design, extendible and linked to existing ontologies. The GBOL stack of supporting tools enforces consistency within and between the GBOL definitions in the ontology (OWL) and the Shape Expressions (ShEx) language describing the graph structure. Modules have been developed to serialize the linked data (RDF) and to generate plain-text format files. Conclusion: The main rationale for applying formalized information models is to improve the exchange of information. GBOL uses and extends current ontologies to provide a formal representation of genomic entities, along with their properties and relations. The deliberate integration of data provenance in the ontology enables review of automatically obtained genome annotations at a large scale. The GBOL stack facilitates consistent usage of the ontology.

Uncovering transcriptional dark matter via gene annotation independent single-cell RNA sequencing analysis

Nature Communications ◽

10.1038/s41467-021-22496-3 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Michael F. Z. Wang ◽

Madhav Mantri ◽

Shao-Pei Chou ◽

Gaetano J. Scuderi ◽

David W. McKellar ◽

...

Keyword(s):

Single Cell ◽

Genome Annotation ◽

Gene Annotation ◽

Active Regions ◽

Sequencing Analysis ◽

Biologically Relevant ◽

Mole Rat ◽

Genome Annotations ◽

Cell Expression ◽

High Quality Genome

Conventional scRNA-seq expression analyses rely on the availability of a high quality genome annotation. Yet, as we show here with scRNA-seq experiments and analyses spanning human, mouse, chicken, mole rat, lemur and sea urchin, genome annotations are often incomplete, in particular for organisms that are not routinely studied. To overcome this hurdle, we created a scRNA-seq analysis routine that recovers biologically relevant transcriptional activity beyond the scope of the best available genome annotation by performing scRNA-seq analysis on any region in the genome for which transcriptional products are detected. Our tool generates a single-cell expression matrix for all transcriptionally active regions (TARs), performs single-cell TAR expression analysis to identify biologically significant TARs, and then annotates TARs using gene homology analysis. This procedure uses single-cell expression analyses as a filter to direct annotation efforts to biologically significant transcripts and thereby uncovers biology to which scRNA-seq would otherwise be in the dark.

Phage Genome Annotation Using the RAST Pipeline

Methods in Molecular Biology - Bacteriophages ◽

10.1007/978-1-4939-7343-9_17 ◽

2017 ◽

pp. 231-238 ◽

Cited By ~ 21

Author(s):

Katelyn McNair ◽

Ramy Karam Aziz ◽

Gordon D. Pusch ◽

Ross Overbeek ◽

Bas E. Dutilh ◽

...

Keyword(s):

Genome Annotation ◽

Phage Genome

Motley Crew: Overview of the Currently Available Phage Diversity

Frontiers in Microbiology ◽

10.3389/fmicb.2020.579452 ◽

2020 ◽

Vol 11 ◽

Author(s):

Nikita Zrelovs ◽

Andris Dislers ◽

Andris Kazaks

Keyword(s):

Complete Genome ◽

Phage Genome ◽

Gene Products ◽

Related Data ◽

Slowing Down ◽

The Past ◽

The World ◽

Bacteriophage Genome ◽

Biological Entities

The first complete genome sequenced at the beginning of the sequencing era was that of a phage. Since then, researchers throughout the world have been steadily describing and publishing genomes from a wide array of phages, uncovering the secrets of the most abundant and diverse biological entities known to man. Currently, we are experiencing an unprecedented rate of novel bacteriophage discovery, as shown by the fact that the number of complete bacteriophage genome entries in public sequence repositories has more than doubled in the past 3 years and is growing steadily without any sign of slowing down. The amount of publicly available phage genome-related data can be overwhelming; it has been summarized in the literature before, but such summaries quickly become out of date. Thus, the aim of this paper is to briefly outline currently available phage diversity data for public acknowledgment that could possibly encourage and stimulate future “depth” studies of particular groups of phages or their gene products.

Comparative Genome Annotation Systems

Advanced Data Mining Technologies in Bioinformatics ◽

10.4018/978-1-59140-863-5.ch016 ◽

2011 ◽

pp. 296-313

Author(s):

Kwangmin Choi ◽

Sun Kim

Keyword(s):

Data Mining ◽

Sequence Analysis ◽

Genome Annotation ◽

Genome Comparison ◽

Comparative Genome ◽

A Genome ◽

Multiple Genomes

Understanding the genetic content of a genome is a very important but challenging task. One of the most effective methods to annotate a genome is to compare it to the genomes that are already sequenced and annotated. This chapter surveys systems that can be used for annotating genomes by comparing multiple genomes and discusses important issues in designing genome comparison systems, such as extensibility, scalability, reconfigurability, flexibility, usability, and data mining functionality. We also briefly discuss further issues in developing genome comparison systems in which users can perform genome comparison flexibly at the sequence analysis level.

EAnnot: A genome annotation tool using experimental evidence

Genome Research ◽

10.1101/gr.3152604 ◽

2004 ◽

Vol 14 (12) ◽

pp. 2503-2509 ◽

Cited By ~ 9

Author(s):

L. Ding

Keyword(s):

Experimental Evidence ◽

Genome Annotation ◽

Annotation Tool ◽

A Genome

DEMETER: Efficient simultaneous curation of genome-scale reconstructions guided by experimental data and refined gene annotations

Bioinformatics ◽

10.1093/bioinformatics/btab622 ◽

2021 ◽

Author(s):

Almut Heinken ◽

Stefanía Magnúsdóttir ◽

Ronan M T Fleming ◽

Ines Thiele

Keyword(s):

Experimental Data ◽

Draft Genome ◽

Genomic Data ◽

Quality Standards ◽

Gene Annotations ◽

Manual Curation ◽

Genome Annotations ◽

Species Specific ◽

Genome Scale ◽

Cobra Toolbox

Motivation: Manual curation of genome-scale reconstructions is laborious, yet existing automated curation tools do not typically take species-specific experimental and curated genomic data into account. Results: We developed DEMETER, a COBRA Toolbox extension, that enables the efficient, simultaneous refinement of thousands of draft genome-scale reconstructions, while ensuring adherence to the quality standards in the field, agreement with available experimental data, and refinement of pathways based on manually refined genome annotations. Availability: DEMETER and tutorials are freely available at https://github.com/opencobra.

Deep genome annotation of the opportunistic human pathogen Streptococcus pneumoniae D39

10.1101/283663 ◽

2018 ◽

Cited By ~ 1

Author(s):

Jelle Slager ◽

Rieza Aprianto ◽

Jan-Willem Veening

Keyword(s):

Single Molecule ◽

Genome Annotation ◽

Antigenic Variation ◽

Current Knowledge ◽

Bacterial Genome ◽

Treatment Strategies ◽

Human Pathogens ◽

Genome Data ◽

Manual Curation ◽

Automated Tools

A precise understanding of the genomic organization into transcriptional units and their regulation is essential for our comprehension of opportunistic human pathogens and how they cause disease. Using single-molecule real-time (PacBio) sequencing we unambiguously determined the genome sequence of Streptococcus pneumoniae strain D39 and revealed several inversions previously undetected by short-read sequencing. Significantly, a chromosomal inversion results in antigenic variation of PhtD, an important surface-exposed virulence factor. We generated a new genome annotation using automated tools, followed by manual curation, reflecting the current knowledge in the field. By combining sequence-driven terminator prediction, deep paired-end transcriptome sequencing and enrichment of primary transcripts by Cappable-Seq, we mapped 1,015 transcriptional start sites and 748 termination sites. Using this new genomic map, we identified several new small RNAs (sRNAs), riboswitches (including twelve previously misidentified as sRNAs), and antisense RNAs. In total, we annotated 92 new protein-encoding genes, 39 sRNAs and 165 pseudogenes, bringing the S. pneumoniae D39 repertoire to 2,151 genetic elements. We report operon structures and observed that 9% of operons lack a 5’-UTR. The genome data is accessible in an online resource called PneumoBrowse (https://veeninglab.com/pneumobrowse) providing one of the most complete inventories of a bacterial genome to date. PneumoBrowse will accelerate pneumococcal research and the development of new prevention and treatment strategies.

A unified encyclopedia of human functional DNA elements through fully automated annotation of 164 human cell types

10.1101/086025 ◽

2016 ◽

Cited By ~ 5

Author(s):

Maxwell W. Libbrecht ◽

Oscar Rodriguez ◽

Zhiping Weng ◽

Jeffrey A. Bilmes ◽

Michael M. Hoffman ◽

...

Keyword(s):

Human Cell ◽

Genome Annotation ◽

Cell Types ◽

Regulatory Elements ◽

Activity Score ◽

Data Sets ◽

Cell Type ◽

Automated Annotation ◽

Aggregate Information ◽

Genome Annotations

Semi-automated genome annotation methods such as Segway enable understanding of chromatin activity. Here we present chromatin state annotations of 164 human cell types using 1,615 genomics data sets. To produce these annotations, we developed a fully-automated annotation strategy in which we train separate unsupervised annotation models on each cell type and use a machine learning classifier to automate the state interpretation step. Using these annotations, we developed a measure of the importance of each genomic position called the “conservation-associated activity score,” which we use to aggregate information across cell types into a multi-cell type view. The aggregated conservation-associated activity score provides a measure of importance directly attributable to a specific activity in a specific set of cell types. In contrast to evolutionary conservation, this measure is not biased to detect only elements shared with related species. Using the conservation-associated activity score, we combined all our annotations into a single, cell type-agnostic encyclopedia that catalogs all human transcriptional and regulatory elements, enabling easy and intuitive interpretation of the effect of genome variants on phenotype, such as in disease-associated, evolutionarily conserved or positively selected loci. These resources, including cell type-specific annotations, encyclopedia, and a visualization server, are available at http://noble.gs.washington.edu/proj/encyclopedia.

Author Summary: Genome annotation algorithms are an effective class of tools for understanding the function of the genome. These algorithms take as input a set of genome-wide measurements about the activity at each base pair in a given tissue, such as where a given protein is binding or how accessible the DNA is to being read by a protein. The genome is then partitioned and each segment is assigned a label such that positions with the same label exhibit similar patterns in the input data. Such annotations are widely used for many applications, such as to understand the mechanism of impact of a given genetic variant. Here we present, to our knowledge, the most comprehensive set of genome annotations created so far, encompassing 164 human cell types and including 1,615 genomics data sets. These comprehensive annotations are made possible by a strategy that automates the previous interpretation step. Furthermore, we present several methodological innovations that make these genome annotations more useful.

Semi-supervised segmentation and genome annotation

10.1101/2020.01.30.926923 ◽

2020 ◽

Author(s):

Rachel C.W. Chan ◽

Matthew McNeil ◽

Eric G. Roberts ◽

Mickaël Mendez ◽

Maxwell W. Libbrecht ◽

...

Keyword(s):

Supervised Learning ◽

Prior Knowledge ◽

Genome Annotation ◽

Whole Genome ◽

Transcription Start ◽

Transcription Start Sites ◽

Annotation Method ◽

Supervised Segmentation ◽

Unseen Data ◽

Genome Annotations

Segmentation and genome annotation methods automatically discover joint signal patterns in whole genome datasets. Previously, researchers trained these algorithms in a fully unsupervised way, with no prior knowledge of the functions of particular regions. Adding information provided by expert-created annotations to supervise training could improve the annotations created by these methods. We implemented semi-supervised learning using virtual evidence in the annotation method Segway. Additionally, we defined a positionally tolerant precision and recall metric for scoring genome annotations based on the proximity of each annotation feature to the truth set. We demonstrate semi-supervised Segway’s ability to learn patterns corresponding to provided transcription start sites on a specified supervision label, and subsequently recover other transcription start sites in unseen data on the same supervision label.
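
The abstract above does not spell out how its positionally tolerant precision and recall are computed. As a rough illustration only, one simple way to score predicted feature positions against a truth set with a distance tolerance might look like the sketch below. The function name `tolerant_precision_recall`, the `tolerance` parameter, and the matching rule are all assumptions made for this example. They are not the authors' definition and not part of Segway.

```python
from bisect import bisect_left


def tolerant_precision_recall(predicted, truth, tolerance):
    """Hypothetical tolerant scoring of positions (not the paper's exact metric).

    A predicted position counts as correct if any truth position lies within
    `tolerance` bases of it; a truth position counts as recovered if any
    predicted position lies within `tolerance` bases of it.
    """
    pred_sorted = sorted(predicted)
    truth_sorted = sorted(truth)

    def has_neighbor(pos, sorted_positions):
        # Find the first position >= pos - tolerance, then check that it is
        # also <= pos + tolerance.
        i = bisect_left(sorted_positions, pos - tolerance)
        return i < len(sorted_positions) and sorted_positions[i] <= pos + tolerance

    correct = sum(has_neighbor(p, truth_sorted) for p in pred_sorted)
    recovered = sum(has_neighbor(t, pred_sorted) for t in truth_sorted)

    # Empty inputs are scored as 0.0 here; this is an arbitrary convention.
    precision = correct / len(pred_sorted) if pred_sorted else 0.0
    recall = recovered / len(truth_sorted) if truth_sorted else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up coordinates, e.g. predicted vs. known transcription start sites.
    p, r = tolerant_precision_recall([100, 205, 900], [102, 200, 500, 510], 5)
    print(f"precision={p:.3f} recall={r:.3f}")
```

With a tolerance of 0 this reduces to exact-match precision and recall. Larger tolerances forgive small positional offsets between a predicted feature and its nearest truth feature.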
