A Method for Improving the Accuracy and Efficiency of Bacteriophage Genome Annotation

2019 ◽  
Vol 20 (14) ◽  
pp. 3391 ◽  
Author(s):  
Alicia Salisbury ◽  
Philippos K. Tsourkas

Bacteriophages are the most numerous entities on Earth. The number of sequenced phage genomes is approximately 8000 and increasing rapidly. Sequencing of a genome is followed by annotation, where genes, start codons, and functions are putatively identified. The mainstays of phage genome annotation are auto-annotation programs such as Glimmer and GeneMark. Due to the relatively small size of phage genomes, many groups choose to manually curate auto-annotation results to increase accuracy. An additional benefit of manual curation of auto-annotated phage genomes is that students can perform it, and it has been shown to improve student recruitment to the sciences. However, despite its greater accuracy and pedagogical value, manual curation suffers from high labor cost, a lack of standardization, a degree of subjectivity in decision-making, and susceptibility to mistakes. Here, we present a method developed in our lab that is designed to produce accurate annotations while reducing subjectivity and providing a degree of standardization in decision-making. We show that our method produces genome annotations that are more accurate than those of auto-annotation programs while retaining the pedagogical benefits of manual genome curation.

2017 ◽  
Author(s):  
Jesse C.J. van Dam ◽  
Jasper J. Koehorst ◽  
Jon Olav Vik ◽  
Peter J. Schaap ◽  
Maria Suarez-Diez

Background: A standard structured format is used by the public sequence databases to present genome annotations. A prerequisite for a direct functional comparison is consistent annotation of the genetic elements with evidence statements. However, the current format provides limited support for data mining, hampering comparative analyses at large scale. Results: The provenance of a genome annotation describes the contextual details and derivation history of the process that resulted in the annotation. To enable interoperability of genome annotations, we have developed the Genome Biology Ontology Language (GBOL) and associated infrastructure (GBOL stack). GBOL is provenance aware and thus provides a consistent representation of functional genome annotations linked to the provenance. GBOL is modular in design, extendible and linked to existing ontologies. The GBOL stack of supporting tools enforces consistency within and between the GBOL definitions in the ontology (OWL) and the Shape Expressions (ShEx) language describing the graph structure. Modules have been developed to serialize the linked data (RDF) and to generate plain-text format files. Conclusion: The main rationale for applying formalized information models is to improve the exchange of information. GBOL uses and extends current ontologies to provide a formal representation of genomic entities, along with their properties and relations. The deliberate integration of data provenance in the ontology enables review of automatically obtained genome annotations at a large scale. The GBOL stack facilitates consistent usage of the ontology.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Michael F. Z. Wang ◽  
Madhav Mantri ◽  
Shao-Pei Chou ◽  
Gaetano J. Scuderi ◽  
David W. McKellar ◽  
...  

Conventional scRNA-seq expression analyses rely on the availability of a high-quality genome annotation. Yet, as we show here with scRNA-seq experiments and analyses spanning human, mouse, chicken, mole rat, lemur and sea urchin, genome annotations are often incomplete, in particular for organisms that are not routinely studied. To overcome this hurdle, we created a scRNA-seq analysis routine that recovers biologically relevant transcriptional activity beyond the scope of the best available genome annotation by performing scRNA-seq analysis on any region in the genome for which transcriptional products are detected. Our tool generates a single-cell expression matrix for all transcriptionally active regions (TARs), performs single-cell TAR expression analysis to identify biologically significant TARs, and then annotates TARs using gene homology analysis. This procedure uses single-cell expression analyses as a filter to direct annotation efforts to biologically significant transcripts and thereby uncovers biology that scRNA-seq would otherwise miss.


Author(s):  
Katelyn McNair ◽  
Ramy Karam Aziz ◽  
Gordon D. Pusch ◽  
Ross Overbeek ◽  
Bas E. Dutilh ◽  
...  

2020 ◽  
Vol 11 ◽  
Author(s):  
Nikita Zrelovs ◽  
Andris Dislers ◽  
Andris Kazaks

The first complete genome sequenced at the beginning of the sequencing era was that of a phage. Since then, researchers throughout the world have been steadily describing and publishing genomes from a wide array of phages, uncovering the secrets of the most abundant and diverse biological entities known to man. Currently, we are experiencing an unprecedented rate of novel bacteriophage discovery: the number of complete bacteriophage genome entries in public sequence repositories has more than doubled in the past 3 years and is growing steadily with no sign of slowing down. The amount of publicly available phage genome-related data can be overwhelming; it has been summarized in the literature before, but such summaries quickly become out of date. Thus, the aim of this paper is to briefly outline currently available phage diversity data for public acknowledgment that could possibly encourage and stimulate future “depth” studies of particular groups of phages or their gene products.


Author(s):  
Kwangmin Choi ◽  
Sun Kim

Understanding the genetic content of a genome is a very important but challenging task. One of the most effective methods to annotate a genome is to compare it to genomes that are already sequenced and annotated. This chapter surveys systems that can be used for annotating genomes by comparing multiple genomes and discusses important issues in designing genome comparison systems, such as extensibility, scalability, reconfigurability, flexibility, usability, and data mining functionality. We also briefly discuss further issues in developing genome comparison systems where users can perform genome comparison flexibly at the sequence analysis level.


Author(s):  
Almut Heinken ◽  
Stefanía Magnúsdóttir ◽  
Ronan M T Fleming ◽  
Ines Thiele

Motivation: Manual curation of genome-scale reconstructions is laborious, yet existing automated curation tools do not typically take species-specific experimental and curated genomic data into account. Results: We developed DEMETER, a COBRA Toolbox extension that enables the efficient, simultaneous refinement of thousands of draft genome-scale reconstructions, while ensuring adherence to the quality standards in the field, agreement with available experimental data, and refinement of pathways based on manually refined genome annotations. Availability: DEMETER and tutorials are freely available at https://github.com/opencobra.


2018 ◽  
Author(s):  
Jelle Slager ◽  
Rieza Aprianto ◽  
Jan-Willem Veening

A precise understanding of the genomic organization into transcriptional units and their regulation is essential for our comprehension of opportunistic human pathogens and how they cause disease. Using single-molecule real-time (PacBio) sequencing we unambiguously determined the genome sequence of Streptococcus pneumoniae strain D39 and revealed several inversions previously undetected by short-read sequencing. Significantly, a chromosomal inversion results in antigenic variation of PhtD, an important surface-exposed virulence factor. We generated a new genome annotation using automated tools, followed by manual curation, reflecting the current knowledge in the field. By combining sequence-driven terminator prediction, deep paired-end transcriptome sequencing and enrichment of primary transcripts by Cappable-Seq, we mapped 1,015 transcriptional start sites and 748 termination sites. Using this new genomic map, we identified several new small RNAs (sRNAs), riboswitches (including twelve previously misidentified as sRNAs), and antisense RNAs. In total, we annotated 92 new protein-encoding genes, 39 sRNAs and 165 pseudogenes, bringing the S. pneumoniae D39 repertoire to 2,151 genetic elements. We report operon structures and observed that 9% of operons lack a 5’-UTR. The genome data is accessible in an online resource called PneumoBrowse (https://veeninglab.com/pneumobrowse) providing one of the most complete inventories of a bacterial genome to date. PneumoBrowse will accelerate pneumococcal research and the development of new prevention and treatment strategies.


2016 ◽  
Author(s):  
Maxwell W. Libbrecht ◽  
Oscar Rodriguez ◽  
Zhiping Weng ◽  
Jeffrey A. Bilmes ◽  
Michael M. Hoffman ◽  
...  

Semi-automated genome annotation methods such as Segway enable understanding of chromatin activity. Here we present chromatin state annotations of 164 human cell types using 1,615 genomics data sets. To produce these annotations, we developed a fully-automated annotation strategy in which we train separate unsupervised annotation models on each cell type and use a machine learning classifier to automate the state interpretation step. Using these annotations, we developed a measure of the importance of each genomic position called the “conservation-associated activity score,” which we use to aggregate information across cell types into a multi-cell type view. The aggregated conservation-associated activity score provides a measure of importance directly attributable to a specific activity in a specific set of cell types. In contrast to evolutionary conservation, this measure is not biased to detect only elements shared with related species. Using the conservation-associated activity score, we combined all our annotations into a single, cell type-agnostic encyclopedia that catalogs all human transcriptional and regulatory elements, enabling easy and intuitive interpretation of the effect of genome variants on phenotype, such as in disease-associated, evolutionarily conserved or positively selected loci. These resources, including cell type-specific annotations, encyclopedia, and a visualization server, are available at http://noble.gs.washington.edu/proj/encyclopedia.

Author Summary: Genome annotation algorithms are an effective class of tools for understanding the function of the genome. These algorithms take as input a set of genome-wide measurements about the activity at each base pair in a given tissue, such as where a given protein is binding or how accessible the DNA is to being read by a protein. The genome is then partitioned and each segment is assigned a label such that positions with the same label exhibit similar patterns in the input data. Such annotations are widely used for many applications, such as to understand the mechanism of impact of a given genetic variant. Here we present, to our knowledge, the most comprehensive set of genome annotations created so far, encompassing 164 human cell types and including 1,615 genomics data sets. These comprehensive annotations are made possible by a strategy that automates the previous interpretation step. Furthermore, we present several methodological innovations that make these genome annotations more useful.


2020 ◽  
Author(s):  
Rachel C.W. Chan ◽  
Matthew McNeil ◽  
Eric G. Roberts ◽  
Mickaël Mendez ◽  
Maxwell W. Libbrecht ◽  
...  

Segmentation and genome annotation methods automatically discover joint signal patterns in whole genome datasets. Previously, researchers trained these algorithms in a fully unsupervised way, with no prior knowledge of the functions of particular regions. Adding information provided by expert-created annotations to supervise training could improve the annotations created by these methods. We implemented semi-supervised learning using virtual evidence in the annotation method Segway. Additionally, we defined a positionally tolerant precision and recall metric for scoring genome annotations based on the proximity of each annotation feature to the truth set. We demonstrate semi-supervised Segway’s ability to learn patterns corresponding to provided transcription start sites on a specified supervision label, and subsequently recover other transcription start sites in unseen data on the same supervision label.

