Lost and Found: Re-searching and Re-scoring Proteomics Data Aids Genome Annotation and Improves Proteome Coverage

mSystems ◽  
2020 ◽  
Vol 5 (5) ◽  
Author(s):  
Patrick Willems ◽  
Igor Fijalkowski ◽  
Petra Van Damme

ABSTRACT Prokaryotic genome annotation is heavily dependent on automated gene annotation pipelines that are prone to propagate errors and underestimate genome complexity. We describe an optimized proteogenomic workflow that uses ribosome profiling (ribo-seq) and proteomic data for Salmonella enterica serovar Typhimurium to identify unannotated proteins or alternative protein forms. This data analysis encompasses the searching of cofragmenting peptides and postprocessing with extended peptide-to-spectrum quality features, including comparison to predicted fragment ion intensities. When this strategy is applied, an enhanced proteome depth is achieved, as well as greater confidence for unannotated peptide hits. We demonstrate the general applicability of our pipeline by reanalyzing public Deinococcus radiodurans data sets. Taken together, our results show that systematic reanalysis using available prokaryotic (proteome) data sets holds great promise to assist in experimentally based genome annotation. IMPORTANCE Delineation of open reading frames (ORFs) causes persistent inconsistencies in prokaryote genome annotation. We demonstrate that by advanced (re)analysis of omics data, a higher proteome coverage and sensitive detection of unannotated ORFs can be achieved, which can be exploited for conditional bacterial genome (re)annotation, which is especially relevant in view of annotating the wealth of sequenced prokaryotic genomes obtained in recent years.

Author(s):  
Patrick Willems ◽  
Igor Fijalkowski ◽  
Petra Van Damme

ABSTRACT Prokaryotic genome annotation is heavily dependent on automated gene annotation pipelines that are prone to propagate errors and underestimate genome complexity. We describe an optimized proteogenomic workflow that uses ribo-seq and proteomic data of Salmonella Typhimurium to identify unannotated proteins or alternative protein forms raised upon alternative translation initiation (i.e., N-terminal proteoforms). This data analysis encompasses the searching of co-fragmenting peptides and post-processing with extended peptide-to-spectrum quality features, including comparison to predicted fragment ion intensities. When this strategy is applied, an enhanced proteome depth is achieved as well as greater confidence for unannotated peptide hits. We demonstrate the general applicability of our pipeline by re-analyzing public Deinococcus radiodurans datasets. Taken together, systematic re-analysis using available prokaryotic (proteome) datasets holds great promise to assist in experimentally based genome annotation.


mSphere ◽  
2020 ◽  
Vol 5 (5) ◽  
Author(s):  
Brayon J. Fremin ◽  
Ami S. Bhatt

ABSTRACT Ribosome profiling (Ribo-Seq) is a powerful method to study translation in bacteria. However, Ribo-Seq signal can be observed across RNAs that one would not expect to be bound by ribosomes. For example, Escherichia coli Ribo-Seq libraries also capture reads from most noncoding RNAs (ncRNAs). While some of these ncRNAs may overlap coding regions, this alone does not explain the majority of observed signal across ncRNAs. These fragments of ncRNAs in Ribo-Seq data pass all size selection steps of the Ribo-Seq protocol and survive hours of micrococcal nuclease (MNase) treatment. In this work, we specifically focus on Ribo-Seq signal across ncRNAs and provide evidence to suggest that RNA structure, as opposed to ribosome binding, protects them from degradation and allows them to persist in the Ribo-Seq sequencing library preparation. By inspecting these “contaminant reads” in bacterial Ribo-Seq, we show that data previously disregarded in bacterial Ribo-Seq experiments may, in fact, be used to gain partial information regarding the in vivo secondary structure of ncRNAs. IMPORTANCE Structured ncRNAs are pivotal mediators of bioregulation in bacteria, and their functions are often reliant on their specific structures. Here, we first inspect Ribo-Seq reads across noncoding regions, identifying contaminant reads in these libraries. We observe that contaminant reads in bacterial Ribo-Seq experiments that are often disregarded, in fact, strongly overlap with structured regions of ncRNAs. We then perform several bioinformatic analyses to determine why these contaminant reads may persist in Ribo-Seq libraries. Finally, we highlight some structured RNA contaminants in Ribo-Seq and support the hypothesis that structures in the RNA protect them from MNase digestion. We conclude that researchers should be cautious when interpreting Ribo-Seq signal as coding without considering signal distribution. 
These findings also may enable us to partially resolve RNA structures, identify novel structured RNAs, and elucidate RNA structure-function relationships in bacteria at a large scale and in vivo through the reanalysis of existing Ribo-Seq data sets.


mSystems ◽  
2020 ◽  
Vol 5 (4) ◽  
Author(s):  
Robert A. Petit ◽  
Timothy D. Read

ABSTRACT Sequencing of bacterial genomes using Illumina technology has become such a standard procedure that often data are generated faster than can be conveniently analyzed. We created a new series of pipelines called Bactopia, built using Nextflow workflow software, to provide efficient comparative genomic analyses for bacterial species or genera. Bactopia consists of a data set setup step (Bactopia Data Sets [BaDs]), which creates a series of customizable data sets for the species of interest, the Bactopia Analysis Pipeline (BaAP), which performs quality control, genome assembly, and several other functions based on the available data sets and outputs the processed data to a structured directory format, and a series of Bactopia Tools (BaTs) that perform specific postprocessing on some or all of the processed data. BaTs include pan-genome analysis, computing average nucleotide identity between samples, extracting and profiling the 16S genes, and taxonomic classification using highly conserved genes. It is expected that the number of BaTs will increase to fill specific applications in the future. As a demonstration, we performed an analysis of 1,664 public Lactobacillus genomes, focusing on Lactobacillus crispatus, a species that is a common part of the human vaginal microbiome. Bactopia is an open source system that can scale from projects as small as one bacterial genome to ones including thousands of genomes and that allows for great flexibility in choosing comparison data sets and options for downstream analysis. Bactopia code can be accessed at https://www.github.com/bactopia/bactopia. IMPORTANCE It is now relatively easy to obtain a high-quality draft genome sequence of a bacterium, but bioinformatic analysis requires organization and optimization of multiple open source software tools. We present Bactopia, a pipeline for bacterial genome analysis, as an option for processing bacterial genome data. 
Bactopia also automates downloading of data from multiple public sources and species-specific customization. Because the pipeline is written in the Nextflow language, analyses can be scaled from individual genomes on a local computer to thousands of genomes using cloud resources. As a usage example, we processed 1,664 Lactobacillus genomes from public sources and used comparative analysis workflows (Bactopia Tools) to identify and analyze members of the L. crispatus species.


2020 ◽  
Vol 44 (4) ◽  
pp. 418-431 ◽  
Author(s):  
Daria Fijalkowska ◽  
Igor Fijalkowski ◽  
Patrick Willems ◽  
Petra Van Damme

ABSTRACT With the rapid increase in the number of sequenced prokaryotic genomes, relying on automated gene annotation became a necessity. Multiple lines of evidence, however, suggest that current bacterial genome annotations may contain inconsistencies and are incomplete, even for so-called well-annotated genomes. We here discuss underexplored sources of protein diversity and new methodologies for high-throughput genome reannotation. The expression of multiple molecular forms of proteins (proteoforms) from a single gene, particularly driven by alternative translation initiation, is gaining interest as a prominent contributor to bacterial protein diversity. In consequence, riboproteogenomic pipelines were proposed to comprehensively capture proteoform expression in prokaryotes by the complementary use of (positional) proteomics and the direct readout of translated genomic regions using ribosome profiling. To complement these discoveries, tailored strategies are required for the functional characterization of newly discovered bacterial proteoforms.


2017 ◽  
Author(s):  
Elvis Ndah ◽  
Veronique Jonckheere ◽  
Adam Giess ◽  
Eivind Valen ◽  
Gerben Menschaert ◽  
...  

ABSTRACT Prokaryotic genome annotation is highly dependent on automated methods, as manual curation cannot keep up with the exponential growth of sequenced genomes. Current automated methods depend heavily on sequence context and often underestimate the complexity of the proteome. We developed REPARATION (RibosomeE Profiling Assisted (Re-)AnnotaTION), a de novo algorithm that takes advantage of experimental protein translation evidence from ribosome profiling (Ribo-seq) to delineate translated open reading frames (ORFs) in bacteria, independent of genome annotation. REPARATION evaluates all possible ORFs in the genome and estimates minimum thresholds based on a growth curve model to screen for spurious ORFs. We applied REPARATION to three annotated bacterial species to obtain a more comprehensive mapping of their translation landscape in support of experimental data. In all cases, we identified hundreds of novel (small) ORFs including variants of previously annotated ORFs. Our predictions were supported by matching mass spectrometry (MS) proteomics data, sequence composition and conservation analysis. REPARATION is unique in that it makes use of experimental translation evidence to perform de novo ORF delineation in bacterial genomes irrespective of the sequence context of the reading frame.


2019 ◽  
Vol 201 (13) ◽  
Author(s):  
Malcolm E. Winkler ◽  
Donald A. Morrison

ABSTRACT DNA uptake by natural competence is a central process underlying the genetic plasticity, biology, and virulence of the human respiratory opportunistic pathogen Streptococcus pneumoniae. A study reported in this issue (J. Slager, R. Aprianto, and J.-W. Veening, J. Bacteriol. 201:e00780-18, https://doi.org/10.1128/JB.00780-18) combined deep-genome annotation and high-resolution transcriptome analyses to considerably extend the previous model of temporal regulation of competence at the operon and component gene levels. That extended study also provides a playbook for updating, refining, and extending genomic data sets and making them publicly available.


2018 ◽  
Vol 6 (23) ◽  
Author(s):  
Joelle K. Salazar ◽  
Lauren J. Gonsalves ◽  
Kristin M. Schill ◽  
Maria Sanchez Leon ◽  
Nathan Anderson ◽  
...  

ABSTRACT The genome of Listeria monocytogenes strain DFPST0073, isolated from imported fresh Mexican soft cheese in 2003, was sequenced using the Illumina MiSeq platform. Reads were assembled using SPAdes, and genome annotation was performed using the NCBI Prokaryotic Genome Annotation Pipeline.


mSphere ◽  
2020 ◽  
Vol 5 (4) ◽  
Author(s):  
Hailee M. Sorensen ◽  
Rebecca A. Keogh ◽  
Marcus A. Wittekind ◽  
Andrew R. Caillet ◽  
Richard E. Wiemels ◽  
...  

ABSTRACT Regulatory small RNAs (sRNAs) are known to play important roles in the Gram-positive bacterial pathogen Staphylococcus aureus; however, their existence is often overlooked, primarily because sRNA genes are absent from genome annotation files. Consequently, transcriptome sequencing (RNA-Seq)-based experimental approaches, performed using standard genome annotation files as a reference, have likely overlooked data for sRNAs. Previously, we created an updated S. aureus genome annotation file, which included annotations for 303 known sRNAs in USA300. Here, we utilized this updated reference file to reexamine publicly available RNA-Seq data sets in an attempt to recover lost information on sRNA expression, stability, and potential to encode peptides. First, we used transcriptomic data from 22 studies to identify how the expression of 303 sRNAs changed under 64 different experimental conditions. Next, we used RNA-Seq data from an RNA stability assay to identify highly stable/unstable sRNAs. We went on to reanalyze a ribosome profiling (Ribo-seq) data set to identify sRNAs that have the potential to encode peptides and to experimentally confirm the presence of three of these peptides in the USA300 background. Interestingly, one of these sRNAs/peptides, encoded at the tsr37 locus, influences the ability of S. aureus cells to autoaggregate. Finally, we reexamined two recently published in vivo RNA-Seq data sets, from the cystic fibrosis (CF) lung and a murine vaginal colonization study, and identified 29 sRNAs that may play a role in vivo. Collectively, these results can help inform future studies of these important regulatory elements in S. aureus and highlight the need for ongoing curating and updating of genome annotation files. IMPORTANCE Regulatory small RNAs (sRNAs) are a class of RNA molecules that are produced in bacterial cells but that typically do not encode proteins. Instead, they perform a variety of critical functions within the cell as RNA. 
Most bacterial genomes do not include annotations for sRNA genes, and any type of analysis that is performed using a bacterial genome as a reference will therefore overlook data for sRNAs. In this study, we reexamined hundreds of previously generated S. aureus RNA-Seq data sets and reanalyzed them to generate data for sRNAs. To do so, we utilized an updated S. aureus genome annotation file, previously generated by our group, which contains annotations for 303 sRNAs. The data generated (which were previously discarded) shed new light on sRNAs in S. aureus, most of which are unstudied, and highlight certain sRNAs that are likely to play important roles in the cell.


2020 ◽  
Author(s):  
Barbara Zehentner ◽  
Zachary Ardern ◽  
Michaela Kreitmeier ◽  
Siegfried Scherer ◽  
Klaus Neuhaus

SUMMARY The genetic code allows six reading frames at a double-stranded DNA locus, and many open reading frames (ORFs) overlap extensively with ORFs of annotated genes (e.g., at least 30 bp or having an embedded ORF). Currently, bacterial genome annotation systematically discards embedded overlapping ORFs of genes (OLGs) due to an assumed information-content constraint, and, consequently, very few OLGs are known. Here we use strand-specific RNAseq and ribosome profiling, detecting about 200 embedded or partially overlapping ORFs of gene candidates in the pathogen E. coli O157:H7 EDL933. These are typically short, many of them show clear promoter motifs as determined by Cappable-seq, indistinguishable from those of annotated genes, and are expressed at a low level. We could express most of them as stable proteins, and 49 displayed a potential phenotype. Ribosome profiling analyses in three other E. coli strains predicted between 84 and 190 embedded antisense OLGs per strain except in E. coli K-12, which is an atypical lab strain. We also found evidence of homology to annotated genes for 100 to 300 OLGs per E. coli strain investigated. Based on this evidence we suggest that bacterial OLGs deserve attention with respect to genome annotation and coding complexity of bacterial genomes. Such sequences may constitute an important coding reserve, opening up new research in genetics and evolutionary biology.
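
As a rough illustration of the six-reading-frame idea in the summary above, the sketch below enumerates ORFs on both strands and labels each one relative to an annotated gene interval. It is not the authors' pipeline. The function names (`find_orfs`, `classify_overlap`), the ATG-only start codon, and the 0-based half-open coordinates are assumptions made for this example; only the 30 bp overlap threshold and the "embedded" category come from the text.

```python
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 30) -> List[Tuple[int, int, str]]:
    """Enumerate ATG..stop ORFs in all six frames.

    Assumption: only ATG is treated as a start codon (bacteria also use
    GTG and TTG). Coordinates are 0-based, half-open, and always given
    on the forward strand; the stop codon is included in the interval.
    """
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start after the last stop
                elif start is not None and codon in STOPS:
                    end = i + 3
                    if end - start >= min_len:
                        if strand == "+":
                            orfs.append((start, end, "+"))
                        else:
                            # map reverse-strand coordinates back to forward
                            orfs.append((n - end, n - start, "-"))
                    start = None
    return orfs


def classify_overlap(orf: Tuple[int, int, str],
                     gene: Tuple[int, int],
                     min_overlap: int = 30) -> str:
    """Label an ORF as 'embedded', 'partial', or 'none' relative to a gene."""
    o_start, o_end, _ = orf
    g_start, g_end = gene
    overlap = min(o_end, g_end) - max(o_start, g_start)
    if overlap <= 0:
        return "none"
    if g_start <= o_start and o_end <= g_end:
        return "embedded"
    return "partial" if overlap >= min_overlap else "none"
```

A real analysis would combine such candidate intervals with strand-specific RNA-seq and ribosome profiling coverage, which this sketch does not model.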


2004 ◽  
Vol 101 (Supplement 3) ◽  
pp. 326-333 ◽  
Author(s):  
Klaus D. Hamm ◽  
Gunnar Surber ◽  
Michael Schmücking ◽  
Reinhard E. Wurm ◽  
Rene Aschenbach ◽  
...  

Object. Innovative new software solutions may enable image fusion to produce the desired data superposition for precise target definition and follow-up studies in radiosurgery/stereotactic radiotherapy in patients with intracranial lesions. The aim is to integrate the anatomical and functional information completely into the radiation treatment planning and to achieve an exact comparison for follow-up examinations. Special conditions and advantages of BrainLAB's fully automatic image fusion system are evaluated and described for this purpose. Methods. In 458 patients, the radiation treatment planning and some follow-up studies were performed using an automatic image fusion technique involving the use of different imaging modalities. Each fusion was visually checked and corrected as necessary. The computerized tomography (CT) scans for radiation treatment planning (slice thickness 1.25 mm), as well as stereotactic angiography for arteriovenous malformations, were acquired using head fixation with stereotactic arc or, in the case of stereotactic radiotherapy, with a relocatable stereotactic mask. Different magnetic resonance (MR) imaging sequences (T1, T2, and fluid-attenuated inversion-recovery images) and positron emission tomography (PET) scans were obtained without head fixation. Fusion results and the effects on radiation treatment planning and follow-up studies were analyzed. The precision level of the results of the automatic fusion depended primarily on the image quality, especially the slice thickness and the field homogeneity when using MR images, as well as on patient movement during data acquisition. Fully automated image fusion of different MR, CT, and PET studies was performed for each patient. Only in a few cases was it necessary to correct the fusion manually after visual evaluation. These corrections were minor and did not materially affect treatment planning. 
High-quality fusion of thin slices of a region of interest with a complete head data set could be performed easily. The target volume for radiation treatment planning could be accurately delineated using multimodal information provided by CT, MR, angiography, and PET studies. The fusion of follow-up image data sets yielded results that could be successfully compared and quantitatively evaluated. Conclusions. Depending on the quality of the originally acquired image, automated image fusion can be a very valuable tool, allowing for fast (∼ 1–2 minute) and precise fusion of all relevant data sets. Fused multimodality imaging improves the target volume definition for radiation treatment planning. High-quality follow-up image data sets should be acquired for image fusion to provide exactly comparable slices and volumetric results that will contribute to quality control.

