THEA: A novel approach to gene identification in phage genomes

Mapping Intimacies ◽

10.1101/265983 ◽

2018 ◽

Cited By ~ 3

Author(s):

Katelyn McNair ◽

Carol Zhou ◽

Brian Souza ◽

Robert A. Edwards

Keyword(s):

Gene Prediction ◽

Optimal Path ◽

Genome Structure ◽

Weighted Graph ◽

Open Reading Frames ◽

Functional Protein ◽

Protein Database ◽

Protein Coding ◽

Novel Approach ◽

Novel Method

AbstractMotivationCurrently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, having adjacent genes that overlap, and genes completely inside of other longer genes. This non-delineated genome structure makes it difficult for gene prediction using the currently available gene annotators. Here we present THEA (The Algorithm), a novel method for gene calling specifically designed for phage genomes. While the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths: where open reading frames are favorable, and overlaps and gaps are less favorable, but still possible. We represent this network of connections as a weighted graph, and use graph theory to find the optimal path.ResultsWe compare THEA to other gene callers by annotating a set of 2,133 complete phage genomes from GenBank, using THEA and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with THEA predicting significantly more genes than the other three. We searched for these extra genes in both GenBank’s non-redundant protein database and sequence read archive, and found that they are present at levels that suggest that these are functional protein coding genes.Availability and ImplementationThe source code and all files can be found at: https://github.com/deprekate/THEAContactKatelyn McNair: [email protected]

Download Full-text

PHANOTATE: a novel approach to gene identification in phage genomes

Bioinformatics ◽

10.1093/bioinformatics/btz265 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4537-4542 ◽

Cited By ~ 24

Author(s):

Katelyn McNair ◽

Carol Zhou ◽

Elizabeth A Dinsdale ◽

Brian Souza ◽

Robert A Edwards

Keyword(s):

Gene Prediction ◽

Optimal Path ◽

Genome Structure ◽

Weighted Graph ◽

Open Reading Frames ◽

Supplementary Information ◽

Functional Protein ◽

Protein Database ◽

Protein Coding ◽

Novel Approach

Abstract Motivation Currently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design, they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, having adjacent genes that overlap and genes completely inside of other longer genes. This non-delineated genome structure makes it difficult for gene prediction using the currently available gene annotators. Here we present PHANOTATE, a novel method for gene calling specifically designed for phage genomes. Although the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths: where open reading frames are favorable, and overlaps and gaps are less favorable, but still possible. We represent this network of connections as a weighted graph, and use dynamic programing to find the optimal path. Results We compare PHANOTATE to other gene callers by annotating a set of 2133 complete phage genomes from GenBank, using PHANOTATE and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with PHANOTATE predicting more genes than the other three. We searched for these extra genes in both GenBank’s non-redundant protein database and all of the metagenomes in the sequence read archive, and found that they are present at levels that suggest that these are functional protein-coding genes. Availability and implementation https://github.com/deprekate/PHANOTATE Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Identification of Proteins Associated with Murine Cytomegalovirus Virions

Journal of Virology ◽

10.1128/jvi.78.20.11187-11197.2004 ◽

2004 ◽

Vol 78 (20) ◽

pp. 11187-11197 ◽

Cited By ~ 105

Author(s):

Lisa M. Kattenhorn ◽

Ryan Mills ◽

Markus Wagner ◽

Alexandre Lomsadze ◽

Vsevolod Makeev ◽

...

Keyword(s):

Gene Prediction ◽

Polyacrylamide Gel Electrophoresis ◽

Sodium Dodecyl ◽

Open Reading Frames ◽

Murine Cytomegalovirus ◽

Prediction Algorithm ◽

Sequencing Analysis ◽

Protein Coding ◽

Coding Potential ◽

Reading Frames

ABSTRACT Proteins associated with the murine cytomegalovirus (MCMV) viral particle were identified by a combined approach of proteomic and genomic methods. Purified MCMV virions were dissociated by complete denaturation and subjected to either separation by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and in-gel digestion or treated directly by in-solution tryptic digestion. Peptides were separated by nanoflow liquid chromatography and analyzed by tandem mass spectrometry (LC-MS/MS). The MS/MS spectra obtained were searched against a database of MCMV open reading frames (ORFs) predicted to be protein coding by an MCMV-specific version of the gene prediction algorithm GeneMarkS. We identified 38 proteins from the capsid, tegument, glycoprotein, replication, and immunomodulatory protein families, as well as 20 genes of unknown function. Observed irregularities in coding potential suggested possible sequence errors in the 3′-proximal ends of m20 and M31. These errors were experimentally confirmed by sequencing analysis. The MS data further indicated the presence of peptides derived from the unannotated ORFs ORFc225441-226898 (m166.5) and ORF105932-106072. Immunoblot experiments confirmed expression of m166.5 during viral infection.

Download Full-text

Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes

Microorganisms ◽

10.3390/microorganisms9010129 ◽

2021 ◽

Vol 9 (1) ◽

pp. 129

Author(s):

Katelyn McNair ◽

Carol L. Ecale Zhou ◽

Brian Souza ◽

Stephanie Malfatti ◽

Robert A. Edwards

Keyword(s):

Amino Acid ◽

Gene Prediction ◽

Training Model ◽

Entropy Density ◽

Open Reading Frames ◽

Initial Training ◽

Training Set ◽

Protein Coding ◽

Protein Coding Genes ◽

Reading Frames

One of the main steps in gene-finding in prokaryotes is determining which open reading frames encode for a protein, and which occur by chance alone. There are many different methods to differentiate the two; the most prevalent approach is using shared homology with a database of known genes. This method presents many pitfalls, most notably the catch that you only find genes that you have seen before. The four most popular prokaryotic gene-prediction programs (GeneMark, Glimmer, Prodigal, Phanotate) all use a protein-coding training model to predict protein-coding genes, with the latter three allowing for the training model to be created ab initio from the input genome. Different methods are available for creating the training model, and to increase the accuracy of such tools, we present here GOODORFS, a method for identifying protein-coding genes within a set of all possible open reading frames (ORFS). Our workflow begins with taking the amino acid frequencies of each ORF, calculating an entropy density profile (EDP), using KMeans to cluster the EDPs, and then selecting the cluster with the lowest variation as the coding ORFs. To test the efficacy of our method, we ran GOODORFS on 14,179 annotated phage genomes, and compared our results to the initial training-set creation step of four other similar methods (Glimmer, MED2, PHANOTATE, Prodigal). We found that GOODORFS was the most accurate (0.94) and had the best F1-score (0.85), while Glimmer had the highest precision (0.92) and PHANOTATE had the highest recall (0.96).

Download Full-text

CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats

Database ◽

10.1093/database/baaa088 ◽

2020 ◽

Vol 2020 ◽

Author(s):

Alejandro Rubio ◽

Pablo Mier ◽

Miguel A Andrade-Navarro ◽

Andrés Garzón ◽

Juan Jiménez ◽

...

Keyword(s):

Gene Prediction ◽

Biological Sequences ◽

Protein Database ◽

Computational Tools ◽

Protein Coding ◽

Homology Searching ◽

Secondary Analyses ◽

Source Of Error ◽

Gene Finder ◽

Erroneous Data

Abstract The genomics era is resulting in the generation of a plethora of biological sequences that are usually stored in public databases. There are many computational tools that facilitate the annotation of these sequences, but sometimes they produce mistakes that enter the databases and can be propagated when erroneous data are used for secondary analyses, such as gene prediction or homology searching. While developing a computational gene finder based on protein-coding sequences, we discovered that the reference UniProtKB protein database is contaminated with some spurious sequences translated from DNA containing clustered regularly interspaced short palindromic repeats. We therefore encourage developers of prokaryotic computational gene finders and protein database curators to consider this source of error.

Download Full-text

An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics

10.1101/153213 ◽

2017 ◽

Cited By ~ 1

Author(s):

Ulrich Omasits ◽

Adithi R. Varadarajan ◽

Michael Schmid ◽

Sandra Goetze ◽

Damianos Melidis ◽

...

Keyword(s):

Gene Prediction ◽

Bartonella Henselae ◽

Prokaryotic Genome ◽

Gc Content ◽

Laboratory Strain ◽

Open Reading Frames ◽

General Applicability ◽

Protein Coding ◽

Prokaryotic Genomes ◽

Coding Potential

AbstractAccurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations.Our strategy towards accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources,ab initiogene prediction algorithms andin silicoORFs in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensiveBartonella henselaeproteomics dataset against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and variants identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin, and release iPtgxDBs forB. henselae,Bradyrhozibium diazoefficiensandEscherichia colias well as the software to generate such proteogenomics search databases for any prokaryote.

Download Full-text

BRAKER2: Automatic Eukaryotic Genome Annotation with GeneMark-EP+ and AUGUSTUS Supported by a Protein Database

10.1101/2020.08.10.245134 ◽

2020 ◽

Cited By ~ 5

Author(s):

Tomáš Brůna ◽

Katharina J. Hoff ◽

Alexandre Lomsadze ◽

Mario Stanke ◽

Mark Borodovsky

Keyword(s):

Genome Annotation ◽

Prediction Accuracy ◽

Gene Prediction ◽

Accurate Method ◽

Eukaryotic Genome ◽

Structural Annotation ◽

Protein Database ◽

Protein Coding ◽

Annotation Pipeline ◽

The One

AbstractFull automation of gene prediction has become an important bioinformatics task since the advent of next generation sequencing. The eukaryotic genome annotation pipeline BRAKER1 had combined self-training GeneMark-ET with AUGUSTUS to generate genes’ coordinates with support of transcriptomic data. Here, we introduce BRAKER2, a pipeline with GeneMark-EP+ and AUGUSTUS externally supported by cross-species protein sequences aligned to the genome. Among the challenges addressed in the development of the new pipeline was generation of reliable hints to the locations of protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. Under equal conditions, the gene prediction accuracy of BRAKER2 was shown to be higher than the one of MAKER2, yet another genome annotation pipeline. Also, in comparison with BRAKER1 supported by a large volume of transcript data, BRAKER2 could produce a better gene prediction accuracy if the evolutionary distances to the reference species in the protein database were rather small. All over, our tests demonstrated that fully automatic BRAKER2 is a fast and accurate method for structural annotation of novel eukaryotic genomes.

Download Full-text

Translation of 5′ leaders is pervasive in genes resistant to eIF2 repression

eLife ◽

10.7554/elife.03971 ◽

2015 ◽

Vol 4 ◽

Cited By ~ 171

Author(s):

Dmitry E Andreev ◽

Patrick BF O'Connor ◽

Ciara Fahey ◽

Elaine M Kenny ◽

Ilya M Terenin ◽

...

Keyword(s):

Ribosome Profiling ◽

Open Reading Frames ◽

Translation Initiation Factor ◽

Upstream Open Reading Frame ◽

Initiation Factor ◽

Translation Control ◽

Functional Protein ◽

Protein Coding ◽

Reading Frame ◽

Initiation Factor 2

Eukaryotic cells rapidly reduce protein synthesis in response to various stress conditions. This can be achieved by the phosphorylation-mediated inactivation of a key translation initiation factor, eukaryotic initiation factor 2 (eIF2). However, the persistent translation of certain mRNAs is required for deployment of an adequate stress response. We carried out ribosome profiling of cultured human cells under conditions of severe stress induced with sodium arsenite. Although this led to a 5.4-fold general translational repression, the protein coding open reading frames (ORFs) of certain individual mRNAs exhibited resistance to the inhibition. Nearly all resistant transcripts possess at least one efficiently translated upstream open reading frame (uORF) that represses translation of the main coding ORF under normal conditions. Site-specific mutagenesis of two identified stress resistant mRNAs (PPP1R15B and IFRD1) demonstrated that a single uORF is sufficient for eIF2-mediated translation control in both cases. Phylogenetic analysis suggests that at least two regulatory uORFs (namely, in SLC35A4 and MIEF1) encode functional protein products.

Download Full-text

A novel approach to the computation of one-loop three- and four-point functions: I. The real mass case

Progress of Theoretical and Experimental Physics ◽

10.1093/ptep/ptz114 ◽

2019 ◽

Vol 2019 (11) ◽

Cited By ~ 4

Author(s):

J Ph Guillet ◽

E Pilon ◽

Y Shimizu ◽

M S Zidi

Keyword(s):

Building Blocks ◽

The Real ◽

Alternative Method ◽

Novel Approach ◽

Reduction Methods ◽

Real Mass ◽

Novel Method ◽

Algebraic Reduction ◽

The One ◽

Mass Case

Abstract This article is the first of a series of three presenting an alternative method of computing the one-loop scalar integrals. This novel method enjoys a couple of interesting features as compared with the method closely following ’t Hooft and Veltman adopted previously. It directly proceeds in terms of the quantities driving algebraic reduction methods. It applies to the three-point functions and, in a similar way, to the four-point functions. It also extends to complex masses without much complication. Lastly, it extends to kinematics more general than that of the physical, e.g., collider processes relevant at one loop. This last feature may be useful when considering the application of this method beyond one loop using generalized one-loop integrals as building blocks.

Download Full-text

Long-read cDNA sequencing identifies functional pseudogenes in the human transcriptome

Genome Biology ◽

10.1186/s13059-021-02369-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Robin-Lee Troskie ◽

Yohaann Jafrani ◽

Tim R. Mercer ◽

Adam D. Ewing ◽

Geoffrey J. Faulkner ◽

...

Keyword(s):

Cultured Cells ◽

Open Reading Frames ◽

Cdna Sequencing ◽

Protein Coding ◽

Dynamic Component ◽

Gene Copies ◽

Long Read ◽

Normal Human ◽

Reading Frames ◽

Transcriptional Landscape

AbstractPseudogenes are gene copies presumed to mainly be functionless relics of evolution due to acquired deleterious mutations or transcriptional silencing. Using deep full-length PacBio cDNA sequencing of normal human tissues and cancer cell lines, we identify here hundreds of novel transcribed pseudogenes expressed in tissue-specific patterns. Some pseudogene transcripts have intact open reading frames and are translated in cultured cells, representing unannotated protein-coding genes. To assess the biological impact of noncoding pseudogenes, we CRISPR-Cas9 delete the nucleus-enriched pseudogene PDCL3P4 and observe hundreds of perturbed genes. This study highlights pseudogenes as a complex and dynamic component of the human transcriptional landscape.

Download Full-text

Disrupting upstream translation in mRNAs is associated with human disease

Nature Communications ◽

10.1038/s41467-021-21812-1 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

David S. M. Lee ◽

Joseph Park ◽

Andrew Kromer ◽

Aris Baras ◽

Daniel J. Rader ◽

...

Keyword(s):

Protein Expression ◽

Biological Significance ◽

Ribosome Profiling ◽

Open Reading Frames ◽

Protein Coding ◽

Stop Codons ◽

Human Genes ◽

Strong Negative Selection ◽

Disease Associations ◽

Reading Frames

AbstractRibosome-profiling has uncovered pervasive translation in non-canonical open reading frames, however the biological significance of this phenomenon remains unclear. Using genetic variation from 71,702 human genomes, we assess patterns of selection in translated upstream open reading frames (uORFs) in 5’UTRs. We show that uORF variants introducing new stop codons, or strengthening existing stop codons, are under strong negative selection comparable to protein-coding missense variants. Using these variants, we map and validate gene-disease associations in two independent biobanks containing exome sequencing from 10,900 and 32,268 individuals, respectively, and elucidate their impact on protein expression in human cells. Our results suggest translation disrupting mechanisms relating uORF variation to reduced protein expression, and demonstrate that translation at uORFs is genetically constrained in 50% of human genes.

Download Full-text