An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics

Mapping Intimacies ◽

10.1101/153213 ◽

2017 ◽

Cited By ~ 1

Author(s):

Ulrich Omasits ◽

Adithi R. Varadarajan ◽

Michael Schmid ◽

Sandra Goetze ◽

Damianos Melidis ◽

...

Keyword(s):

Gene Prediction ◽

Bartonella Henselae ◽

Prokaryotic Genome ◽

GC Content ◽

Laboratory Strain ◽

Open Reading Frames ◽

General Applicability ◽

Protein Coding ◽

Prokaryotic Genomes ◽

Coding Potential

Abstract Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy towards accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics dataset against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and variants identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin, and release iPtgxDBs for B. henselae, Bradyrhizobium diazoefficiens and Escherichia coli as well as the software to generate such proteogenomics search databases for any prokaryote.
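
The idea of an unambiguous peptide (one that implies a single protein entry) can be illustrated with a minimal sketch. It assumes a simplified tryptic digest (cleave after K or R unless followed by P) and a toy database; the entry names `refseq_0001` and `prodigal_0001`, the sequences, and the `min_len` cutoff are hypothetical, and the published PeptideClassifier extension distinguishes more peptide classes than shown here.

```python
import re
from collections import defaultdict


def tryptic_peptides(protein, min_len=6):
    """Split after K or R unless the next residue is P (simplified trypsin rule)."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [piece for piece in pieces if len(piece) >= min_len]


def unambiguous_peptides(database, min_len=6):
    """Map each peptide that occurs in exactly one database entry to that entry."""
    owners = defaultdict(set)
    for protein_id, sequence in database.items():
        for peptide in tryptic_peptides(sequence, min_len):
            owners[peptide].add(protein_id)
    return {pep: next(iter(ids)) for pep, ids in owners.items() if len(ids) == 1}


# Hypothetical toy database: the second entry is an N-terminally extended
# variant of the first, as might be predicted by a different resource.
toy_database = {
    "refseq_0001": "MKTAYIAKQRLLSGPEVTR",
    "prodigal_0001": "MSLNDMKTAYIAKQRLLSGPEVTR",
}
unique_hits = unambiguous_peptides(toy_database)
```

In this toy case only the peptide covering the extended N-terminus distinguishes the two entries, which is the kind of evidence a proteogenomics search would need to support an alternative start site.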

Identification of Proteins Associated with Murine Cytomegalovirus Virions

Journal of Virology ◽

10.1128/jvi.78.20.11187-11197.2004 ◽

2004 ◽

Vol 78 (20) ◽

pp. 11187-11197 ◽

Cited By ~ 105

Author(s):

Lisa M. Kattenhorn ◽

Ryan Mills ◽

Markus Wagner ◽

Alexandre Lomsadze ◽

Vsevolod Makeev ◽

...

Keyword(s):

Gene Prediction ◽

Polyacrylamide Gel Electrophoresis ◽

Sodium Dodecyl ◽

Open Reading Frames ◽

Murine Cytomegalovirus ◽

Prediction Algorithm ◽

Sequencing Analysis ◽

Protein Coding ◽

Coding Potential ◽

Reading Frames

ABSTRACT Proteins associated with the murine cytomegalovirus (MCMV) viral particle were identified by a combined approach of proteomic and genomic methods. Purified MCMV virions were dissociated by complete denaturation and subjected to either separation by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and in-gel digestion or treated directly by in-solution tryptic digestion. Peptides were separated by nanoflow liquid chromatography and analyzed by tandem mass spectrometry (LC-MS/MS). The MS/MS spectra obtained were searched against a database of MCMV open reading frames (ORFs) predicted to be protein coding by an MCMV-specific version of the gene prediction algorithm GeneMarkS. We identified 38 proteins from the capsid, tegument, glycoprotein, replication, and immunomodulatory protein families, as well as 20 genes of unknown function. Observed irregularities in coding potential suggested possible sequence errors in the 3′-proximal ends of m20 and M31. These errors were experimentally confirmed by sequencing analysis. The MS data further indicated the presence of peptides derived from the unannotated ORFs ORFc225441-226898 (m166.5) and ORF105932-106072. Immunoblot experiments confirmed expression of m166.5 during viral infection.

Overexpression-based detection of translatable circular RNAs is vulnerable to coexistent linear RNA byproducts

10.1101/2021.03.23.433163 ◽

2021 ◽

Author(s):

Yanyi Jiang ◽

Xiaofan Chen ◽

Wei Zhang

Keyword(s):

Open Reading Frames ◽

Systematic Evaluation ◽

Circular RNAs ◽

Protein Coding ◽

Rolling Circle ◽

Functional Investigation ◽

Overexpression System ◽

Translation Signals ◽

Coding Potential ◽

Reading Frames

Abstract In the RNA field, the demarcation between coding and non-coding has been negotiated by the recent discovery of occasionally translated circular RNAs (circRNAs). Although they lack a 5’ cap structure, circRNAs can be translated cap-independently. Complementary intron-mediated overexpression is one of the most utilized methodologies for circRNA research, but it draws persistent skepticism for its poorly defined mechanism and latent coexistent side products. In this study, leveraging such a circRNA overexpression system, we have interrogated the protein-coding potential of 30 human circRNAs containing infinite open reading frames in HEK293T cells. Surprisingly, pervasive translation signals are detected by immunoblotting. However, intensive mutagenesis reveals that numerous translation signals are generated independently of circRNA synthesis. We have developed a dual tag strategy to isolate translation noise and directly demonstrate that the fallacious translation signals originate from cryptically spliced linear transcripts. The concomitant linear RNA byproducts, presumably concatemers, can be translated to allow pseudo rolling circle translation signals, and can involve the backsplicing junction (BSJ) to disqualify the BSJ-based evidence for circRNA translation. We also find that non-AUG start codons may engage in the translation initiation of circRNAs. Taken together, our systematic evaluation sheds light on heterogeneous translational outputs from circRNA overexpression vectors and comes with a caveat that the ectopic overexpression technique necessitates an extremely rigorous control setup in circRNA translation and functional investigation.

PHANOTATE: a novel approach to gene identification in phage genomes

Bioinformatics ◽

10.1093/bioinformatics/btz265 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4537-4542 ◽

Cited By ~ 24

Author(s):

Katelyn McNair ◽

Carol Zhou ◽

Elizabeth A Dinsdale ◽

Brian Souza ◽

Robert A Edwards

Keyword(s):

Gene Prediction ◽

Optimal Path ◽

Genome Structure ◽

Weighted Graph ◽

Open Reading Frames ◽

Supplementary Information ◽

Functional Protein ◽

Protein Database ◽

Protein Coding ◽

Novel Approach

Abstract Motivation: Currently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design, they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, having adjacent genes that overlap and genes completely inside of other longer genes. This non-delineated genome structure makes it difficult for gene prediction using the currently available gene annotators. Here we present PHANOTATE, a novel method for gene calling specifically designed for phage genomes. Although the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths: where open reading frames are favorable, and overlaps and gaps are less favorable, but still possible. We represent this network of connections as a weighted graph, and use dynamic programming to find the optimal path. Results: We compare PHANOTATE to other gene callers by annotating a set of 2133 complete phage genomes from GenBank, using PHANOTATE and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with PHANOTATE predicting more genes than the other three. We searched for these extra genes in both GenBank’s non-redundant protein database and all of the metagenomes in the sequence read archive, and found that they are present at levels that suggest that these are functional protein-coding genes. Availability and implementation: https://github.com/deprekate/PHANOTATE Supplementary information: Supplementary data are available at Bioinformatics online.
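
The "network of paths" idea can be sketched as a best-path search over candidate ORFs. This is only an illustration under stated assumptions, not PHANOTATE's actual scoring: it handles a single strand, skips nested ORFs, and the function name `best_gene_path`, the ORF scores, and both penalty values are hypothetical. Maximizing score minus penalties here is equivalent to a shortest-path search with negated weights.

```python
def best_gene_path(orfs, gap_penalty=0.01, overlap_penalty=0.05):
    """Pick a chain of ORFs maximizing total score minus connection penalties.

    orfs: list of (start, end, score); a higher score means more gene-like.
    Gaps and overlaps between consecutive ORFs are allowed but cost a
    per-base penalty (illustrative values, not taken from the paper).
    """
    if not orfs:
        return 0.0, []
    orfs = sorted(orfs, key=lambda orf: (orf[0], orf[1]))
    best = [0.0] * len(orfs)      # best path score ending at each ORF
    prev = [None] * len(orfs)     # predecessor index on that best path
    for j, (start_j, end_j, score_j) in enumerate(orfs):
        best[j] = score_j
        for i in range(j):
            end_i = orfs[i][1]
            if end_i >= end_j:
                continue          # this sketch does not chain nested ORFs
            distance = start_j - end_i
            if distance >= 0:
                penalty = gap_penalty * distance
            else:
                penalty = overlap_penalty * -distance
            candidate = best[i] + score_j - penalty
            if candidate > best[j]:
                best[j], prev[j] = candidate, i
    # Trace back from the highest-scoring end point.
    index = max(range(len(orfs)), key=best.__getitem__)
    total, path = best[index], []
    while index is not None:
        path.append(orfs[index])
        index = prev[index]
    return total, path[::-1]


# Hypothetical candidates: two slightly overlapping ORFs, one weak ORF
# inside the first, and one ORF after a short gap.
total_score, chosen = best_gene_path(
    [(0, 300, 10.0), (290, 600, 8.0), (100, 200, 1.0), (650, 900, 6.0)]
)
```

With these made-up numbers the small overlap and gap are cheap enough that all three strong ORFs are kept, while the weak internal ORF is left out.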

When Long Noncoding Becomes Protein Coding

Molecular and Cellular Biology ◽

10.1128/mcb.00528-19 ◽

2020 ◽

Vol 40 (6) ◽

Cited By ~ 14

Author(s):

Corrine Corrina R. Hartford ◽

Ashish Lal

Keyword(s):

Cell Division ◽

Cell Signaling ◽

Transcription Regulation ◽

Noncoding RNAs ◽

Long Noncoding RNAs ◽

Open Reading Frames ◽

Protein Coding ◽

Small Proteins ◽

Coding Potential ◽

Reading Frames

ABSTRACT Recent advancements in genetic and proteomic technologies have revealed that more of the genome encodes proteins than originally thought possible. Specifically, some putative long noncoding RNAs (lncRNAs) have been misannotated as noncoding. Numerous lncRNAs have been found to contain short open reading frames (sORFs) which have been overlooked because of their small size. Many of these sORFs encode small proteins or micropeptides with fundamental biological importance. These micropeptides can aid in diverse processes, including cell division, transcription regulation, and cell signaling. Here we discuss strategies for establishing the coding potential of putative lncRNAs and describe various functions of known micropeptides.

Long-read assembly of a Great Dane genome highlights the contribution of GC-rich sequence and mobile elements to canine genomes

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2016274118 ◽

2021 ◽

Vol 118 (11) ◽

pp. e2016274118 ◽

Cited By ~ 1

Author(s):

Julia V. Halo ◽

Amanda L. Pendleton ◽

Feichen Shen ◽

Aurélien J. Doucet ◽

Thomas Derrien ◽

...

Keyword(s):

Consensus Sequence ◽

GC Content ◽

Open Reading Frames ◽

Reference Sequence ◽

Transcription Start ◽

Protein Coding ◽

Short Interspersed Elements ◽

Technological Advances ◽

Retrotransposon Activity ◽

Great Dane

Technological advances have allowed improvements in genome reference sequence assemblies. Here, we combined long- and short-read sequence resources to assemble the genome of a female Great Dane dog. This assembly has improved continuity compared to the existing Boxer-derived (CanFam3.1) reference genome. Annotation of the Great Dane assembly identified 22,182 protein-coding gene models and 7,049 long noncoding RNAs, including 49 protein-coding genes not present in the CanFam3.1 reference. The Great Dane assembly spans the majority of sequence gaps in the CanFam3.1 reference and illustrates that 2,151 gaps overlap the transcription start site of a predicted protein-coding gene. Moreover, a subset of the resolved gaps, which have an 80.95% median GC content, localize to transcription start sites and recombination hotspots more often than expected by chance, suggesting the stable canine recombinational landscape has shaped genome architecture. Alignment of the Great Dane and CanFam3.1 assemblies identified 16,834 deletions and 15,621 insertions, as well as 2,665 deletions and 3,493 insertions located on secondary contigs. These structural variants are dominated by retrotransposon insertion/deletion polymorphisms and include 16,221 dimorphic canine short interspersed elements (SINECs) and 1,121 dimorphic long interspersed element-1 sequences (LINE-1_Cfs). Analysis of sequences flanking the 3′ end of LINE-1_Cfs (i.e., LINE-1_Cf 3′-transductions) suggests multiple retrotransposition-competent LINE-1_Cfs segregate among dog populations. Consistent with this conclusion, we demonstrate that a canine LINE-1_Cf element with intact open reading frames can retrotranspose its own RNA and that of a SINEC_Cf consensus sequence in cultured human cells, implicating ongoing retrotransposon activity as a driver of canine genetic variation.

Long antiparallel open reading frames are unlikely to be encoding essential proteins in prokaryotic genomes

10.1101/724807 ◽

2019 ◽

Author(s):

Denis Moshensky ◽

Andrei Alexeevski

Keyword(s):

Negative Selection ◽

Stop Codon ◽

Biological Significance ◽

Open Reading Frames ◽

Overlapping Genes ◽

Base Pairs ◽

Protein Coding ◽

Essential Proteins ◽

Prokaryotic Genomes ◽

Reading Frames

Abstract The origin and evolution of genes that have common base pairs (overlapping genes) are of particular interest due to their influencing each other. Especially intriguing are gene pairs with long overlaps. In prokaryotes, co-directional overlaps longer than 60 bp were shown to be nonexistent except for some instances. A few antiparallel prokaryotic genes with long overlaps were described in the literature. We have analyzed putative long antiparallel overlapping genes to determine whether open reading frames (ORFs) located opposite to genes (antiparallel ORFs) can be protein-coding genes. We have confirmed that long antiparallel ORFs (AORFs) are observed reliably to be more frequent than expected. There are 10 472 000 AORFs in 929 analyzed genomes with overlap length more than 180 bp. Stop codons on the strand opposite to the coding strand are avoided in 2 898 cases with Benjamini-Hochberg threshold 0.01. Using Ka/Ks ratio calculations, we have revealed that long AORFs do not affect the type of selection acting on genes in a vast majority of cases. This observation indicates that translations of long AORFs commonly are not under negative selection. The demonstrative example is the 282 AORFs longer than 1 800 bp found opposite to extremely conserved dnaK genes. Translations of these AORFs were annotated “glutamate dehydrogenases” and were included into the Pfam database as a third protein family of glutamate dehydrogenases, PF10712. Ka/Ks analysis has demonstrated that if these translations correspond to proteins, they are not subject to negative selection while dnaK genes are under strong stabilizing selection. Moreover, we have found other arguments against the hypothesis that these AORFs encode essential proteins, proteins indispensable for cellular machinery. However, some AORFs, in particular those related to dnaK, have been found to slightly resist synonymous changes in genes. This indicates the possibility of their translation. We speculate that translations of certain AORFs might have a functional role other than encoding essential proteins. Essential genes are unlikely to be encoded by AORFs in prokaryotic genomes. Nevertheless, some AORFs might have biological significance associated with their translations.

Author summary Genes that have common base pairs are called overlapping genes. We have examined the most intriguing case: whether gene pairs encoded on opposite DNA strands exist in prokaryotes. An intersection length threshold of 180 bp has been used. A few such pairs of genes were experimentally confirmed. We have detected all long antiparallel ORFs in 929 prokaryotic genomes and have found that the number of open reading frames located opposite to annotated genes is much greater than expected according to a statistical model. We have developed a measure of stop codon avoidance on the opposite strand. The lengths of the antiparallel ORFs found with stop codon avoidance are typical for prokaryotic genes. Comparative genomics analysis shows that long antiparallel ORFs (AORFs) are unlikely to be essential protein-coding genes. We have analyzed distributions of features typical for essential proteins among formal translations of all long AORFs: prevalence of negative selection, non-uniformity of the distribution of conserved positions in a multiple alignment of homologous proteins, and the character of the distribution of homologs in the phylogenetic tree of prokaryotes. None of them has been observed for the majority of long AORFs. Notably, the same results have been obtained for some experimentally confirmed AOGs. Thus, pairs of antiparallel overlapping essential genes are unlikely to exist. On the other hand, some antiparallel ORFs affect the evolution of the genes opposite to which they are located. Consequently, translations of some antiparallel ORFs might have yet unknown biological significance.
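
Detecting an open frame on the strand opposite a gene can be sketched as follows. This is a minimal illustration, assuming an ORF is simply a run of codons without a stop codon (no start codon is required) and the standard stop codons TAA, TAG and TGA; the function names and the toy sequence are hypothetical, and the statistical model and stop-codon-avoidance measure from the abstract are not reproduced.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_stop_free_run(seq, frame):
    """Length in bp of the longest run of non-stop codons in one frame (0, 1, 2)."""
    longest = current = 0
    for pos in range(frame, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            current = 0
        else:
            current += 3
            longest = max(longest, current)
    return longest


def antiparallel_orf_lengths(gene_seq):
    """Longest stop-free run in each frame of the strand opposite a gene."""
    opposite = reverse_complement(gene_seq.upper())
    return {frame: longest_stop_free_run(opposite, frame) for frame in range(3)}


# Toy gene, far shorter than the 180 bp overlap threshold used in the study.
lengths = antiparallel_orf_lengths("ATGAAATTTGGGCCCTAA")
long_frames = [frame for frame, bp in lengths.items() if bp > 180]
```

On real genes, frames whose stop-free run exceeds the 180 bp threshold would be the candidates examined further; the toy sequence is too short to produce any.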

Mitochondrial genome evolution of placozoans: gene rearrangements and repeat expansions

Genome Biology and Evolution ◽

10.1093/gbe/evaa213 ◽

2020 ◽

Author(s):

Hideyuki Miyazawa ◽

Hans-Jürgen Osigus ◽

Sarah Rolfes ◽

Kai Kamm ◽

Bernd Schierwater ◽

...

Keyword(s):

Mitochondrial Genome ◽

GC Content ◽

Distribution Patterns ◽

Sister Group ◽

Open Reading Frames ◽

Mitochondrial Genomes ◽

Inverted Repeats ◽

Protein Coding ◽

Group I ◽

Phylogenomic Analyses

Abstract Placozoans, non-bilaterian animals with the simplest known metazoan bauplan, are currently classified into 20 haplotypes belonging to three genera, Polyplacotoma, Trichoplax, and Hoilungia. The latter two comprise two and five clades, respectively. In Trichoplax and Hoilungia, previous studies on six haplotypes belonging to four different clades have shown that their mtDNA are circular chromosomes of 32-43 kbp in size, which encode 12 protein-coding genes, 24 tRNAs, and 2 rRNAs. These mitochondrial genomes (mitogenomes) also show unique features rarely seen in other metazoans, including open reading frames (ORFs) of unknown function, and group I and II introns. Here, we report seven new mitogenomes, covering the five previously described haplotypes H2, H17, H19, H9, and H11, as well as two new haplotypes, H23 (clade III) and H24 (clade VII). The overall gene content is shared between all placozoan mitochondrial genomes, but genome sizes, gene orders, and several exon-intron boundaries vary among clades. Phylogenomic analyses strongly support a tree topology different from previous 16S rRNA analyses, with clade VI as the sister group to all other Hoilungia clades. We found small inverted repeats in all 13 mitochondrial genomes of the Trichoplax and Hoilungia genera and evaluated their distribution patterns among haplotypes. Since P. mediterranea (H0), the sister to the remaining haplotypes, has a small mitochondrial genome with few small inverted repeats and ORFs, we hypothesized that the proliferation of inverted repeats and ORFs substantially contributed to the observed increase in the size and GC content of the Trichoplax and Hoilungia mitochondrial genomes.

Complete Genome Sequence and Comparative Analysis of Synechococcus sp. CS-601 (SynAce01), a Cold-Adapted Cyanobacterium from an Oligotrophic Antarctic Habitat

International Journal of Molecular Sciences ◽

10.3390/ijms20010152 ◽

2019 ◽

Vol 20 (1) ◽

pp. 152 ◽

Cited By ~ 5

Author(s):

Jie Tang ◽

Lian-Ming Du ◽

Yuan-Mei Liang ◽

Maurycy Daroch

Keyword(s):

Marine Environment ◽

Gene Prediction ◽

GC Content ◽

Adaptation Strategy ◽

Salt Adaptation ◽

tRNA Genes ◽

Protein Coding ◽

Single Chromosome ◽

Cold Adapted ◽

Synechococcus Sp

Marine picocyanobacteria belonging to Synechococcus are major contributors to the global carbon cycle; however, the genomic information of its cold-adapted members has been lacking to date. To fill this void, the genome of a cold-adapted planktonic cyanobacterium Synechococcus sp. CS-601 (SynAce01) has been sequenced. The genome of the strain contains a single chromosome of approximately 2.75 MBp and GC content of 63.92%. Gene prediction yielded 2984 protein coding sequences and 44 tRNA genes. The genome contained evidence of horizontal gene transfer events during its evolution. CS-601 appears as a transport generalist with some specific adaptation to an oligotrophic marine environment. It has a broad repertoire of transporters of both inorganic and organic nutrients to survive in inhospitable environments. The cold adaptation of the strain exhibited characteristics of a psychrotroph rather than psychrophile. Its salt adaptation strategy is likely to rely on the uptake and synthesis of osmolytes, like glycerol or glycine betaine. Overall, the genome reveals two distinct patterns of adaptation to the inhospitable environment of Antarctica. Adaptation to an oligotrophic marine environment is likely due to an abundance of genes, probably acquired horizontally, that are associated with increased transport of nutrients, osmolytes, and light harvesting. On the other hand, adaptations to low temperatures are likely due to prolonged evolutionary changes.

Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes

Microorganisms ◽

10.3390/microorganisms9010129 ◽

2021 ◽

Vol 9 (1) ◽

pp. 129

Author(s):

Katelyn McNair ◽

Carol L. Ecale Zhou ◽

Brian Souza ◽

Stephanie Malfatti ◽

Robert A. Edwards

Keyword(s):

Amino Acid ◽

Gene Prediction ◽

Training Model ◽

Entropy Density ◽

Open Reading Frames ◽

Initial Training ◽

Training Set ◽

Protein Coding ◽

Protein Coding Genes ◽

Reading Frames

One of the main steps in gene-finding in prokaryotes is determining which open reading frames encode for a protein, and which occur by chance alone. There are many different methods to differentiate the two; the most prevalent approach is using shared homology with a database of known genes. This method presents many pitfalls, most notably the catch that you only find genes that you have seen before. The four most popular prokaryotic gene-prediction programs (GeneMark, Glimmer, Prodigal, Phanotate) all use a protein-coding training model to predict protein-coding genes, with the latter three allowing for the training model to be created ab initio from the input genome. Different methods are available for creating the training model, and to increase the accuracy of such tools, we present here GOODORFS, a method for identifying protein-coding genes within a set of all possible open reading frames (ORFS). Our workflow begins with taking the amino acid frequencies of each ORF, calculating an entropy density profile (EDP), using KMeans to cluster the EDPs, and then selecting the cluster with the lowest variation as the coding ORFs. To test the efficacy of our method, we ran GOODORFS on 14,179 annotated phage genomes, and compared our results to the initial training-set creation step of four other similar methods (Glimmer, MED2, PHANOTATE, Prodigal). We found that GOODORFS was the most accurate (0.94) and had the best F1-score (0.85), while Glimmer had the highest precision (0.92) and PHANOTATE had the highest recall (0.96).
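
The first two steps of the workflow (amino acid frequencies, then an entropy density profile) can be sketched as below. The sketch assumes the commonly used EDP form s_i = -(p_i ln p_i) / H with H = -Σ p_i ln p_i, which may differ in detail from what GOODORFS implements; the function name and the example sequence are hypothetical. The clustering step is not shown; it could be done with a KMeans implementation such as the one in scikit-learn.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def entropy_density_profile(protein):
    """Return a 20-value profile: each amino acid's share of the Shannon entropy.

    Assumed definition: s_i = -(p_i * ln p_i) / H, where p_i is the frequency
    of amino acid i and H is the Shannon entropy of the composition.
    """
    counts = Counter(aa for aa in protein.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    terms = []
    for aa in AMINO_ACIDS:
        p = counts[aa] / total
        terms.append(-p * math.log(p) if p > 0 else 0.0)
    entropy = sum(terms)
    if entropy == 0:
        # A single repeated residue carries no entropy to distribute.
        return [0.0] * len(AMINO_ACIDS)
    return [term / entropy for term in terms]


# Hypothetical translated ORF; its profile would be one row of the matrix
# handed to the clustering step.
profile = entropy_density_profile("MKTAYIAKQRLLSGPEVTR")
```

Because the profile is normalized by the total entropy, its entries sum to 1 for any sequence with at least two distinct residues, which makes ORFs of different lengths comparable before clustering.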

THEA: A novel approach to gene identification in phage genomes

10.1101/265983 ◽

2018 ◽

Cited By ~ 3

Author(s):

Katelyn McNair ◽

Carol Zhou ◽

Brian Souza ◽

Robert A. Edwards

Keyword(s):

Gene Prediction ◽

Optimal Path ◽

Genome Structure ◽

Weighted Graph ◽

Open Reading Frames ◽

Functional Protein ◽

Protein Database ◽

Protein Coding ◽

Novel Approach ◽

Novel Method

Abstract Motivation: Currently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, having adjacent genes that overlap, and genes completely inside of other longer genes. This non-delineated genome structure makes it difficult for gene prediction using the currently available gene annotators. Here we present THEA (The Algorithm), a novel method for gene calling specifically designed for phage genomes. While the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths: where open reading frames are favorable, and overlaps and gaps are less favorable, but still possible. We represent this network of connections as a weighted graph, and use graph theory to find the optimal path. Results: We compare THEA to other gene callers by annotating a set of 2,133 complete phage genomes from GenBank, using THEA and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with THEA predicting significantly more genes than the other three. We searched for these extra genes in both GenBank’s non-redundant protein database and sequence read archive, and found that they are present at levels that suggest that these are functional protein coding genes. Availability and Implementation: The source code and all files can be found at: https://github.com/deprekate/THEA Contact: Katelyn McNair: [email protected]
