Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes

Katelyn McNair; Carol L. Ecale Zhou; Brian Souza; Stephanie Malfatti; Robert A. Edwards

doi:10.3390/microorganisms9010129

Microorganisms ◽

10.3390/microorganisms9010129 ◽

2021 ◽

Vol 9 (1) ◽

pp. 129

Author(s):

Katelyn McNair ◽

Carol L. Ecale Zhou ◽

Brian Souza ◽

Stephanie Malfatti ◽

Robert A. Edwards

Keyword(s):

Amino Acid ◽

Gene Prediction ◽

Training Model ◽

Entropy Density ◽

Open Reading Frames ◽

Initial Training ◽

Training Set ◽

Protein Coding ◽

Protein Coding Genes ◽

Reading Frames

One of the main steps in prokaryotic gene finding is determining which open reading frames encode a protein and which occur by chance alone. There are many methods to differentiate the two; the most prevalent approach relies on shared homology with a database of known genes. This method has many pitfalls, most notably that it can only find genes similar to those already seen. The four most popular prokaryotic gene-prediction programs (GeneMark, Glimmer, Prodigal, PHANOTATE) all use a protein-coding training model to predict protein-coding genes, with the latter three allowing the training model to be created ab initio from the input genome. Different methods are available for creating the training model, and to increase the accuracy of such tools, we present here GOODORFS, a method for identifying protein-coding genes within a set of all possible open reading frames (ORFs). Our workflow takes the amino acid frequencies of each ORF, calculates an entropy density profile (EDP), uses KMeans to cluster the EDPs, and then selects the cluster with the lowest variation as the coding ORFs. To test the efficacy of our method, we ran GOODORFS on 14,179 annotated phage genomes and compared our results to the initial training-set creation step of four other similar methods (Glimmer, MED2, PHANOTATE, Prodigal). We found that GOODORFS was the most accurate (0.94) and had the best F1-score (0.85), while Glimmer had the highest precision (0.92) and PHANOTATE had the highest recall (0.96).
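The workflow in this abstract (amino acid frequencies, then an EDP, then clustering, then picking the least variable cluster) can be sketched roughly as follows. This is an illustrative sketch, not the GOODORFS implementation. The abstract does not give the EDP formula, so the code assumes the common definition s_i = -p_i log(p_i) / H, where H is the Shannon entropy of the frequencies. It also substitutes a plain Lloyd's k-means written with the standard library, and it assumes "variation" means the mean squared distance of cluster members to their centroid. The function names and the default `k=2` are hypothetical.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def entropy_density_profile(protein):
    """Return a 20-dimensional EDP (assumed form: -p*log(p) / H)."""
    counts = Counter(aa for aa in protein if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    freqs = [counts[aa] / total for aa in AMINO_ACIDS]
    # Per-residue entropy terms; 0 * log(0) is treated as 0.
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)
    if entropy == 0:  # only one residue type: profile is undefined, use zeros
        return [0.0] * len(AMINO_ACIDS)
    return [t / entropy for t in terms]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_of(members):
    return [sum(col) / len(members) for col in zip(*members)]


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = centroid_of(members)
    return labels


def lowest_variation_cluster(points, labels):
    """Label of the cluster with the smallest mean squared distance to its centroid."""
    best_label, best_var = None, math.inf
    for c in sorted(set(labels)):
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = centroid_of(members)
        var = sum(squared_distance(p, center) for p in members) / len(members)
        if var < best_var:
            best_label, best_var = c, var
    return best_label


def select_coding_orfs(proteins, k=2):
    """Indices of the translated ORFs that fall in the least variable EDP cluster."""
    profiles = [entropy_density_profile(p) for p in proteins]
    labels = kmeans(profiles, k)
    chosen = lowest_variation_cluster(profiles, labels)
    return [i for i, lab in enumerate(labels) if lab == chosen]
```

Calling `select_coding_orfs(translated_orfs)` on a list of translated ORF strings returns the indices of the candidate coding set. One caveat of this simplified variance criterion is that a cluster with a single member has zero variation and would always be selected, so a real pipeline would presumably need a guard against that case.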

Identification of Proteins Associated with Murine Cytomegalovirus Virions

Journal of Virology ◽

10.1128/jvi.78.20.11187-11197.2004 ◽

2004 ◽

Vol 78 (20) ◽

pp. 11187-11197 ◽

Cited By ~ 105

Author(s):

Lisa M. Kattenhorn ◽

Ryan Mills ◽

Markus Wagner ◽

Alexandre Lomsadze ◽

Vsevolod Makeev ◽

...

Keyword(s):

Gene Prediction ◽

Polyacrylamide Gel Electrophoresis ◽

Sodium Dodecyl ◽

Open Reading Frames ◽

Murine Cytomegalovirus ◽

Prediction Algorithm ◽

Sequencing Analysis ◽

Protein Coding ◽

Coding Potential ◽

Reading Frames

Proteins associated with the murine cytomegalovirus (MCMV) viral particle were identified by a combined approach of proteomic and genomic methods. Purified MCMV virions were dissociated by complete denaturation and subjected to either separation by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and in-gel digestion or treated directly by in-solution tryptic digestion. Peptides were separated by nanoflow liquid chromatography and analyzed by tandem mass spectrometry (LC-MS/MS). The MS/MS spectra obtained were searched against a database of MCMV open reading frames (ORFs) predicted to be protein coding by an MCMV-specific version of the gene prediction algorithm GeneMarkS. We identified 38 proteins from the capsid, tegument, glycoprotein, replication, and immunomodulatory protein families, as well as 20 genes of unknown function. Observed irregularities in coding potential suggested possible sequence errors in the 3′-proximal ends of m20 and M31. These errors were experimentally confirmed by sequencing analysis. The MS data further indicated the presence of peptides derived from the unannotated ORFs ORFc225441-226898 (m166.5) and ORF105932-106072. Immunoblot experiments confirmed expression of m166.5 during viral infection.

TA, GT and AC are significantly under-represented in open reading frames of prokaryotic and eukaryotic protein-coding genes

Molecular Genetics and Genomics ◽

10.1007/s00438-019-01535-1 ◽

2019 ◽

Vol 294 (3) ◽

pp. 637-647 ◽

Cited By ~ 2

Author(s):

Yong Wang ◽

Zhen Zeng ◽

Tian-Lei Liu ◽

Ling Sun ◽

Qin Yao ◽

...

Keyword(s):

Open Reading Frames ◽

Protein Coding ◽

Eukaryotic Protein ◽

Protein Coding Genes ◽

Reading Frames

The complete mitochondrial genome of the Dutch elm disease fungus Ophiostoma novo-ulmi subsp. novo-ulmi

Canadian Journal of Microbiology ◽

10.1139/cjm-2017-0605 ◽

2018 ◽

Vol 64 (5) ◽

pp. 339-348 ◽

Cited By ~ 3

Author(s):

Talal George Abboud ◽

Abdullah Zubaer ◽

Alvan Wai ◽

Georg Hausner

Keyword(s):

Mitochondrial Genome ◽

Complete Mitochondrial Genome ◽

Mitochondrial Protein ◽

Open Reading Frames ◽

Dutch Elm Disease ◽

Protein Coding ◽

Protein Coding Genes ◽

Statistical Support ◽

Size Variability ◽

Reading Frames

Ophiostoma novo-ulmi, a member of the Ophiostomatales (Ascomycota), is the causal agent of the current Dutch elm disease pandemic in Europe and North America. The complete mitochondrial genome (mtDNA) of Ophiostoma novo-ulmi subsp. novo-ulmi, the European component of O. novo-ulmi, has been sequenced and annotated. Gene order (synteny) among the currently available members of the Ophiostomatales was examined and appears to be conserved, and mtDNA size variability among the Ophiostomatales is due in part to the presence of introns and their encoded open reading frames. Phylogenetic analysis of concatenated mitochondrial protein-coding genes yielded phylogenetic estimates for various members of the Ophiostomatales, with strong statistical support showing that mtDNA analysis may provide valuable insights into the evolution of the Ophiostomatales.

The Growth-Arrest-Specific (GAS)-5 Long Non-Coding RNA: A Fascinating lncRNA Widely Expressed in Cancers

Non-Coding RNA ◽

10.3390/ncrna5030046 ◽

2019 ◽

Vol 5 (3) ◽

pp. 46 ◽

Cited By ~ 13

Author(s):

Anton Goustin ◽

Pattaraporn Thepsuwan ◽

Mary Kosir ◽

Leonard Lipovich

Keyword(s):

Genomic Dna ◽

Growth Arrest ◽

Evolutionary Conservation ◽

Open Reading Frames ◽

Protein Coding ◽

Protein Coding Genes ◽

Non Coding Rna ◽

Mammalian Genomes ◽

Long Non Coding Rna ◽

Reading Frames

Long non-coding RNA (lncRNA) genes encode non-messenger RNAs that lack open reading frames (ORFs) longer than 300 nucleotides, lack evolutionary conservation in their shorter ORFs, and do not belong to any classical non-coding RNA category. LncRNA genes equal, or exceed in number, protein-coding genes in mammalian genomes. Most mammalian genomes harbor ~20,000 protein-coding genes that give rise to conventional messenger RNA (mRNA) transcripts. These coding genes exhibit sweeping evolutionary conservation in their ORFs. LncRNAs function via different mechanisms, including but not limited to: (1) serving as “enhancer” RNAs regulating nearby coding genes in cis; (2) functioning as scaffolds to create ribonucleoprotein (RNP) complexes; (3) serving as sponges for microRNAs; (4) acting as ribo-mimics of consensus transcription factor binding sites in genomic DNA; (5) hybridizing to other nucleic acids (mRNAs and genomic DNA); and, rarely, (6) as templates encoding small open reading frames (smORFs) that may encode short proteins. Any given lncRNA may have more than one of these functions. This review focuses on one fascinating case—the growth-arrest-specific (GAS)-5 gene, encoding a complicated repertoire of alternatively-spliced lncRNA isoforms. GAS5 is also a host gene of numerous small nucleolar (sno) RNAs, which are processed from its introns. Publications about this lncRNA date back over three decades, covering its role in cell proliferation, cell differentiation, and cancer. The GAS5 story has drawn in contributions from prominent molecular geneticists who attempted to define its tumor suppressor function in mechanistic terms. The evidence suggests that rodent Gas5 and human GAS5 functions may be different, despite the conserved multi-exonic architecture featuring intronic snoRNAs, and positional conservation on syntenic chromosomal regions indicating that the rodent Gas5 gene is the true ortholog of the GAS5 gene in man and other apes. 
There is no single answer to the molecular mechanism of GAS5 action. Our goal here is to summarize competing, not mutually exclusive, mechanistic explanations of GAS5 function that have compelling experimental support.
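The length criterion at the start of this abstract (no ORF longer than 300 nucleotides) can be illustrated with a small scan of a transcript sequence. This is a minimal sketch under stated assumptions: an ORF is taken to run from an in-frame ATG to the first in-frame stop codon, only the three forward frames are scanned, and the reported length includes the stop codon. The function names are hypothetical, and real lncRNA classification also weighs the conservation and category criteria the abstract mentions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(transcript):
    """Length in nucleotides of the longest ATG-to-stop ORF in the forward frames."""
    seq = transcript.upper().replace("U", "T")  # accept RNA or DNA letters
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)  # include the stop codon
                start = None
    return best


def exceeds_orf_cutoff(transcript, cutoff=300):
    """True if the transcript has an ORF longer than the cutoff (300 nt by default)."""
    return longest_orf_length(transcript) > cutoff
```

A transcript for which `exceeds_orf_cutoff` returns `True` would fail the ORF-length part of the definition, although passing it says nothing about the other two criteria.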

Long-read cDNA sequencing identifies functional pseudogenes in the human transcriptome

Genome Biology ◽

10.1186/s13059-021-02369-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Robin-Lee Troskie ◽

Yohaann Jafrani ◽

Tim R. Mercer ◽

Adam D. Ewing ◽

Geoffrey J. Faulkner ◽

...

Keyword(s):

Cultured Cells ◽

Open Reading Frames ◽

Cdna Sequencing ◽

Protein Coding ◽

Dynamic Component ◽

Gene Copies ◽

Long Read ◽

Normal Human ◽

Reading Frames ◽

Transcriptional Landscape

Pseudogenes are gene copies presumed to mainly be functionless relics of evolution due to acquired deleterious mutations or transcriptional silencing. Using deep full-length PacBio cDNA sequencing of normal human tissues and cancer cell lines, we identify here hundreds of novel transcribed pseudogenes expressed in tissue-specific patterns. Some pseudogene transcripts have intact open reading frames and are translated in cultured cells, representing unannotated protein-coding genes. To assess the biological impact of noncoding pseudogenes, we CRISPR-Cas9 delete the nucleus-enriched pseudogene PDCL3P4 and observe hundreds of perturbed genes. This study highlights pseudogenes as a complex and dynamic component of the human transcriptional landscape.

Disrupting upstream translation in mRNAs is associated with human disease

Nature Communications ◽

10.1038/s41467-021-21812-1 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

David S. M. Lee ◽

Joseph Park ◽

Andrew Kromer ◽

Aris Baras ◽

Daniel J. Rader ◽

...

Keyword(s):

Protein Expression ◽

Biological Significance ◽

Ribosome Profiling ◽

Open Reading Frames ◽

Protein Coding ◽

Stop Codons ◽

Human Genes ◽

Strong Negative Selection ◽

Disease Associations ◽

Reading Frames

Ribosome profiling has uncovered pervasive translation in non-canonical open reading frames; however, the biological significance of this phenomenon remains unclear. Using genetic variation from 71,702 human genomes, we assess patterns of selection in translated upstream open reading frames (uORFs) in 5′ UTRs. We show that uORF variants introducing new stop codons, or strengthening existing stop codons, are under strong negative selection comparable to protein-coding missense variants. Using these variants, we map and validate gene-disease associations in two independent biobanks containing exome sequencing from 10,900 and 32,268 individuals, respectively, and elucidate their impact on protein expression in human cells. Our results suggest translation-disrupting mechanisms relating uORF variation to reduced protein expression, and demonstrate that translation at uORFs is genetically constrained in 50% of human genes.

Sequence and domain organization of scruin, an actin-cross-linking protein in the acrosomal process of Limulus sperm.

The Journal of Cell Biology ◽

10.1083/jcb.128.1.51 ◽

1995 ◽

Vol 128 (1) ◽

pp. 51-60 ◽

Cited By ~ 46

Author(s):

M Way ◽

M Sanders ◽

C Garcia ◽

J Sakai ◽

P Matsudaira

Keyword(s):

Amino Acid ◽

Actin Filaments ◽

Limited Proteolysis ◽

Open Reading Frames ◽

Actin Binding ◽

Molar Ratio ◽

Domain Organization ◽

Ring Canals ◽

Drosophila Protein ◽

Reading Frames

The acrosomal process of Limulus sperm is an 80-μm-long finger of membrane supported by a crystalline bundle of actin filaments. The filaments in this bundle are crosslinked by a 102-kD protein, scruin, present in a 1:1 molar ratio with actin. Recent image reconstruction of scruin-decorated actin filaments at 13-Å resolution shows that scruin is organized into two equally sized domains bound to separate actin subunits in the same filament. We have cloned and sequenced the gene for scruin from a Limulus testes cDNA library. The deduced amino acid sequence of scruin reflects the domain organization of scruin: it consists of a tandem pair of homologous domains joined by a linker region. The domain organization of scruin is confirmed by limited proteolysis of the purified acrosomal process. Three different proteases cleave the native protein in a 5-kD protease-sensitive region in the middle of the molecule to generate NH2-terminal 47-kD and COOH-terminal 56-kD protease-resistant domains. Although the protein sequence of scruin has no homology to any known actin-binding protein, it has similarities to several proteins, including four open reading frames of unknown function in poxviruses, as well as kelch, a Drosophila protein localized to actin-rich ring canals. All proteins that show homologies to scruin are characterized by the presence of an approximately 50-amino acid residue motif that is repeated between two and seven times. Crystallographic studies reveal this motif represents a four beta-stranded fold that is characteristic of the "superbarrel" structural fold found in the sialidase family of proteins. These results suggest that the two domains of scruin seen in EM reconstructions are superbarrel folds, and they present the possibility that other members of this family may also bind actin.

Inferring the frequency spectrum of derived variants to quantify adaptive molecular evolution in protein-coding genes of Drosophila melanogaster

10.1101/039404 ◽

2016 ◽

Cited By ~ 1

Author(s):

Peter D. Keightley ◽

Jose Campos ◽

Tom Booker ◽

Brian Charlesworth

Keyword(s):

Amino Acid ◽

Molecular Evolution ◽

Frequency Spectrum ◽

Population Genomics ◽

Selective Constraint ◽

Accurate Estimation ◽

Protein Coding ◽

Protein Coding Genes ◽

Adaptive Molecular Evolution ◽

Amino Acid Mutations

Many approaches for inferring adaptive molecular evolution analyze the unfolded site frequency spectrum (SFS), a vector of counts of sites with different numbers of copies of derived alleles in a sample of alleles from a population. Accurate inference of the high copy number elements of the SFS is difficult, however, because of misassignment of alleles as derived versus ancestral. This is a known problem with parsimony using outgroup species. Here, we show that the problem is particularly serious if there is variation in the substitution rate among sites brought about by variation in selective constraint levels. We present a new method for inferring the SFS using one or two outgroups, which attempts to overcome the problem of misassignment. We show that two outgroups are required for accurate estimation of the SFS if there is substantial variation in selective constraints, which is expected to be the case for nonsynonymous sites of protein-coding genes. We apply the method to estimate unfolded SFSs for synonymous and nonsynonymous sites from Phase 2 of the Drosophila Population Genomics Project. We use the unfolded spectra to estimate the frequency and strength of advantageous and deleterious mutations, and estimate that ~50% of amino acid substitutions are positively selected, but that less than 0.5% of new amino acid mutations are beneficial, with a scaled selection strength of N_e s ≈ 12.
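To make the definition of the unfolded SFS concrete, the sketch below builds one from per-site allele samples. It deliberately uses the naive single-outgroup parsimony rule that the abstract identifies as error-prone (the outgroup allele is assumed to be ancestral), so it illustrates the data structure and the source of misassignment rather than the authors' inference method. The function names and the biallelic-site assumption are hypothetical.

```python
def derived_count_by_parsimony(sample_alleles, outgroup_allele):
    """Count sampled alleles that differ from the outgroup allele.

    Naive parsimony: the outgroup state is assumed to be ancestral. If the
    outgroup lineage itself mutated, derived and ancestral are swapped,
    which is the misassignment problem described above.
    """
    return sum(1 for allele in sample_alleles if allele != outgroup_allele)


def unfolded_sfs(derived_counts, n_alleles):
    """Return a vector where entry i is the number of sites with i derived copies."""
    sfs = [0] * (n_alleles + 1)
    for count in derived_counts:
        if not 0 <= count <= n_alleles:
            raise ValueError("derived count outside 0..n_alleles")
        sfs[count] += 1
    return sfs


def sfs_from_sites(sites, n_alleles):
    """Build the unfolded SFS from (sample_alleles, outgroup_allele) pairs."""
    counts = [derived_count_by_parsimony(s, o) for s, o in sites]
    return unfolded_sfs(counts, n_alleles)
```

For a sample of four alleles, `sfs_from_sites([("AAAT", "A"), ("CCCC", "C"), ("GGTT", "T")], 4)` yields one site each in the 0, 1, and 2 derived-copy classes. If the outgroup at the first site had instead carried `T`, the same data would be counted as three derived copies, which shows how misassignment inflates the high copy number end of the spectrum.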

Overexpression-based detection of translatable circular RNAs is vulnerable to coexistent linear RNA byproducts

10.1101/2021.03.23.433163 ◽

2021 ◽

Author(s):

Yanyi Jiang ◽

Xiaofan Chen ◽

Wei Zhang

Keyword(s):

Open Reading Frames ◽

Systematic Evaluation ◽

Circular Rnas ◽

Protein Coding ◽

Rolling Circle ◽

Functional Investigation ◽

Overexpression System ◽

Translation Signals ◽

Coding Potential ◽

Reading Frames

In the RNA field, the demarcation between coding and non-coding has been renegotiated by the recent discovery of occasionally translated circular RNAs (circRNAs). Although they lack a 5′ cap structure, circRNAs can be translated cap-independently. Complementary intron-mediated overexpression is one of the most widely used methodologies in circRNA research, but it has drawn persistent skepticism for its poorly defined mechanism and latent coexistent side products. In this study, leveraging such a circRNA overexpression system, we have interrogated the protein-coding potential of 30 human circRNAs containing infinite open reading frames in HEK293T cells. Surprisingly, pervasive translation signals are detected by immunoblotting. However, intensive mutagenesis reveals that numerous translation signals are generated independently of circRNA synthesis. We have developed a dual tag strategy to isolate translation noise and directly demonstrate that the fallacious translation signals originate from cryptically spliced linear transcripts. The concomitant linear RNA byproducts, presumably concatemers, can be translated to produce pseudo rolling circle translation signals, and can include the backsplicing junction (BSJ), which disqualifies BSJ-based evidence for circRNA translation. We also find that non-AUG start codons may engage in the translation initiation of circRNAs. Taken together, our systematic evaluation sheds light on heterogeneous translational outputs from circRNA overexpression vectors and comes with a caveat that the ectopic overexpression technique necessitates extremely rigorous control setups in circRNA translation and functional investigation.

A micropeptide encoded by lncRNA MIR155HG suppresses autoimmune inflammation via modulating antigen presentation

Science Advances ◽

10.1126/sciadv.aaz2059 ◽

2020 ◽

Vol 6 (21) ◽

pp. eaaz2059 ◽

Cited By ~ 4

Author(s):

Liman Niu ◽

Fangzhou Lou ◽

Yang Sun ◽

Libo Sun ◽

Xiaojie Cai ◽

...

Keyword(s):

Antigen Presentation ◽

Inflammatory Diseases ◽

Open Reading Frames ◽

Protein Coding ◽

Histocompatibility Complex ◽

Antigen Trafficking ◽

Heat Shock Cognate Protein ◽

Antigen Presenting ◽

Cognate Protein ◽

Reading Frames

Many annotated long noncoding RNAs (lncRNAs) harbor predicted short open reading frames (sORFs), but the coding capacities of these sORFs and the functions of the resulting micropeptides remain elusive. Here, we report that human lncRNA MIR155HG encodes a 17–amino acid micropeptide, which we termed miPEP155 (P155). MIR155HG is highly expressed by inflamed antigen-presenting cells, leading to the discovery that P155 interacts with the adenosine 5′-triphosphate binding domain of heat shock cognate protein 70 (HSC70), a chaperone required for antigen trafficking and presentation in dendritic cells (DCs). P155 modulates major histocompatibility complex class II–mediated antigen presentation and T cell priming by disrupting the HSC70-HSP90 machinery. Exogenously injected P155 improves two classical mouse models of DC-driven autoinflammation. Collectively, we demonstrate the endogenous existence of a micropeptide encoded by a transcript annotated as “non-protein coding” and characterize a micropeptide as a regulator of antigen presentation and a suppressor of inflammatory diseases.
