Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes

2021 ◽  
Vol 9 (1) ◽  
pp. 129
Author(s):  
Katelyn McNair ◽  
Carol L. Ecale Zhou ◽  
Brian Souza ◽  
Stephanie Malfatti ◽  
Robert A. Edwards

One of the main steps in gene-finding in prokaryotes is determining which open reading frames encode a protein, and which occur by chance alone. There are many different methods to differentiate the two; the most prevalent approach is using shared homology with a database of known genes. This method presents many pitfalls, most notably the catch that you only find genes that you have seen before. The four most popular prokaryotic gene-prediction programs (GeneMark, Glimmer, Prodigal, PHANOTATE) all use a protein-coding training model to predict protein-coding genes, with the latter three allowing for the training model to be created ab initio from the input genome. Different methods are available for creating the training model, and to increase the accuracy of such tools, we present here GOODORFS, a method for identifying protein-coding genes within a set of all possible open reading frames (ORFs). Our workflow takes the amino acid frequencies of each ORF, calculates an entropy density profile (EDP), clusters the EDPs with KMeans, and then selects the cluster with the lowest variation as the coding ORFs. To test the efficacy of our method, we ran GOODORFS on 14,179 annotated phage genomes, and compared our results to the initial training-set creation step of four other similar methods (Glimmer, MED2, PHANOTATE, Prodigal). We found that GOODORFS was the most accurate (0.94) and had the best F1-score (0.85), while Glimmer had the highest precision (0.92) and PHANOTATE had the highest recall (0.96).
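The first two steps of the workflow described above (amino acid frequencies, then an entropy density profile) can be sketched as follows. The abstract does not give the EDP formula, so this sketch assumes a commonly used definition: each component is s_i = -(p_i log p_i) / H, where p_i is the frequency of amino acid i and H is the Shannon entropy of the frequencies. The function name `entropy_density_profile` and the 20-letter alphabet ordering are illustrative choices, not taken from GOODORFS itself.

```python
import math
from collections import Counter

# Assumed alphabet: the 20 standard amino acids, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def entropy_density_profile(protein):
    """Return a 20-element entropy density profile for a translated ORF.

    Assumed definition (not stated in the abstract):
        s_i = -(p_i * log(p_i)) / H,  with  H = -sum_i p_i * log(p_i)
    """
    # Count only standard residues; stop symbols and ambiguity codes are skipped.
    counts = Counter(aa for aa in protein.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")

    freqs = [counts[aa] / total for aa in AMINO_ACIDS]
    # Per-residue entropy contributions; 0 * log(0) is treated as 0.
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)
    if entropy == 0:
        # A single residue type has zero entropy, so the profile is undefined;
        # returning zeros is a choice made for this sketch.
        return [0.0] * len(AMINO_ACIDS)
    return [t / entropy for t in terms]


edp = entropy_density_profile("MKVLAAGIVG")
```

Under this definition the components of each profile sum to 1, so every ORF becomes a point in a 20-dimensional space. Those points would then be passed to a clustering routine such as KMeans, as the abstract describes.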

2004 ◽  
Vol 78 (20) ◽  
pp. 11187-11197 ◽  
Author(s):  
Lisa M. Kattenhorn ◽  
Ryan Mills ◽  
Markus Wagner ◽  
Alexandre Lomsadze ◽  
Vsevolod Makeev ◽  
...  

ABSTRACT Proteins associated with the murine cytomegalovirus (MCMV) viral particle were identified by a combined approach of proteomic and genomic methods. Purified MCMV virions were dissociated by complete denaturation and subjected to either separation by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and in-gel digestion or treated directly by in-solution tryptic digestion. Peptides were separated by nanoflow liquid chromatography and analyzed by tandem mass spectrometry (LC-MS/MS). The MS/MS spectra obtained were searched against a database of MCMV open reading frames (ORFs) predicted to be protein coding by an MCMV-specific version of the gene prediction algorithm GeneMarkS. We identified 38 proteins from the capsid, tegument, glycoprotein, replication, and immunomodulatory protein families, as well as 20 genes of unknown function. Observed irregularities in coding potential suggested possible sequence errors in the 3′-proximal ends of m20 and M31. These errors were experimentally confirmed by sequencing analysis. The MS data further indicated the presence of peptides derived from the unannotated ORFs ORFc225441-226898 (m166.5) and ORF105932-106072. Immunoblot experiments confirmed expression of m166.5 during viral infection.


2019 ◽  
Vol 294 (3) ◽  
pp. 637-647 ◽  
Author(s):  
Yong Wang ◽  
Zhen Zeng ◽  
Tian-Lei Liu ◽  
Ling Sun ◽  
Qin Yao ◽  
...  

2018 ◽  
Vol 64 (5) ◽  
pp. 339-348 ◽  
Author(s):  
Talal George Abboud ◽  
Abdullah Zubaer ◽  
Alvan Wai ◽  
Georg Hausner

Ophiostoma novo-ulmi, a member of the Ophiostomatales (Ascomycota), is the causal agent of the current Dutch elm disease pandemic in Europe and North America. The complete mitochondrial genome (mtDNA) of Ophiostoma novo-ulmi subsp. novo-ulmi, the European component of O. novo-ulmi, has been sequenced and annotated. Gene order (synteny) among the currently available members of the Ophiostomatales was examined and appears to be conserved, and mtDNA size variability among the Ophiostomatales is due in part to the presence of introns and their encoded open reading frames. Phylogenetic analysis of concatenated mitochondrial protein-coding genes yielded phylogenetic estimates for various members of the Ophiostomatales, with strong statistical support showing that mtDNA analysis may provide valuable insights into the evolution of the Ophiostomatales.


2019 ◽  
Vol 5 (3) ◽  
pp. 46 ◽  
Author(s):  
Anton Goustin ◽  
Pattaraporn Thepsuwan ◽  
Mary Kosir ◽  
Leonard Lipovich

Long non-coding RNA (lncRNA) genes encode non-messenger RNAs that lack open reading frames (ORFs) longer than 300 nucleotides, lack evolutionary conservation in their shorter ORFs, and do not belong to any classical non-coding RNA category. LncRNA genes equal, or exceed in number, protein-coding genes in mammalian genomes. Most mammalian genomes harbor ~20,000 protein-coding genes that give rise to conventional messenger RNA (mRNA) transcripts. These coding genes exhibit sweeping evolutionary conservation in their ORFs. LncRNAs function via different mechanisms, including but not limited to: (1) serving as “enhancer” RNAs regulating nearby coding genes in cis; (2) functioning as scaffolds to create ribonucleoprotein (RNP) complexes; (3) serving as sponges for microRNAs; (4) acting as ribo-mimics of consensus transcription factor binding sites in genomic DNA; (5) hybridizing to other nucleic acids (mRNAs and genomic DNA); and, rarely, (6) acting as templates encoding small open reading frames (smORFs) that may encode short proteins. Any given lncRNA may have more than one of these functions. This review focuses on one fascinating case: the growth-arrest-specific (GAS)-5 gene, encoding a complicated repertoire of alternatively spliced lncRNA isoforms. GAS5 is also a host gene of numerous small nucleolar (sno) RNAs, which are processed from its introns. Publications about this lncRNA date back over three decades, covering its role in cell proliferation, cell differentiation, and cancer. The GAS5 story has drawn in contributions from prominent molecular geneticists who attempted to define its tumor suppressor function in mechanistic terms. The evidence suggests that rodent Gas5 and human GAS5 functions may be different, despite the conserved multi-exonic architecture featuring intronic snoRNAs, and positional conservation on syntenic chromosomal regions indicating that the rodent Gas5 gene is the true ortholog of the GAS5 gene in man and other apes. There is no single answer to the molecular mechanism of GAS5 action. Our goal here is to summarize competing, not mutually exclusive, mechanistic explanations of GAS5 function that have compelling experimental support.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Robin-Lee Troskie ◽  
Yohaann Jafrani ◽  
Tim R. Mercer ◽  
Adam D. Ewing ◽  
Geoffrey J. Faulkner ◽  
...  

Abstract Pseudogenes are gene copies presumed to be mainly functionless relics of evolution due to acquired deleterious mutations or transcriptional silencing. Using deep full-length PacBio cDNA sequencing of normal human tissues and cancer cell lines, we identify here hundreds of novel transcribed pseudogenes expressed in tissue-specific patterns. Some pseudogene transcripts have intact open reading frames and are translated in cultured cells, representing unannotated protein-coding genes. To assess the biological impact of noncoding pseudogenes, we delete the nucleus-enriched pseudogene PDCL3P4 using CRISPR-Cas9 and observe hundreds of perturbed genes. This study highlights pseudogenes as a complex and dynamic component of the human transcriptional landscape.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
David S. M. Lee ◽  
Joseph Park ◽  
Andrew Kromer ◽  
Aris Baras ◽  
Daniel J. Rader ◽  
...  

Abstract Ribosome profiling has uncovered pervasive translation in non-canonical open reading frames; however, the biological significance of this phenomenon remains unclear. Using genetic variation from 71,702 human genomes, we assess patterns of selection in translated upstream open reading frames (uORFs) in 5′ UTRs. We show that uORF variants introducing new stop codons, or strengthening existing stop codons, are under strong negative selection comparable to protein-coding missense variants. Using these variants, we map and validate gene-disease associations in two independent biobanks containing exome sequencing from 10,900 and 32,268 individuals, respectively, and elucidate their impact on protein expression in human cells. Our results suggest translation-disrupting mechanisms relating uORF variation to reduced protein expression, and demonstrate that translation at uORFs is genetically constrained in 50% of human genes.


1995 ◽  
Vol 128 (1) ◽  
pp. 51-60 ◽  
Author(s):  
M Way ◽  
M Sanders ◽  
C Garcia ◽  
J Sakai ◽  
P Matsudaira

The acrosomal process of Limulus sperm is an 80-µm-long finger of membrane supported by a crystalline bundle of actin filaments. The filaments in this bundle are crosslinked by a 102-kD protein, scruin, present in a 1:1 molar ratio with actin. Recent image reconstruction of scruin-decorated actin filaments at 13-Å resolution shows that scruin is organized into two equally sized domains bound to separate actin subunits in the same filament. We have cloned and sequenced the gene for scruin from a Limulus testes cDNA library. The deduced amino acid sequence of scruin reflects the domain organization of scruin: it consists of a tandem pair of homologous domains joined by a linker region. The domain organization of scruin is confirmed by limited proteolysis of the purified acrosomal process. Three different proteases cleave the native protein in a 5-kD protease-sensitive region in the middle of the molecule to generate an NH2-terminal 47-kD domain and a COOH-terminal 56-kD domain, both protease-resistant. Although the protein sequence of scruin has no homology to any known actin-binding protein, it has similarities to several proteins, including four open reading frames of unknown function in poxviruses, as well as kelch, a Drosophila protein localized to actin-rich ring canals. All proteins that show homologies to scruin are characterized by the presence of an approximately 50-amino acid residue motif that is repeated between two and seven times. Crystallographic studies reveal that this motif represents a four beta-stranded fold that is characteristic of the "superbarrel" structural fold found in the sialidase family of proteins. These results suggest that the two domains of scruin seen in EM reconstructions are superbarrel folds, and they present the possibility that other members of this family may also bind actin.


2016 ◽  
Author(s):  
Peter D. Keightley ◽  
Jose Campos ◽  
Tom Booker ◽  
Brian Charlesworth

Many approaches for inferring adaptive molecular evolution analyze the unfolded site frequency spectrum (SFS), a vector of counts of sites with different numbers of copies of derived alleles in a sample of alleles from a population. Accurate inference of the high copy number elements of the SFS is difficult, however, because of misassignment of alleles as derived versus ancestral. This is a known problem with parsimony using outgroup species. Here, we show that the problem is particularly serious if there is variation in the substitution rate among sites brought about by variation in selective constraint levels. We present a new method for inferring the SFS using one or two outgroups, which attempts to overcome the problem of misassignment. We show that two outgroups are required for accurate estimation of the SFS if there is substantial variation in selective constraints, which is expected to be the case for nonsynonymous sites of protein-coding genes. We apply the method to estimate unfolded SFSs for synonymous and nonsynonymous sites from Phase 2 of the Drosophila Population Genomics Project. We use the unfolded spectra to estimate the frequency and strength of advantageous and deleterious mutations, and estimate that ~50% of amino acid substitutions are positively selected, but that less than 0.5% of new amino acid mutations are beneficial, with a scaled selection strength of Nes ≈ 12.
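The definition of the unfolded SFS given above, and the effect of misassigning an allele as derived versus ancestral, can be illustrated with a small sketch. It is not the inference method of the paper: the function names `unfolded_sfs` and `mispolarise` and the toy data are hypothetical, and the sketch only shows how a polarisation error moves a site from class k to class n - k.

```python
def unfolded_sfs(derived_counts, n_alleles):
    """Count sites by number of derived allele copies (0..n_alleles)."""
    sfs = [0] * (n_alleles + 1)
    for k in derived_counts:
        if not 0 <= k <= n_alleles:
            raise ValueError(f"derived count {k} outside 0..{n_alleles}")
        sfs[k] += 1
    return sfs


def mispolarise(derived_counts, n_alleles, flipped_sites):
    """Swap ancestral and derived states at the given site indices.

    A site with k derived copies is then recorded as having n_alleles - k.
    """
    return [
        n_alleles - k if i in flipped_sites else k
        for i, k in enumerate(derived_counts)
    ]


# Toy data: derived allele counts at six sites in a sample of six alleles.
counts = [1, 1, 2, 1, 5, 3]
true_sfs = unfolded_sfs(counts, 6)
# Misassigning site 0 turns a singleton into a high-frequency derived allele.
observed_sfs = unfolded_sfs(mispolarise(counts, 6, {0}), 6)
```

Because most true derived alleles are rare, even a small misassignment rate can noticeably inflate the high copy number classes, which is the difficulty the abstract describes.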


2021 ◽  
Author(s):  
Yanyi Jiang ◽  
Xiaofan Chen ◽  
Wei Zhang

Abstract In the RNA field, the demarcation between coding and non-coding has been challenged by the recent discovery of occasionally translated circular RNAs (circRNAs). Although they lack a 5′ cap structure, circRNAs can be translated cap-independently. Complementary intron-mediated overexpression is one of the most widely used methods in circRNA research, but it has drawn persistent skepticism because of its poorly defined mechanism and potential coexisting side products. In this study, leveraging such a circRNA overexpression system, we interrogated the protein-coding potential of 30 human circRNAs containing infinite open reading frames in HEK293T cells. Surprisingly, pervasive translation signals were detected by immunoblotting. However, intensive mutagenesis revealed that numerous translation signals are generated independently of circRNA synthesis. We developed a dual-tag strategy to isolate translation noise and directly demonstrate that the fallacious translation signals originate from cryptically spliced linear transcripts. The concomitant linear RNA byproducts, presumably concatemers, can be translated to produce pseudo rolling-circle translation signals, and can contain the backsplicing junction (BSJ), which disqualifies BSJ-based evidence for circRNA translation. We also find that non-AUG start codons may engage in the translation initiation of circRNAs. Taken together, our systematic evaluation sheds light on the heterogeneous translational outputs of circRNA overexpression vectors and carries the caveat that ectopic overexpression requires extremely rigorous controls in circRNA translation and functional studies.


2020 ◽  
Vol 6 (21) ◽  
pp. eaaz2059 ◽  
Author(s):  
Liman Niu ◽  
Fangzhou Lou ◽  
Yang Sun ◽  
Libo Sun ◽  
Xiaojie Cai ◽  
...  

Many annotated long noncoding RNAs (lncRNAs) harbor predicted short open reading frames (sORFs), but the coding capacities of these sORFs and the functions of the resulting micropeptides remain elusive. Here, we report that human lncRNA MIR155HG encodes a 17-amino acid micropeptide, which we termed miPEP155 (P155). MIR155HG is highly expressed by inflamed antigen-presenting cells, leading to the discovery that P155 interacts with the adenosine 5′-triphosphate binding domain of heat shock cognate protein 70 (HSC70), a chaperone required for antigen trafficking and presentation in dendritic cells (DCs). P155 modulates major histocompatibility complex class II-mediated antigen presentation and T cell priming by disrupting the HSC70-HSP90 machinery. Exogenously injected P155 improves two classical mouse models of DC-driven autoinflammation. Collectively, we demonstrate the endogenous existence of a micropeptide encoded by a transcript annotated as “non-protein coding” and characterize a micropeptide as a regulator of antigen presentation and a suppressor of inflammatory diseases.

