Nearly all new protein-coding predictions in the CHESS database are not protein-coding

10.1101/360602 ◽

2018 ◽

Cited By ~ 5

Author(s):

Irwin Jungreis ◽

Michael L. Tress ◽

Jonathan Mudge ◽

Cristina Sisu ◽

Toby Hunt ◽

...

Keyword(s):

Mass Spectrometry ◽

False Positive ◽

Human Gene ◽

Noncoding RNAs ◽

Evolutionary Conservation ◽

Protein Domain ◽

Human Tissues ◽

Protein Coding ◽

Gene Annotations ◽

New Protein

In a 2018 paper posted to bioRxiv, Pertea et al. presented the CHESS database, a new catalog of human gene annotations that includes 1,178 new protein-coding predictions. These are based on evidence of transcription in human tissues and homology to earlier annotations in human and other mammals. Here, we reanalyze the evidence used by CHESS, and find that nearly all protein-coding predictions are false positives. We find that 86% overlap transposons marked by RepeatMasker that are known to frequently result in false positive protein-coding predictions. More than half are homologous to only nine Alu-derived primate sequences corresponding to an erroneous and previously withdrawn Pfam protein domain. The entire set shows poor evolutionary conservation and PhyloCSF protein-coding evolutionary signatures indistinguishable from noncoding RNAs, indicating lack of protein-coding constraint. Only four predictions are supported by mass spectrometry evidence, and even those matches are inconclusive. Overall, the new protein-coding predictions are unsupported by any credible experimental or evolutionary evidence of function, result primarily from homology to genes incorrectly classified as protein-coding, and are unlikely to encode functional proteins.

Incomplete annotation has a disproportionate impact on our understanding of Mendelian and complex neurogenetic disorders

Science Advances ◽

10.1126/sciadv.aay8299 ◽

2020 ◽

Vol 6 (24) ◽

pp. eaay8299 ◽

Cited By ~ 7

Author(s):

David Zhang ◽

Sebastian Guelfi ◽

Sonia Garcia-Ruiz ◽

Beatrice Costa ◽

Regina H. Reynolds ◽

...

Keyword(s):

Human Gene ◽

Gene Annotation ◽

Tissue Expression ◽

Mendelian Inheritance ◽

Disease Genes ◽

Human Tissues ◽

Sequencing Data ◽

Protein Coding ◽

Neurogenetic Disorders ◽

Different Tissues

Growing evidence suggests that human gene annotation remains incomplete; however, it is unclear how this affects different tissues and our understanding of different disorders. Here, we detect previously unannotated transcription from Genotype-Tissue Expression RNA sequencing data across 41 human tissues. We connect this unannotated transcription to known genes, confirming that human gene annotation remains incomplete, even among well-studied genes including 63% of the Online Mendelian Inheritance in Man–morbid catalog and 317 neurodegeneration-associated genes. We find the greatest abundance of unannotated transcription in brain, and genes highly expressed in brain are more likely to be reannotated. We explore examples of reannotated disease genes, such as SNCA, for which we experimentally validate a previously unidentified, brain-specific, potentially protein-coding exon. We release all tissue-specific transcriptomes through vizER: http://rytenlab.com/browser/app/vizER. We anticipate that this resource will facilitate more accurate genetic analysis, with the greatest impact on our understanding of Mendelian and complex neurogenetic disorders.

Faculty Opinions recommendation of Evaluating phylogenetic informativeness and data-type usage for new protein-coding genes across Vertebrata.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.13360106.14730258 ◽

2011 ◽

Author(s):

Rafael Zardoya ◽

Diego San Mauro

Keyword(s):

Data Type ◽

Protein Coding ◽

Protein Coding Genes ◽

Phylogenetic Informativeness ◽

New Protein

PlncRNADB: A Repository of Plant lncRNAs and lncRNA-RBP Protein Interactions

Current Bioinformatics ◽

10.2174/1574893614666190131161002 ◽

2019 ◽

Vol 14 (7) ◽

pp. 621-627 ◽

Cited By ~ 3

Author(s):

Youhuang Bai ◽

Xiaozhuan Dai ◽

Tiantian Ye ◽

Peijing Zhang ◽

Xu Yan ◽

...

Keyword(s):

Protein Interactions ◽

Binding Proteins ◽

RNA Binding ◽

RNA Binding Proteins ◽

Populus Trichocarpa ◽

Noncoding RNAs ◽

Reference Database ◽

Protein Coding ◽

Arabidopsis Lyrata ◽

User Friendly

Background: Long noncoding RNAs (lncRNAs) are endogenous noncoding RNAs, arbitrarily longer than 200 nucleotides, that play critical roles in diverse biological processes. LncRNAs exist in different genomes ranging from animals to plants. Objective: PlncRNADB is a searchable database of lncRNA sequences and annotation in plants. Methods: We built a pipeline for lncRNA prediction in plants, providing a convenient utility for users to quickly distinguish potential noncoding RNAs from protein-coding transcripts. Results: More than five thousand lncRNAs are collected from four plant species (Arabidopsis thaliana, Arabidopsis lyrata, Populus trichocarpa and Zea mays) in PlncRNADB. Moreover, our database provides the relationship between lncRNAs and various RNA-binding proteins (RBPs), which can be displayed through a user-friendly web interface. Conclusion: PlncRNADB can serve as a reference database to investigate the lncRNAs and their interaction with RNA-binding proteins in plants. The PlncRNADB is freely available at http://bis.zju.edu.cn/PlncRNADB/.

Chromosomal assembly of the nuclear genome of the endosymbiont-bearing trypanosomatid Angomonas deanei

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkaa018 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

John W Davey ◽

Carolina M C Catta-Preta ◽

Sally James ◽

Sarah Forrester ◽

Maria Cristina M Motta ◽

...

Keyword(s):

Chromosome Number ◽

Noncoding RNAs ◽

Nuclear Genome ◽

Supernumerary Chromosome ◽

Ribosomal RNAs ◽

Protein Coding ◽

Transfer RNAs ◽

Protein Coding Genes ◽

Oxford Nanopore ◽

Genome Assemblies

Angomonas deanei is an endosymbiont-bearing trypanosomatid with several highly fragmented genome assemblies and unknown chromosome number. We present an assembly of the A. deanei nuclear genome based on Oxford Nanopore sequence that resolves into 29 complete or close-to-complete chromosomes. The assembly has several previously unknown special features; it has a supernumerary chromosome, a chromosome with a 340-kb inversion, and there is a translocation between two chromosomes. We also present an updated annotation of the chromosomal genome with 10,365 protein-coding genes, 59 transfer RNAs, 26 ribosomal RNAs, and 62 noncoding RNAs.

Genome Sequence of the Asian Honeybee in Pakistan Sheds Light on Its Phylogenetic Relationship with Other Honeybees

Insects ◽

10.3390/insects12070652 ◽

2021 ◽

Vol 12 (7) ◽

pp. 652

Author(s):

Hongwei Tan ◽

Muhammad Naeem ◽

Hussain Ali ◽

Muhammad Shakeel ◽

Haiou Kuang ◽

...

Keyword(s):

Phylogenetic Relationship ◽

Genome Sequence ◽

Apis Cerana ◽

GC Content ◽

Protein Domain ◽

Pollination Services ◽

Protein Coding ◽

Close Relationship ◽

Genome Scale ◽

Asian Honeybee

In Pakistan, Apis cerana, the Asian honeybee, has been used for honey production and pollination services. However, its genomic makeup and phylogenetic relationship with those in other countries are still unknown. We collected A. cerana samples from the main cerana-keeping region in Pakistan and performed whole genome sequencing. A total of 28 Gb of Illumina shotgun reads were generated, which were used to assemble the genome. The obtained genome assembly had a total length of 214 Mb, with a GC content of 32.77%. The assembly had a scaffold N50 of 2.85 Mb and a BUSCO completeness score of 99%, suggesting a remarkably complete genome sequence for A. cerana in Pakistan. A MAKER pipeline was employed to annotate the genome sequence, and a total of 11,864 protein-coding genes were identified. Of them, 6750 genes were assigned at least one GO term, and 8813 genes were annotated with at least one protein domain. Genome-scale phylogeny analysis indicated an unexpectedly close relationship between A. cerana in Pakistan and those in China, suggesting a potential human introduction of the species between the two countries. Our results will facilitate the genetic improvement and conservation of A. cerana in Pakistan.
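
The scaffold N50 quoted in the abstract above is a standard assembly contiguity statistic: the length L such that scaffolds of length L or longer together cover at least half of the assembly. The following is a minimal sketch of that calculation, not the authors' pipeline; the function name `scaffold_n50` and the toy lengths are hypothetical and chosen only for illustration.

```python
def scaffold_n50(lengths):
    """Return the N50 of a list of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (arbitrary units), not data from the paper.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(scaffold_n50(toy_lengths))
```

In the toy case the total is 290, so the half-way point is 145; the two longest scaffolds (80 + 70 = 150) reach it, which makes the N50 equal to 70.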

MicroRNomics: a newly emerging approach for disease biology

Physiological Genomics ◽

10.1152/physiolgenomics.00034.2008 ◽

2008 ◽

Vol 33 (2) ◽

pp. 139-147 ◽

Cited By ~ 141

Author(s):

Chunxiang Zhang

Keyword(s):

Gene Expression ◽

Genomic Medicine ◽

Noncoding RNAs ◽

Cellular Tissue ◽

Tissue Type ◽

Regulation Of Expression ◽

Protein Coding ◽

Small Noncoding RNAs ◽

Disease Biology ◽

Genomic Scale

Genomic evidence reveals that gene expression in humans is precisely controlled in cellular, tissue-type, temporal, and condition-specific manners. Completely understanding the regulatory mechanisms of gene expression is therefore one of the most important issues in genomic medicine. Surprisingly, recent analyses of the human and animal genomes have demonstrated that the majority of RNA transcripts are relatively small, noncoding RNAs (sncRNAs), rather than large, protein-coding messenger RNAs (mRNAs). Moreover, these sncRNAs may represent a novel important layer of regulation for gene expression. The most important breakthrough in this new area is the discovery of microRNAs (miRNAs). miRNAs comprise a novel class of endogenous, small, noncoding RNAs that negatively regulate gene expression via degradation or translational inhibition of their target mRNAs. As a group, miRNAs may directly regulate ∼30% of the genes in the human genome. In keeping with the nomenclature of RNomics, which is to study sncRNAs on the genomic scale, “microRNomics” is coined here to describe a novel subdiscipline of genomics that studies the identification, expression, biogenesis, structure, regulation of expression, targets, and biological functions of miRNAs on the genomic scale. A growing body of exciting evidence suggests that miRNAs are important regulators of cell differentiation, proliferation/growth, mobility, and apoptosis. These miRNAs therefore play important roles in development and physiology. Consequently, dysregulation of miRNA function may lead to human diseases such as cancer, cardiovascular disease, liver disease, immune dysfunction, and metabolic disorders. microRNomics may be a newly emerging approach for human disease biology.

Annotation of snoRNA abundance across human tissues reveals complex snoRNA-host gene relationships

Genome Biology ◽

10.1186/s13059-021-02391-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Étienne Fafard-Couture ◽

Danny Bergeron ◽

Sonia Couture ◽

Sherif Abou-Elela ◽

Michelle S. Scott

Keyword(s):

Housekeeping Genes ◽

Host Gene ◽

RNA Modification ◽

Human Tissues ◽

RNA Seq ◽

Healthy Human ◽

Protein Coding ◽

Conservation Level ◽

Nucleolar RNAs ◽

Host Genes

Background: Small nucleolar RNAs (snoRNAs) are mid-size non-coding RNAs required for ribosomal RNA modification, implying a ubiquitous tissue distribution linked to ribosome synthesis. However, increasing numbers of studies identify extra-ribosomal roles of snoRNAs in modulating gene expression, suggesting more complex snoRNA abundance patterns. Therefore, there is a great need for mapping the snoRNome in different human tissues as the blueprint for snoRNA functions. Results: We used a low structure bias RNA-Seq approach to accurately quantify snoRNAs and compare them to the entire transcriptome in seven healthy human tissues (breast, ovary, prostate, testis, skeletal muscle, liver, and brain). We identify 475 expressed snoRNAs categorized in two abundance classes that differ significantly in their function, conservation level, and correlation with their host gene: 390 snoRNAs are uniformly expressed and 85 are enriched in the brain or reproductive tissues. Most tissue-enriched snoRNAs are embedded in lncRNAs and display strong correlation of abundance with them, whereas uniformly expressed snoRNAs are mostly embedded in protein-coding host genes and are mainly non- or anticorrelated with them. Fifty-nine percent of the non-correlated or anticorrelated protein-coding host gene/snoRNA pairs feature dual-initiation promoters, compared to only 16% of the correlated non-coding host gene/snoRNA pairs. Conclusions: Our results demonstrate that snoRNAs are not a single homogeneous group of housekeeping genes but include highly regulated tissue-enriched RNAs. Indeed, our work indicates that the architecture of snoRNA host genes varies to uncouple the host and snoRNA expressions in order to meet the different snoRNA abundance levels and functional needs of human tissues.

Unraveling Toxoplasma gondii GT1 Strain Virulence and New Protein-Coding Genes with Proteogenomic Analyses

OMICS A Journal of Integrative Biology ◽

10.1089/omi.2021.0082 ◽

2021 ◽

Author(s):

Neelam Antil ◽

Manish Kumar ◽

Santosh Kumar Behera ◽

Mohammad Arefian ◽

Chinmaya Narayana Kotimoole ◽

...

Keyword(s):

Toxoplasma Gondii ◽

Protein Coding ◽

Protein Coding Genes ◽

New Protein

Long Noncoding RNAs in Plants

Annual Review of Plant Biology ◽

10.1146/annurev-arplant-093020-035446 ◽

2021 ◽

Vol 72 (1) ◽

Author(s):

Andrzej T. Wierzbicki ◽

Todd Blevins ◽

Szymon Swiezewski

Keyword(s):

Gene Expression ◽

Nuclear DNA ◽

Noncoding RNAs ◽

Chromatin Accessibility ◽

Long Noncoding RNAs ◽

Annual Review ◽

Publication Date ◽

Protein Coding ◽

Ribonucleic Acids ◽

Coding Potential

Plants have an extraordinary diversity of transcription machineries, including five nuclear DNA-dependent RNA polymerases. Four of these enzymes are dedicated to the production of long noncoding RNAs (lncRNAs), which are ribonucleic acids with functions independent of their protein-coding potential. lncRNAs display a broad range of lengths and structures, but they are distinct from the small RNA guides of RNA interference (RNAi) pathways. lncRNAs frequently serve as structural, catalytic, or regulatory molecules for gene expression. They can affect all elements of genes, including promoters, untranslated regions, exons, introns, and terminators, controlling gene expression at various levels, including modifying chromatin accessibility, transcription, splicing, and translation. Certain lncRNAs protect genome integrity, while others respond to environmental cues like temperature, drought, nutrients, and pathogens. In this review, we explain the challenge of defining lncRNAs, introduce the machineries responsible for their production, and organize this knowledge by viewing the functions of lncRNAs throughout the structure of a typical plant gene. Expected final online publication date for the Annual Review of Plant Biology, Volume 72 is May 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

EnTAP: Bringing Faster and Smarter Functional Annotation to Non-Model Eukaryotic Transcriptomes

10.1101/307868 ◽

2018 ◽

Cited By ~ 5

Author(s):

Alexander J. Hart ◽

Samuel Ginzburg ◽

Muyang (Sam) Xu ◽

Cera R. Fisher ◽

Nasim Rahmatpour ◽

...

Keyword(s):

Similarity Search ◽

De Novo ◽

Gene Annotation ◽

Enrichment Analysis ◽

Orthologous Gene ◽

Protein Domain ◽

Family Assessment ◽

Ontology Term ◽

Protein Coding ◽

Functional Gene Annotation

EnTAP (Eukaryotic Non-Model Transcriptome Annotation Pipeline) was designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. This software package addresses the fragmentation and related assembly issues that result in inflated transcript estimates and poor annotation rates, while focusing primarily on protein-coding transcripts. Following filters applied through assessment of true expression and frame selection, open-source tools are leveraged to functionally annotate the translated proteins. Downstream features include fast similarity search across three repositories, protein domain assignment, orthologous gene family assessment, and Gene Ontology term assignment. The final annotation integrates across multiple databases and selects an optimal assignment from a combination of weighted metrics describing similarity search score, taxonomic relationship, and informativeness. Researchers have the option to include additional filters to identify and remove contaminants, identify associated pathways, and prepare the transcripts for enrichment analysis. This fully featured pipeline is easy to install and configure, and it runs significantly faster than comparable annotation packages. EnTAP is optimized to generate extensive functional information for the gene space of organisms with limited or poorly characterized genomic resources.
