Incomplete annotation has a disproportionate impact on our understanding of Mendelian and complex neurogenetic disorders

David Zhang; Sebastian Guelfi; Sonia Garcia-Ruiz; Beatrice Costa; Regina H. Reynolds; Karishma D’Sa; Wenfei Liu; Thomas Courtin; Amy Peterson; Andrew E. Jaffe; John Hardy; Juan A. Botía; Leonardo Collado-Torres; Mina Ryten

doi:10.1126/sciadv.aay8299

Science Advances ◽

10.1126/sciadv.aay8299 ◽

2020 ◽

Vol 6 (24) ◽

pp. eaay8299 ◽

Cited By ~ 7

Author(s):

David Zhang ◽

Sebastian Guelfi ◽

Sonia Garcia-Ruiz ◽

Beatrice Costa ◽

Regina H. Reynolds ◽

...

Keyword(s):

Human Gene ◽

Gene Annotation ◽

Tissue Expression ◽

Mendelian Inheritance ◽

Disease Genes ◽

Human Tissues ◽

Sequencing Data ◽

Protein Coding ◽

Neurogenetic Disorders ◽

Different Tissues

Growing evidence suggests that human gene annotation remains incomplete; however, it is unclear how this affects different tissues and our understanding of different disorders. Here, we detect previously unannotated transcription from Genotype-Tissue Expression RNA sequencing data across 41 human tissues. We connect this unannotated transcription to known genes, confirming that human gene annotation remains incomplete, even among well-studied genes including 63% of the Online Mendelian Inheritance in Man–morbid catalog and 317 neurodegeneration-associated genes. We find the greatest abundance of unannotated transcription in brain and genes highly expressed in brain are more likely to be reannotated. We explore examples of reannotated disease genes, such as SNCA, for which we experimentally validate a previously unidentified, brain-specific, potentially protein-coding exon. We release all tissue-specific transcriptomes through vizER: http://rytenlab.com/browser/app/vizER. We anticipate that this resource will facilitate more accurate genetic analysis, with the greatest impact on our understanding of Mendelian and complex neurogenetic disorders.

Incomplete annotation of disease-associated genes is limiting our understanding of Mendelian and complex neurogenetic disorders

10.1101/499103 ◽

2018 ◽

Cited By ~ 1

Author(s):

David Zhang ◽

Sebastian Guelfi ◽

Sonia Garcia Ruiz ◽

Beatrice Costa ◽

Regina H. Reynolds ◽

...

Keyword(s):

Gene Annotation ◽

Brain Regions ◽

Sequencing Data ◽

Protein Coding ◽

Brain Transcriptome ◽

Genomics Research ◽

Neurological Phenotype ◽

Neurogenetic Disorders ◽

Disease Associated Genes ◽

The Brain

There is growing evidence to suggest that human gene annotation remains incomplete, with a disproportionate impact on the brain transcriptome. We used RNA-sequencing data from GTEx to detect novel transcription in an annotation-agnostic manner across 13 human brain regions and 28 human tissues. We found that genes highly expressed in brain are significantly more likely to be re-annotated, as are genes associated with Mendelian and complex neurodegenerative disorders. We improved the annotation of 63% of known OMIM-morbid genes and 65% of those with a neurological phenotype. We determined that novel transcribed regions, particularly those identified in brain, tend to be poorly conserved across mammals but are significantly depleted for genetic variation within humans. As exemplified by SNCA, we explored the implications of re-annotation for Mendelian and complex Parkinson’s disease. We validated in silico and experimentally a novel, brain-specific, potentially protein-coding exon of SNCA. We release our findings as tissue-specific transcriptomes in BED format and via vizER: http://rytenlab.com/browser/app/vizER. Together these resources will facilitate basic genomics research with the greatest impact on neurogenetics.

Nearly all new protein-coding predictions in the CHESS database are not protein-coding

10.1101/360602 ◽

2018 ◽

Cited By ~ 5

Author(s):

Irwin Jungreis ◽

Michael L. Tress ◽

Jonathan Mudge ◽

Cristina Sisu ◽

Toby Hunt ◽

...

Keyword(s):

Mass Spectrometry ◽

False Positive ◽

Human Gene ◽

Noncoding RNAs ◽

Evolutionary Conservation ◽

Protein Domain ◽

Human Tissues ◽

Protein Coding ◽

Gene Annotations ◽

New Protein

In a 2018 paper posted to bioRxiv, Pertea et al. presented the CHESS database, a new catalog of human gene annotations that includes 1,178 new protein-coding predictions. These are based on evidence of transcription in human tissues and homology to earlier annotations in human and other mammals. Here, we reanalyze the evidence used by CHESS, and find that nearly all protein-coding predictions are false positives. We find that 86% overlap transposons marked by RepeatMasker that are known to frequently result in false-positive protein-coding predictions. More than half are homologous to only nine Alu-derived primate sequences corresponding to an erroneous and previously withdrawn Pfam protein domain. The entire set shows poor evolutionary conservation and PhyloCSF protein-coding evolutionary signatures indistinguishable from noncoding RNAs, indicating lack of protein-coding constraint. Only four predictions are supported by mass spectrometry evidence, and even those matches are inconclusive. Overall, the new protein-coding predictions are unsupported by any credible experimental or evolutionary evidence of function, result primarily from homology to genes incorrectly classified as protein-coding, and are unlikely to encode functional proteins.

Adhesion GPCR GPR56 Expression Profiling in Human Tissues

Cells ◽

10.3390/cells10123557 ◽

2021 ◽

Vol 10 (12) ◽

pp. 3557

Author(s):

Fyn Kaiser ◽

Markus Morawski ◽

Knut Krohn ◽

Nada Rayes ◽

Cheng-Chih Hsiao ◽

...

Keyword(s):

Specific Binding ◽

Tissue Expression ◽

Adult Brain ◽

Human Tissues ◽

Sequencing Data ◽

Physiological Processes ◽

Pepsinogen A ◽

The Central Nervous System ◽

Functional Relevance ◽

Adhesion GPCR

Despite the immense functional relevance of GPR56 (gene ADGRG1) in highly diverse (patho)physiological processes such as tumorigenesis, immune regulation, and brain development, little is known about its exact tissue localization. Here, we validated antibodies for GPR56-specific binding using cells with tagged GPR56 or eliminated ADGRG1 in immunotechniques. Using the most suitable antibody, we then established the human GPR56 tissue expression profile. Overall, ADGRG1 RNA-sequencing data of human tissues and GPR56 protein expression correlate very well. In the adult brain especially, microglia are GPR56-positive. Outside the central nervous system, GPR56 is frequently expressed in cuboidal or highly prismatic secreting epithelia. High ADGRG1 mRNA, present in the thyroid, kidney, and placenta is related to elevated GPR56 in thyrocytes, kidney tubules, and the syncytiotrophoblast, respectively. GPR56 often appears in association with secreted proteins such as pepsinogen A in gastric chief cells and insulin in islet β-cells. In summary, GPR56 shows a broad, not cell-type restricted expression in humans.

A limited set of transcriptional programs define major cell types

10.1101/857169 ◽

2019 ◽

Cited By ~ 2

Author(s):

Alessandra Breschi ◽

Manuel Muñoz-Aguirre ◽

Valentin Wucher ◽

Carrie A. Davis ◽

Diego Garrido-Martín ◽

...

Keyword(s):

Human Body ◽

Blood Cells ◽

Cell Types ◽

Human Tissues ◽

Primary Cells ◽

Cellular Composition ◽

Sequencing Data ◽

Morphological Heterogeneity ◽

Different Tissues ◽

Age And Sex

We have produced RNA sequencing data for a number of primary cells from different locations in the human body. The clustering of these primary cells reveals that most cells in the human body share a few broad transcriptional programs, which define five major cell types: epithelial, endothelial, mesenchymal, neural and blood cells. These act as basic components of many tissues and organs. Based on gene expression, these cell types redefine the basic histological types by which tissues have been traditionally classified. We identified genes whose expression is specific to these cell types, and from these genes, we estimated the contribution of the major cell types to the composition of human tissues. We found this cellular composition to be a characteristic signature of tissues, and to reflect tissue morphological heterogeneity and histology. We identified changes in cellular composition in different tissues associated with age and sex and found that departures from the normal cellular composition correlate with histological phenotypes associated with disease.

One Sentence Summary: A few broad transcriptional programs define the major cell types underlying the histology of human tissues and organs.

Chromosome genome assembly and annotation of Artocarpus Nanchuanensis with Nanopore and Hi-C sequencing data

10.22541/au.160648749.91403595/v1 ◽

2020 ◽

Author(s):

Jiaoyu He ◽

Shanfei Bao ◽

Junhang Deng ◽

Qiufu Li ◽

Zhilin Song ◽

...

Keyword(s):

Genome Assembly ◽

Tree Species ◽

Gene Annotation ◽

Sequence Information ◽

Sequencing Data ◽

Endangered Tree ◽

Protein Coding ◽

Sequencing Platform ◽

Endangered Tree Species ◽

Conserved Genes

Artocarpus nanchuanensis (Moraceae) is an evergreen tree that represents the northernmost natural distribution of the genus Artocarpus and is one of the extremely endangered tree species in China. In this study, we obtained a high-quality chromosome-scale genome assembly and annotation for A. nanchuanensis using integrated approaches, including the Illumina and Nanopore sequencing platforms as well as Hi-C. A total of 128.71 gigabases (Gb) of raw Nanopore Sequel reads were generated from 20 kb libraries. After filtering, 123.38 Gb of clean reads were obtained, giving 160.34x coverage depth. The final assembled A. nanchuanensis genome was 769.44 Mb with a contig N50 of 2.09 Mb, and 99.62% (766.50 Mb) of the assembly data was assigned to 28 pseudochromosomes. Gene modelling predicted 41,636 protein-coding genes, of which 95.10% were annotated. The gene annotation completeness was evaluated by BUSCO, and 94.44% of conserved genes could be found in the assembly data. The A. nanchuanensis genome sequence provides an important resource to expand our understanding of the molecular mechanisms underlying its unique biological processes and its nutritional and medicinal benefits.
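The assembly figures quoted in this abstract can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only the numbers stated above; the variable names are illustrative, and treating coverage as clean bases divided by assembly size is an assumption about how the authors derived their figure, not something the abstract confirms.

```python
# Figures quoted in the abstract (Gb = 1e9 bases, Mb = 1e6 bases).
clean_reads_gb = 123.38   # clean reads after filtering
assembly_mb = 769.44      # final assembled genome size
anchored_mb = 766.50      # sequence assigned to pseudochromosomes

# Assumption: coverage depth = clean bases / assembled genome size.
coverage = (clean_reads_gb * 1000) / assembly_mb

# Share of the assembly anchored to pseudochromosomes, in percent.
anchored_pct = 100 * anchored_mb / assembly_mb

print(f"coverage ~ {coverage:.2f}x")
print(f"anchored ~ {anchored_pct:.2f}%")
```

Under that assumption the sketch gives roughly 160.35x and 99.62%, in line with the reported 160.34x and 99.62%; the small difference in coverage is what one would expect from rounding of the inputs.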

lncRNA expression landscape and specificity between brain regions

10.1101/2021.10.29.466410 ◽

2021 ◽

Author(s):

Adewale Joseph Ogunleye ◽

Umair Ali ◽

Michael Juwon Olufemi

Keyword(s):

Tissue Expression ◽

Brain Regions ◽

Cell Type ◽

Sequencing Data ◽

Protein Coding ◽

RNA Molecules ◽

The Subject ◽

mRNA Markers ◽

The Brain

Long noncoding RNAs (lncRNAs) are RNA molecules with low protein-coding potential, which account for over 70% of mammalian transcriptional products. The role of lncRNAs and their expression is still largely unknown and is the subject of recent investigations. Here, we used bulk RNA sequencing data from the Genotype-Tissue Expression (GTEx) project to reveal the occurrence and identify the specificity of lncRNAs in 13 brain regions (1000 samples). We observed that highly specific lncRNAs were co-expressed with previously known mRNA markers for the 13 study regions of the brain. Further investigation revealed that splicing could influence the divergent biogenesis and enrichment of specific lncRNA alleles in different brain regions. Overall, we demonstrate the use of lncRNAs as an independent tool for deconvolving brain regions and further highlight their use for cell-type identification from bulk transcriptome data.

Polymorphic mobile element insertions contribute to gene expression and alternative splicing in human tissues

10.1101/2020.05.23.111310 ◽

2020 ◽

Author(s):

Xiaolong Cao ◽

Yeting Zhang ◽

Lindsay M Payer ◽

Hannah Lords ◽

Jared P Steranka ◽

...

Keyword(s):

Gene Expression ◽

Alternative Splicing ◽

Genome Sequencing ◽

Quantitative Trait ◽

Mobile Element ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Human Tissues ◽

Sequencing Data ◽

Different Tissues

Background: Mobile elements are a major source of human structural variants and some mobile elements can regulate gene expression and alternative splicing. However, the impact of polymorphic mobile element insertions (pMEIs) on gene expression and splicing in diverse human tissues has not been thoroughly studied. The multi-tissue gene expression and whole genome sequencing data generated by the Genotype-Tissue Expression (GTEx) project provide a great opportunity to systematically determine the role of pMEIs in gene expression regulation in human tissues. Results: Using the GTEx whole genome sequencing data, we identified 20,545 high-quality pMEIs from 639 individuals. We then identified pMEI-associated expression quantitative trait loci (eQTLs) and splicing quantitative trait loci (sQTLs) in 48 tissues by joint analysis of variants including pMEIs, single-nucleotide polymorphisms, and insertions/deletions. pMEIs were predicted to be the potential causal variant for 3,522 of the 30,147 significant eQTLs, and 3,717 of the 21,529 significant sQTLs. The pMEI-associated eQTLs and sQTLs show a high level of tissue specificity, and the pMEIs were enriched in the proximity of affected genes and in regulatory elements. Using reporter assays, we confirmed that several pMEIs associated with eQTLs and sQTLs can alter gene expression levels and isoform proportions. Conclusion: Overall, our study shows that pMEIs are associated with thousands of gene expression and splicing variations in different tissues, and pMEIs could have a significant role in regulating tissue-specific gene expression/splicing. Detailed mechanisms for the role of pMEIs in gene regulation in different tissues will be an important direction for future human genomic studies.
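To put the QTL counts in this abstract in proportion, the following minimal sketch converts them into percentages. It uses only the counts quoted above; the variable names are illustrative and are not taken from the study's own code.

```python
# Counts quoted in the abstract.
eqtl_pmei, eqtl_total = 3_522, 30_147   # eQTLs with a pMEI as potential causal variant / all significant eQTLs
sqtl_pmei, sqtl_total = 3_717, 21_529   # sQTLs with a pMEI as potential causal variant / all significant sQTLs

# Fraction of significant QTLs where a pMEI is the predicted potential causal variant.
eqtl_pct = 100 * eqtl_pmei / eqtl_total
sqtl_pct = 100 * sqtl_pmei / sqtl_total

print(f"eQTLs: {eqtl_pct:.1f}%")
print(f"sQTLs: {sqtl_pct:.1f}%")
```

On these counts, pMEIs are the predicted potential causal variant for about 11.7% of significant eQTLs and about 17.3% of significant sQTLs.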

Distant regulatory effects of genetic variation in multiple human tissues

10.1101/074419 ◽

2016 ◽

Cited By ~ 4

Author(s):

Brian Jo ◽

Yuan He ◽

Benjamin J. Strober ◽

Princy Parsana ◽

François Aguet ◽

...

Keyword(s):

Genetic Variation ◽

Complex Traits ◽

Disease Risk ◽

Tissue Expression ◽

Human Tissues ◽

Sequencing Data ◽

Cellular Mechanisms ◽

Regulation Of Expression ◽

Tissue Specific ◽

Project Data

Understanding the genetics of gene regulation provides information on the cellular mechanisms through which genetic variation influences complex traits. Expression quantitative trait loci, or eQTLs, are enriched for polymorphisms that have been found to be associated with disease risk. While most analyses of human data have focused on regulation of expression by nearby variants (cis-eQTLs), distal or trans-eQTLs may have broader effects on the transcriptome and important phenotypic consequences, necessitating a comprehensive study of the effects of genetic variants on distal gene transcription levels. In this work, we identify trans-eQTLs in the Genotype Tissue Expression (GTEx) project data, consisting of 449 individuals with RNA-sequencing data across 44 tissue types. We find 81 genes with a trans-eQTL in at least one tissue, and we demonstrate that trans-eQTLs are more likely than cis-eQTLs to have effects specific to a single tissue. We evaluate the genomic and functional properties of trans-eQTL variants, identifying strong enrichment in enhancer elements and Piwi-interacting RNA clusters. Finally, we describe three tissue-specific regulatory loci underlying relevant disease associations: 9q22 in thyroid that has a role in thyroid cancer, 5q31 in skeletal muscle, and a previously reported master regulator near KLF14 in adipose. These analyses provide a comprehensive characterization of trans-eQTLs across human tissues, which contribute to an improved understanding of the tissue-specific cellular mechanisms of regulatory genetic variation.

Classifying gastric cancer using FLORA reveals clinically relevant molecular subtypes and highlights LINC01614 as a biomarker for patient prognosis

Oncogene ◽

10.1038/s41388-021-01743-3 ◽

2021 ◽

Author(s):

Yiyun Chen ◽

Wing Yin Cheng ◽

Hongyu Shi ◽

Shengshuo Huang ◽

Huarong Chen ◽

...

Keyword(s):

Gastric Cancer ◽

Noncoding RNA ◽

Molecular Subtype ◽

Sequencing Data ◽

Cell Growth And Migration ◽

Protein Coding ◽

Over Expression ◽

And Migration ◽

Patient Prognosis ◽

Subtype 3

Molecular-based classifications of gastric cancer (GC) were recently proposed, but few of them robustly predict clinical outcomes. While mutation and expression signatures of protein-coding genes were used in previous molecular subtyping methods, the noncoding genome in GC remains largely unexplored. Here, we developed the fast long-noncoding RNA analysis (FLORA) method to study RNA sequencing data of GC cases, and prioritized tumor-specific long-noncoding RNAs (lncRNAs) by integrating clinical and multi-omic data. We uncovered 1235 tumor-specific lncRNAs, based on which three subtypes were identified. The lncRNA-based subtype 3 (L3) represented a subgroup of intestinal GC with worse survival, characterized by prevalent TP53 mutations, chromatin instability, hypomethylation, and over-expression of oncogenic lncRNAs. In contrast, the lncRNA-based subtype 1 (L1) has the best survival outcome, while LINC01614 expression further segregated a subgroup of L1 cases with worse survival and increased chance of developing distal metastasis. We demonstrated that LINC01614 over-expression is an independent prognostic factor in L1 and network-based functional prediction implicated its relevance to cell migration. Over-expression and CRISPR-Cas9-guided knockout experiments further validated the functions of LINC01614 in promoting GC cell growth and migration. Altogether, we proposed a lncRNA-based molecular subtype of GC that robustly predicts patient survival and validated LINC01614 as an oncogenic lncRNA that promotes GC proliferation and migration.

Advancing clinical genomics and precision medicine with GVViZ: FAIR bioinformatics platform for variable gene-disease annotation, visualization, and expression analysis

Human Genomics ◽

10.1186/s40246-021-00336-1 ◽

2021 ◽

Vol 15 (1) ◽

Author(s):

Zeeshan Ahmed ◽

Eduard Gibert Renart ◽

Saman Zeeshan ◽

XinQi Dong

Keyword(s):

Data Analysis ◽

Patient Care ◽

Expression Analysis ◽

High Throughput ◽

Gene Annotation ◽

Next Generation Sequencing Data ◽

RNA Seq ◽

Sequencing Data ◽

Complex Disorders ◽

Transcriptomics Data

Background: Genetic disposition is considered critical for identifying subjects at high risk for disease development. Investigating disease-causing and high and low expressed genes can support finding the root causes of uncertainties in patient care. However, independent and timely high-throughput next-generation sequencing data analysis is still a challenge for non-computational biologists and geneticists. Results: In this manuscript, we present a findable, accessible, interactive, and reusable (FAIR) bioinformatics platform, i.e., GVViZ (visualizing genes with disease-causing variants). GVViZ is a user-friendly, cross-platform, and database application for RNA-seq-driven variable and complex gene-disease data annotation and expression analysis with a dynamic heat map visualization. GVViZ has the potential to find patterns across millions of features and extract actionable information, which can support the early detection of complex disorders and the development of new therapies for personalized patient care. The execution of GVViZ is based on a set of simple instructions that users without a computational background can follow to design and perform customized data analysis. It can assimilate patients’ transcriptomics data with the public, proprietary, and our in-house developed gene-disease databases to query, easily explore, and access information on gene annotation and classified disease phenotypes with greater visibility and customization. To test its performance and understand the clinical and scientific impact of GVViZ, we present GVViZ analysis for different chronic diseases and conditions, including Alzheimer’s disease, arthritis, asthma, diabetes mellitus, heart failure, hypertension, obesity, osteoporosis, and multiple cancer disorders. The results are visualized using GVViZ and can be exported as image (PNG/TIFF) and text (CSV) files that include gene names, Ensembl (ENSG) IDs, quantified abundances, expressed transcript lengths, and annotated oncology and non-oncology diseases. Conclusions: We emphasize that automated and interactive visualization should be an indispensable component of modern RNA-seq analysis, which is currently not the case. However, experts in clinics and researchers in life sciences can use GVViZ to visualize and interpret the transcriptomics data, making it a powerful tool to study the dynamics of gene expression and regulation. Furthermore, with successful deployment in clinical settings, GVViZ has the potential to enable high-throughput correlations between patient diagnoses based on clinical and transcriptomics data.
