Thousands of large-scale RNA sequencing experiments yield a comprehensive new human gene list and reveal extensive transcriptional noise

bioRxiv ◽

10.1101/332825 ◽

2018 ◽

Cited By ~ 22

Author(s):

Mihaela Pertea ◽

Alaina Shumate ◽

Geo Pertea ◽

Ales Varabyou ◽

Yu-Chi Chang ◽

...

Keyword(s):

RNA Sequencing ◽

Large Scale ◽

Human Gene ◽

Splice Variants ◽

Gene List ◽

Transcriptional Noise ◽

Protein Coding ◽

Human Genes ◽

Gene Database ◽

Per Gene

Abstract: We assembled the sequences from 9,795 RNA sequencing experiments, collected from 31 human tissues and hundreds of subjects as part of the GTEx project, to create a new, comprehensive catalog of human genes and transcripts. The new human gene database contains 43,162 genes, of which 21,306 are protein-coding and 21,856 are noncoding, and a total of 323,824 transcripts, for an average of 7.5 transcripts per gene. Our expanded gene list includes 4,998 novel genes (1,178 coding and 3,819 noncoding) and 97,511 novel splice variants of protein-coding genes as compared to the most recent human gene catalogs. We detected over 30 million additional transcripts at more than 650,000 sites, nearly all of which are likely to be nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells.
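The catalog totals quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are invented here, and the numbers are copied from the abstract.

```python
# Counts quoted in the abstract (variable names are illustrative).
protein_coding = 21_306
noncoding = 21_856
total_genes = 43_162
total_transcripts = 323_824

# Coding plus noncoding genes should give the catalog total.
genes_sum = protein_coding + noncoding

# Average number of transcripts per gene, rounded to one decimal place.
transcripts_per_gene = round(total_transcripts / total_genes, 1)

print(genes_sum == total_genes, transcripts_per_gene)
```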


Faculty Opinions recommendation of CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.734505663.793563286 ◽

2019 ◽

Author(s):

Julie Thompson

Keyword(s):

RNA Sequencing ◽

Large Scale ◽

Human Gene ◽

Transcriptional Noise ◽

Gene Catalog


CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise

Genome Biology ◽

10.1186/s13059-018-1590-2 ◽

2018 ◽

Vol 19 (1) ◽

Cited By ~ 74

Author(s):

Mihaela Pertea ◽

Alaina Shumate ◽

Geo Pertea ◽

Ales Varabyou ◽

Florian P. Breitwieser ◽

...

Keyword(s):

RNA Sequencing ◽

Large Scale ◽

Human Gene ◽

Transcriptional Noise ◽

Gene Catalog


Loss of critical developmental and human disease-causing genes in 58 mammals

10.1101/819169 ◽

2019 ◽

Author(s):

Yatish Turakhia ◽

Heidi I. Chen ◽

Amir Marcovitz ◽

Gill Bejerano

Keyword(s):

Evolutionary Biology ◽

Large Scale ◽

Gene Annotation ◽

Synonymous Substitution ◽

Specific Gene ◽

High Confidence ◽

Protein Coding ◽

Congenital Diseases ◽

Manual Curation ◽

Human Genes

Gene losses provide an insightful route for studying the morphological and physiological adaptations of species, but their discovery is challenging. Existing genome annotation tools and protein databases focus on annotating intact genes and do not attempt to distinguish nonfunctional genes from genes missing annotation due to sequencing and assembly artifacts. Previous attempts to annotate gene losses have required significant manual curation, which hampers their scalability for the ever-increasing deluge of newly sequenced genomes. Using extreme sequence erosion (deletion and non-synonymous substitution) as an unambiguous signature of loss, we developed an automated approach for detecting high-confidence protein-coding gene loss events across a species tree. Our approach relies solely on gene annotation in a single reference genome, raw assemblies for the remaining species to analyze, and the associated phylogenetic tree for all organisms involved. Using the hg38 human assembly as a reference, we discovered over 500 unique human genes affected by such high-confidence erosion events in different clades across 58 mammals. While most of these events likely have benign consequences, we also found dozens of clade-specific gene losses that result in early lethality in outgroup mammals or are associated with severe congenital diseases in humans. Our discoveries yield intriguing potential for translational medical genetics and for evolutionary biology, and our approach is readily applicable to large-scale genome sequencing efforts across the tree of life.


Mapping the Human Herpesvirus 6B Transcriptome

Journal of Virology ◽

10.1128/jvi.01335-20 ◽

2021 ◽

Vol 95 (10) ◽

Author(s):

Annie Gravel ◽

Wes Sanders ◽

Éric Fournier ◽

Arnaud Droit ◽

Nathaniel Moorman ◽

...

Keyword(s):

Viral Infection ◽

RNA Sequencing ◽

Large Scale ◽

Splice Variants ◽

Human Herpesvirus ◽

Productive Infection ◽

RNA Seq ◽

Time Points ◽

RNA Transcripts ◽

Kinetic Class

ABSTRACT The “omics” revolution of recent years has simplified the study of RNA transcripts produced during viral infection and under specific defined conditions. In the quest to find new and differentially expressed transcripts during the course of human herpesvirus 6B (HHV-6B) infection, we made use of large-scale RNA sequencing to analyze the HHV-6B transcriptome during productive infection of human Molt-3 T cells. Analyses were performed at different time points following infection, and specific inhibitors were used to classify the kinetic class of each open reading frame (ORF) reported in the annotated genome of the HHV-6B Z29 strain. The initial search focused on HHV-6B-specific reads matching new HHV-6B transcripts. Differential expression of new HHV-6B transcripts was observed in all samples analyzed. The presence of many of these new HHV-6B transcripts was confirmed by reverse transcriptase PCR and Sanger sequencing. Many of these transcripts represented new splice variants of previously reported open reading frames (ORFs), including some transcripts that have yet to be defined. Overall, our work demonstrates the diversity and the complexity of the HHV-6B transcriptome.

IMPORTANCE RNA sequencing (RNA-seq) is an important tool for studying RNA transcripts, particularly during active viral infection. We made use of RNA-seq to study human herpesvirus 6B (HHV-6B) infection. Using six different time points, we were able to identify the presence of differentially spliced genes at 6, 9, 12, 24, 48, and 72 h postinfection. Determination of the RNA profiles in the presence of cycloheximide (CHX) or phosphonoacetic acid (PAA) also permitted identification of the kinetic class of each ORF described in the annotated GenBank file. We also identified new spliced transcripts for certain genes and evaluated their relative expression over time. These data and next-generation sequencing (NGS) of the viral DNA have led us to propose a new version of the HHV-6B Z29 GenBank annotated file, without changing ORF names, to facilitate trace-back and correlate our work with previous studies on HHV-6B.
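The inhibitor-based kinetic classification mentioned in this abstract can be sketched as a simple decision rule. The sketch assumes the conventional herpesvirus logic rather than the authors' actual pipeline: transcripts still made when protein synthesis is blocked by CHX are immediate-early, transcripts made when viral DNA replication is blocked by PAA (but not under CHX) are early, and the rest are late. The function name, the boolean "detected" inputs, and the ORF labels are hypothetical; a real analysis would work from read counts and a detection threshold.

```python
def kinetic_class(expressed_with_chx: bool, expressed_with_paa: bool) -> str:
    """Assign a kinetic class from inhibitor sensitivity (simplified rule)."""
    if expressed_with_chx:
        # Transcribed without new protein synthesis.
        return "immediate-early"
    if expressed_with_paa:
        # Needs viral proteins but not viral DNA replication.
        return "early"
    # Depends on viral DNA replication.
    return "late"


# Hypothetical ORF labels with made-up detection calls (CHX, PAA).
calls = {
    "ORF_A": (True, True),
    "ORF_B": (False, True),
    "ORF_C": (False, False),
}
classes = {orf: kinetic_class(chx, paa) for orf, (chx, paa) in calls.items()}
print(classes)
```

Reducing each ORF to two yes/no calls keeps the rule readable; it deliberately ignores partial inhibition, which real expression data would show.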


HEDGEHOG/GLI Modulates the PRR11-SKA2 Bidirectional Transcription Unit in Lung Squamous Cell Carcinomas

Genes ◽

10.3390/genes12010120 ◽

2021 ◽

Vol 12 (1) ◽

pp. 120

Author(s):

Yiyun Sun ◽

Dandan Xu ◽

Chundong Zhang ◽

Yitao Wang ◽

Lian Zhang ◽

...

Keyword(s):

RNA Sequencing ◽

Squamous Cell ◽

Gene Pair ◽

Gene List ◽

Ectopic Expression ◽

Lung Squamous Cell Carcinoma ◽

Gene Set Enrichment Analysis ◽

The Cancer Genome Atlas ◽

Sequencing Data ◽

Gene Set

We previously demonstrated that proline-rich protein 11 (PRR11) and spindle and kinetochore associated 2 (SKA2) constituted a head-to-head gene pair driven by a prototypical bidirectional promoter. This gene pair synergistically promoted the development of non-small cell lung cancer. However, the signaling pathways leading to the ectopic expression of this gene pair remain obscure. In the present study, we first analyzed the lung squamous cell carcinoma (LSCC) relevant RNA sequencing data from The Cancer Genome Atlas (TCGA) database using the correlation analysis of gene expression and gene set enrichment analysis (GSEA), which revealed that the PRR11-SKA2 correlated gene list highly resembled the Hedgehog (Hh) pathway activation-related gene set. Subsequently, GLI1/2 inhibitor GANT-61 or GLI1/2-siRNA inhibited the Hh pathway of LSCC cells, concomitantly decreasing the expression levels of PRR11 and SKA2. Furthermore, the mRNA expression profile of LSCC cells treated with GANT-61 was detected using RNA sequencing, displaying 397 differentially expressed genes (203 upregulated genes and 194 downregulated genes). Among them, one gene set, including BIRC5, NCAPG, CCNB2, and BUB1, was involved in cell division and interacted with both PRR11 and SKA2. These genes were verified as downregulated via RT-PCR, and their high expression significantly correlated with shorter overall survival of LSCC patients. Taken together, our results indicate that GLI1/2 mediates the expression of the PRR11-SKA2-centric gene set that serves as an unfavorable prognostic indicator for LSCC patients, pointing to new combinatorial diagnostic and therapeutic strategies in LSCC.


Mutational patterns and clonal evolution from diagnosis to relapse in pediatric acute lymphoblastic leukemia

Scientific Reports ◽

10.1038/s41598-021-95109-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Shumaila Sayyab ◽

Anders Lundmark ◽

Malin Larsson ◽

Markus Ringnér ◽

Sara Nystedt ◽

...

Keyword(s):

Acute Lymphoblastic Leukemia ◽

Large Scale ◽

Somatic Mutations ◽

Lymphoblastic Leukemia ◽

Clonal Evolution ◽

Point Mutations ◽

Driver Genes ◽

Protein Coding ◽

Pediatric Acute Lymphoblastic Leukemia ◽

Evolutionary Trajectories

Abstract: The mechanisms driving clonal heterogeneity and evolution in relapsed pediatric acute lymphoblastic leukemia (ALL) are not fully understood. We performed whole genome sequencing of samples collected at diagnosis, relapse(s) and remission from 29 Nordic patients. Somatic point mutations and large-scale structural variants were called using individually matched remission samples as controls, and allelic expression of the mutations was assessed in ALL cells using RNA-sequencing. We observed an increased burden of somatic mutations at relapse, compared to diagnosis, and at second relapse compared to first relapse. In addition to 29 known ALL driver genes, of which nine genes carried recurrent protein-coding mutations in our sample set, we identified putative non-protein coding mutations in regulatory regions of seven additional genes that have not previously been described in ALL. Cluster analysis of hundreds of somatic mutations per sample revealed three distinct evolutionary trajectories during ALL progression from diagnosis to relapse. The evolutionary trajectories provide insight into the mutational mechanisms leading to relapse in ALL and could offer biomarkers for improved risk prediction in individual patients.


Conserved long-range base pairings are associated with pre-mRNA processing of human genes

Nature Communications ◽

10.1038/s41467-021-22549-7 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Svetlana Kalmykova ◽

Marina Kalinina ◽

Stepan Denisov ◽

Alexey Mironov ◽

Dmitry Skvortsov ◽

...

Keyword(s):

Long Range ◽

RNA Folding ◽

Current Knowledge ◽

RNA Structures ◽

Base Pairs ◽

Protein Coding ◽

Proximity Ligation ◽

Transcriptional Suppression ◽

Human Genes ◽

Cleavage And Polyadenylation

Abstract: The ability of nucleic acids to form double-stranded structures is essential for all living systems on Earth. Current knowledge on functional RNA structures is focused on locally-occurring base pairs. However, crosslinking and proximity ligation experiments demonstrated that long-range RNA structures are highly abundant. Here, we present the most complete catalog to date of conserved complementary regions (PCCRs) in human protein-coding genes. PCCRs tend to occur within introns, suppress intervening exons, and obstruct cryptic and inactive splice sites. Double-stranded structure of PCCRs is supported by decreased icSHAPE nucleotide accessibility, high abundance of RNA editing sites, and frequent occurrence of forked eCLIP peaks. Introns with PCCRs show a distinct splicing pattern in response to RNAPII slowdown, suggesting that splicing is widely affected by co-transcriptional RNA folding. The enrichment of 3’-ends within PCCRs raises the intriguing hypothesis that coupling between RNA folding and splicing could mediate co-transcriptional suppression of premature pre-mRNA cleavage and polyadenylation.
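The idea of two complementary regions that can base-pair over a long distance can be illustrated with a toy check. This is a minimal sketch, not the method used in the paper: it tests strict Watson-Crick complementarity only, ignores G-U wobble pairs and sequence conservation, and the function names and example sequences are invented here.

```python
# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def can_pair(left: str, right: str) -> bool:
    """True if `right` is the exact reverse complement of `left`."""
    return reverse_complement(left) == right.upper()


# Two made-up regions that could form a perfect duplex.
print(reverse_complement("GGCAU"), can_pair("GGCAU", "AUGCC"))
```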


Disrupting upstream translation in mRNAs is associated with human disease

Nature Communications ◽

10.1038/s41467-021-21812-1 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

David S. M. Lee ◽

Joseph Park ◽

Andrew Kromer ◽

Aris Baras ◽

Daniel J. Rader ◽

...

Keyword(s):

Protein Expression ◽

Biological Significance ◽

Ribosome Profiling ◽

Open Reading Frames ◽

Protein Coding ◽

Stop Codons ◽

Human Genes ◽

Strong Negative Selection ◽

Disease Associations ◽

Reading Frames

Abstract: Ribosome profiling has uncovered pervasive translation in non-canonical open reading frames; however, the biological significance of this phenomenon remains unclear. Using genetic variation from 71,702 human genomes, we assess patterns of selection in translated upstream open reading frames (uORFs) in 5’UTRs. We show that uORF variants introducing new stop codons, or strengthening existing stop codons, are under strong negative selection comparable to protein-coding missense variants. Using these variants, we map and validate gene-disease associations in two independent biobanks containing exome sequencing from 10,900 and 32,268 individuals, respectively, and elucidate their impact on protein expression in human cells. Our results suggest translation-disrupting mechanisms relating uORF variation to reduced protein expression, and demonstrate that translation at uORFs is genetically constrained in 50% of human genes.
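To make the notion of a uORF and a stop-introducing variant concrete, the sketch below scans a 5'UTR sequence for ATG-initiated reading frames that terminate inside the UTR, and checks whether a codon change creates a stop. It is an illustration under simplifying assumptions, not the study's method: the paper works with uORFs supported by ribosome profiling, whereas this code uses sequence alone, covers only newly gained stops (not the strengthening of existing stops), and `find_uorfs` and `gains_stop` are hypothetical helper names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr5: str) -> list[tuple[int, int]]:
    """Return (start, end) indices of ATG-initiated ORFs ending within the 5'UTR."""
    seq = utr5.upper()
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))  # end index is exclusive
                break
    return uorfs


def gains_stop(ref_codon: str, alt_codon: str) -> bool:
    """True if a variant turns a sense codon into a stop codon."""
    return ref_codon.upper() not in STOP_CODONS and alt_codon.upper() in STOP_CODONS


print(find_uorfs("CCATGAAATAGGG"), gains_stop("TGG", "TAG"))
```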


Online database for brain cancer-implicated genes: exploring the subtype-specific mechanisms of brain cancer

BMC Genomics ◽

10.1186/s12864-021-07793-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Min Zhao ◽

Yining Liu ◽

Guiqiong Ding ◽

Dacheng Qu ◽

Hong Qu

Keyword(s):

Brain Cancer ◽

Differential Expression Analysis ◽

Survival Rates ◽

Brain Regions ◽

Solid Base ◽

Cancer Subtypes ◽

Human Genes ◽

Subtype Identification ◽

Genetic Mechanisms ◽

Gene Database

Abstract

Background: Brain cancer is one of the eight most common cancers occurring in people aged 40+ and is the fifth-leading cause of cancer-related deaths for males aged 40–59. Accurate subtype identification is crucial for precise therapeutic treatment, which largely depends on understanding the biological pathways and regulatory mechanisms associated with different brain cancer subtypes. Unfortunately, the subtype-implicated genes that have been identified are scattered in thousands of published studies. So, systematic literature curation and cross-validation could provide a solid base for comparative genetic studies about major subtypes.

Results: Here, we constructed a literature-based brain cancer gene database (BCGene). In the current release, we have a collection of 1421 unique human genes gathered through an extensive manual examination of over 6000 PubMed abstracts. We comprehensively annotated those curated genes to facilitate biological pathway identification, cancer genomic comparison, and differential expression analysis in various anatomical brain regions. By curating cancer subtypes from the literature, our database provides a basis for exploring the common and unique genetic mechanisms among 40 brain cancer subtypes. By further prioritizing the relative importance of those curated genes in the development of brain cancer, we identified 33 top-ranked genes with evidence mentioned only once in the literature, which were significantly associated with survival rates in a combined dataset of 2997 brain cancer cases.

Conclusion: BCGene provides a useful tool for exploring the genetic mechanisms of and gene priorities in brain cancer. BCGene is freely available to academic users at http://soft.bioinfo-minzhao.org/bcgene/.


Genome-wide Analysis of Alternative Pre-mRNA Splicing

Journal of Biological Chemistry ◽

10.1074/jbc.r700033200 ◽

2007 ◽

Vol 283 (3) ◽

pp. 1229-1233 ◽

Cited By ~ 80

Author(s):

Claudia Ben-Dov ◽

Britta Hartmann ◽

Josefin Lundgren ◽

Juan Valcárcel

Keyword(s):

Alternative Splicing ◽

Large Scale ◽

mRNA Splicing ◽

Diagnostic Tools ◽

Primary Transcript ◽

Genome Wide ◽

Human Genes ◽

Multicellular Organisms ◽

Eukaryotic Genomes ◽

Key Questions

Alternative splicing of mRNA precursors allows the synthesis of multiple mRNAs from a single primary transcript, significantly expanding the information content and regulatory possibilities of higher eukaryotic genomes. High-throughput enabling technologies, particularly large-scale sequencing and splicing-sensitive microarrays, are providing unprecedented opportunities to address key questions in this field. The picture emerging from these pioneering studies is that alternative splicing affects most human genes and a significant fraction of the genes in other multicellular organisms, with the potential to greatly influence the evolution of complex genomes. A combinatorial code of regulatory signals and factors can deploy physiologically coherent programs of alternative splicing that are distinct from those regulated at other steps of gene expression. Pre-mRNA splicing and its regulation play important roles in human pathologies, and genome-wide analyses in this area are paving the way for improved diagnostic tools and for the identification of novel and more specific pharmaceutical targets.
