The case for using Mapped Exonic Non-Duplicate (MEND) read counts in RNA-Seq experiments: examples from pediatric cancer datasets

10.1101/716829 ◽

2019 ◽

Author(s):

Holly C. Beale ◽

Jacquelyn M. Roger ◽

Matthew A. Cattle ◽

Liam T. McKay ◽

Drew K. A. Thompson ◽

...

Keyword(s):

Gene Expression ◽

Pearson Correlation ◽

Rna Seq ◽

Technical Errors ◽

High Gene Expression ◽

Custom Script ◽

Sensitivity Studies ◽

High Gene ◽

Gene Expression Quantification ◽

Unmapped Reads

Abstract Background The accuracy of gene expression as measured by RNA sequencing (RNA-Seq) is dependent on the amount of sequencing performed. However, some types of reads are not informative for determining this accuracy. Unmapped and non-exonic reads do not contribute to gene expression quantification. Duplicate reads can be the product of high gene expression or technical errors. Findings We surveyed bulk RNA-Seq datasets from 2179 tumors in 48 cohorts to determine the fractions of uninformative reads. Total sequence depth was 0.2-668 million reads (median (med.) 61 million; interquartile range (IQR) 53 million). Unmapped reads constitute 1-77% of all reads (med. 3%; IQR 3%); duplicate reads constitute 3-100% of mapped reads (med. 27%; IQR 30%); and non-exonic reads constitute 4-97% of mapped, non-duplicate reads (med. 25%; IQR 21%). Informative reads, that is, Mapped, Exonic, Non-duplicate (MEND) reads, constitute 0-79% of total reads (med. 50%; IQR 31%). Further, we find that MEND read counts have a 0.22 Pearson correlation to the number of genes expressed above 1 Transcript Per Million, while total reads have a correlation of −0.05. Conclusions Since the fraction of uninformative reads varies, we propose using only definitively informative reads, MEND reads, for the purposes of asserting the accuracy of gene expression measured in a bulk RNA-Seq experiment. We provide a Docker image containing 1) the existing required tools (RSeQC, sambamba and samblaster) and 2) a custom script. We recommend that all results, sensitivity studies and depth recommendations use MEND units.
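The nested percentages in this abstract (unmapped as a share of all reads, duplicates as a share of mapped reads, non-exonic as a share of mapped non-duplicate reads) compose multiplicatively for any single sample. The Python sketch below illustrates that arithmetic only. The function names are hypothetical, it is not the authors' custom script, and the example inputs borrow the reported medians purely as plausible values for one imaginary sample. Because medians of separate distributions do not compose, the result need not equal the reported median MEND fraction of 50%.

```python
def mend_fraction(unmapped_of_total, duplicate_of_mapped, nonexonic_of_mapped_nondup):
    """Fraction of total reads that are Mapped, Exonic and Non-Duplicate.

    Each argument is a fraction in [0, 1], using the same nested
    denominators as the abstract (assumed here, not taken from the script).
    """
    for f in (unmapped_of_total, duplicate_of_mapped, nonexonic_of_mapped_nondup):
        if not 0.0 <= f <= 1.0:
            raise ValueError("fractions must lie in [0, 1]")
    mapped = 1.0 - unmapped_of_total                      # share of all reads
    mapped_nondup = mapped * (1.0 - duplicate_of_mapped)  # share of all reads
    return mapped_nondup * (1.0 - nonexonic_of_mapped_nondup)


def mend_reads(total_reads, unmapped_of_total, duplicate_of_mapped,
               nonexonic_of_mapped_nondup):
    """Estimated number of MEND reads for one sample, rounded to an integer."""
    fraction = mend_fraction(unmapped_of_total, duplicate_of_mapped,
                             nonexonic_of_mapped_nondup)
    return round(total_reads * fraction)


if __name__ == "__main__":
    # Hypothetical sample: 61 million reads, 3% unmapped, 27% duplicates
    # among mapped reads, 25% non-exonic among mapped non-duplicates.
    frac = mend_fraction(0.03, 0.27, 0.25)
    print(f"MEND fraction: {frac:.3f}")
    print(f"MEND reads:    {mend_reads(61_000_000, 0.03, 0.27, 0.25):,}")
```

For this imaginary sample, a nominal depth of 61 million reads corresponds to roughly half that many MEND reads, which is the gap the authors argue should be reported.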

The case for using mapped exonic non-duplicate reads when reporting RNA-sequencing depth: examples from pediatric cancer datasets

GigaScience ◽

10.1093/gigascience/giab011 ◽

2021 ◽

Vol 10 (3) ◽

Cited By ~ 1

Author(s):

Holly C Beale ◽

Jacquelyn M Roger ◽

Matthew A Cattle ◽

Liam T McKay ◽

Drew K A Thompson ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Expression Analysis ◽

Gene Expression Analysis ◽

Sequencing Depth ◽

Rna Seq ◽

Custom Script ◽

Sensitivity Studies ◽

Data Files ◽

Gene Expression Quantification

Abstract Background The reproducibility of gene expression measured by RNA sequencing (RNA-Seq) is dependent on the sequencing depth. While unmapped or non-exonic reads do not contribute to gene expression quantification, duplicate reads contribute to the quantification but are not informative for reproducibility. We show that mapped, exonic, non-duplicate (MEND) reads are a useful measure of reproducibility of RNA-Seq datasets used for gene expression analysis. Findings In bulk RNA-Seq datasets from 2,179 tumors in 48 cohorts, the fraction of reads that contribute to the reproducibility of gene expression analysis varies greatly. Unmapped reads constitute 1–77% of all reads (median [IQR], 3% [3–6%]); duplicate reads constitute 3–100% of mapped reads (median [IQR], 27% [13–43%]); and non-exonic reads constitute 4–97% of mapped, non-duplicate reads (median [IQR], 25% [16–37%]). MEND reads constitute 0–79% of total reads (median [IQR], 50% [30–61%]). Conclusions Because not all reads in an RNA-Seq dataset are informative for reproducibility of gene expression measurements and the fraction of reads that are informative varies, we propose reporting a dataset's sequencing depth in MEND reads, which definitively inform the reproducibility of gene expression, rather than total, mapped, or exonic reads. We provide a Docker image containing (i) the existing required tools (RSeQC, sambamba, and samblaster) and (ii) a custom script to calculate MEND reads from RNA-Seq data files. We recommend that all RNA-Seq gene expression experiments, sensitivity studies, and depth recommendations use MEND units for sequencing depth.
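A minimal Python sketch of how per-read flags might be tallied into the categories this abstract reports. It assumes each read already carries three booleans (mapped, duplicate, exonic); in practice those would come from alignment and annotation tools such as the ones named in the abstract, which this sketch does not call. The `Read` type and function names are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Read(NamedTuple):
    mapped: bool
    duplicate: bool
    exonic: bool


def classify(read):
    """Assign a read to the first category that excludes it from MEND."""
    if not read.mapped:
        return "unmapped"
    if read.duplicate:
        return "duplicate"
    if not read.exonic:
        return "non_exonic"
    return "MEND"


def summarize(reads):
    """Return category fractions using the abstract's nested denominators."""
    counts = Counter(classify(r) for r in reads)
    total = sum(counts.values())
    mapped = total - counts["unmapped"]
    mapped_nondup = mapped - counts["duplicate"]

    def frac(numerator, denominator):
        # Avoid division by zero for empty categories.
        return numerator / denominator if denominator else float("nan")

    return {
        "unmapped_of_total": frac(counts["unmapped"], total),
        "duplicate_of_mapped": frac(counts["duplicate"], mapped),
        "non_exonic_of_mapped_nondup": frac(counts["non_exonic"], mapped_nondup),
        "mend_of_total": frac(counts["MEND"], total),
        "mend_reads": counts["MEND"],
    }


if __name__ == "__main__":
    # Toy input: 1 unmapped, 2 duplicates, 3 non-exonic, 4 MEND reads.
    toy = ([Read(False, False, False)] * 1 + [Read(True, True, True)] * 2
           + [Read(True, False, False)] * 3 + [Read(True, False, True)] * 4)
    for key, value in summarize(toy).items():
        print(key, value)
```

The order of the checks mirrors the abstract: duplicates are counted only among mapped reads, and non-exonic reads only among mapped, non-duplicate reads.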

Retained Heterodisomy Is Associated with High Gene Expression in Hyperhaploid Inflammatory Leiomyosarcoma

Neoplasia ◽

10.1593/neo.12930 ◽

2012 ◽

Vol 14 (9) ◽

pp. 807-IN5 ◽

Cited By ~ 15

Author(s):

Karolin H. Nord ◽

Kajsa Paulsson ◽

Srinivas Veerla ◽

Johan Wejde ◽

Otte Brosjö ◽

...

Keyword(s):

Gene Expression ◽

High Gene Expression ◽

High Gene

High Gene Expression of CXCL8 Is Associated with the Presence of Extraintestinal Manifestations and Long-term Disease in Patients with Ulcerative Colitis

Inflammatory Bowel Diseases ◽

10.1002/ibd.22857 ◽

2013 ◽

Vol 19 (2) ◽

pp. E22-E23 ◽

Cited By ~ 4

Author(s):

Gabriela Fonseca-Camarillo ◽

Jesús K. Yamamoto-Furusho

Keyword(s):

Gene Expression ◽

Ulcerative Colitis ◽

Extraintestinal Manifestations ◽

High Gene Expression ◽

High Gene

Read trimming is not required for mapping and quantification of RNA-seq reads at the gene level

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa068 ◽

2020 ◽

Vol 2 (3) ◽

Author(s):

Yang Liao ◽

Wei Shi

Keyword(s):

Data Analysis ◽

Pearson Correlation ◽

Rna Seq ◽

Genome Wide ◽

Gene Level ◽

Sequencing Quality ◽

Total Data ◽

Order Of Magnitude ◽

Gene Expression Quantification ◽

The Impact

Abstract RNA sequencing (RNA-seq) is currently the standard method for genome-wide expression profiling. RNA-seq reads often need to be mapped to a reference genome before read counts can be produced for genes. Read trimming methods have been developed to assist read mapping by removing adapter sequences and low-sequencing-quality bases. It is, however, unclear what impact read trimming has on the quantification of RNA-seq data, an important task in RNA-seq data analysis. In this study, we used a benchmark RNA-seq dataset and simulation data to assess the impact of read trimming on mapping and quantification of RNA-seq reads. We found that adapter sequences can be effectively removed by the read aligner via 'soft-clipping' and that many low-sequencing-quality bases, which would be removed by read trimming tools, were rescued by the aligner. Accuracy of gene expression quantification from using untrimmed reads was found to be comparable to or slightly better than that from using trimmed reads, based on Pearson correlation with reverse transcriptase-polymerase chain reaction data and simulation truth. Total data analysis time was reduced by up to an order of magnitude when read trimming was not performed. Our study suggests that read trimming is a redundant process in the quantification of RNA-seq expression data.
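The accuracy comparison described in this abstract rests on Pearson correlation between quantified expression and a reference measurement. The Python sketch below shows one way such a comparison could be set up. The numbers are invented toy values, not data from the study, and any transformation the authors applied before correlating (for example a log scale) is not stated in the abstract and is not assumed here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-gene values: a reference assay and two pipelines.
    reference = [1.0, 2.0, 3.0, 4.0, 5.0]
    untrimmed = [1.1, 1.9, 3.2, 3.9, 5.1]
    trimmed = [1.3, 1.7, 3.4, 3.6, 5.2]
    print("untrimmed vs reference:", round(pearson(reference, untrimmed), 4))
    print("trimmed vs reference:  ", round(pearson(reference, trimmed), 4))
```

Whichever pipeline yields the higher coefficient agrees more closely, in the linear sense, with the reference; the toy values say nothing about which one that is for real data.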

Molecular engineering of the salicylate-inducible transcription factor Sal7AR for orthogonal and high gene expression in Escherichia coli

PLoS ONE ◽

10.1371/journal.pone.0194090 ◽

2018 ◽

Vol 13 (4) ◽

pp. e0194090 ◽

Cited By ~ 4

Author(s):

Kentaro Miyazaki

Keyword(s):

Gene Expression ◽

Escherichia Coli ◽

Transcription Factor ◽

Molecular Engineering ◽

Expression In Escherichia Coli ◽

High Gene Expression ◽

High Gene

Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments

Molecular Ecology ◽

10.1111/mec.12014 ◽

2012 ◽

Vol 22 (3) ◽

pp. 620-634 ◽

Cited By ~ 167

Author(s):

Nagarjun Vijay ◽

Jelmer W. Poelstra ◽

Axel Künstner ◽

Jochen B. W. Wolf

Keyword(s):

Gene Expression ◽

Differential Gene Expression ◽

Transcriptome Assembly ◽

Rna Seq ◽

Gene Expression Quantification ◽

Differential Gene ◽

Expression Quantification ◽

Challenges And Strategies

The effect of human genome annotation complexity on RNA-Seq gene expression quantification

2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops ◽

10.1109/bibmw.2012.6470224 ◽

2012 ◽

Cited By ~ 3

Author(s):

Po-Yen Wu ◽

John H. Phan ◽

May D. Wang

Keyword(s):

Gene Expression ◽

Human Genome ◽

Genome Annotation ◽

Rna Seq ◽

Gene Expression Quantification ◽

Expression Quantification

Impact of Gene Annotation Choice on the Quantification of RNA-Seq Data

10.21203/rs.3.rs-421080/v1 ◽

2021 ◽

Author(s):

David Chisanga ◽

Yang Liao ◽

Wei Shi

Keyword(s):

Gene Expression ◽

Gene Annotation ◽

Expression Data ◽

Refseq Gene ◽

Rna Seq ◽

Sequencing Data ◽

Microarray Expression Data ◽

Sequencing Quality ◽

Gene Expression Quantification ◽

Expression Quantification

Abstract Background: RNA sequencing is currently the method of choice for genome-wide profiling of gene expression. A popular approach to quantify expression levels of genes from RNA-seq data is to map reads to a reference genome and then count mapped reads to each gene. Gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. There are several major sources of gene annotations that can be used for quantification, such as Ensembl and RefSeq databases. However, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an RNA-seq analysis. Results: In this paper, we present results from our comparison of Ensembl and RefSeq human annotations on their impact on gene expression quantification using a benchmark RNA-seq dataset generated by the SEquencing Quality Control (SEQC) consortium. We show that the use of RefSeq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known titration ratios of gene expression and microarray expression data. We also found that the recent expansion of the RefSeq annotation has led to a decrease in its annotation accuracy. Finally, we demonstrated that the RNA-seq quantification differences observed between different annotations were not affected by the use of different normalization methods. Conclusion: In conclusion, our study found that the use of the conservative RefSeq gene annotation yields better RNA-seq quantification results than the more comprehensive Ensembl annotation. We also found that, surprisingly, the recent expansion of the RefSeq database, which was primarily driven by the incorporation of sequencing data into the gene annotation process, resulted in a reduction in the accuracy of RNA-seq quantification.
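The quantification approach this abstract describes (map reads, then count the mapped reads that fall in each gene's exons) can be sketched in a few lines of Python. This is a deliberately simplified illustration: coordinates are closed intervals, strand and splicing are ignored, and a read overlapping exons of more than one gene is left unassigned, which is one common convention but an assumption here. The two toy annotations are invented and do not reproduce RefSeq or Ensembl content.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def count_reads_per_gene(reads, annotation):
    """Count reads per gene.

    reads:      iterable of (chrom, start, end) alignments
    annotation: dict mapping gene name -> list of (chrom, start, end) exons
    Returns (counts, unassigned), where unassigned covers reads that hit
    no exon or hit exons of more than one gene.
    """
    counts = {gene: 0 for gene in annotation}
    unassigned = 0
    for chrom, start, end in reads:
        hits = {
            gene
            for gene, exons in annotation.items()
            if any(c == chrom and overlaps(start, end, s, e) for c, s, e in exons)
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
        else:
            unassigned += 1
    return counts, unassigned


if __name__ == "__main__":
    # Hypothetical annotations for the same locus.
    conservative = {
        "geneA": [("chr1", 100, 200)],
        "geneB": [("chr1", 500, 600)],
    }
    comprehensive = {
        "geneA": [("chr1", 100, 200), ("chr1", 300, 350)],
        "geneB": [("chr1", 500, 600)],
        "geneC": [("chr1", 550, 700)],
    }
    reads = [("chr1", 150, 199), ("chr1", 310, 340),
             ("chr1", 560, 590), ("chr1", 510, 540)]
    print("conservative: ", count_reads_per_gene(reads, conservative))
    print("comprehensive:", count_reads_per_gene(reads, comprehensive))
```

With identical reads, the two toy annotations give different per-gene counts: an extra exon rescues one read, while an overlapping extra gene makes another ambiguous. This is the kind of annotation-dependent difference the study evaluates against ground truth.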

Impact of gene annotation choice on the quantification of RNA-seq data

10.1101/2021.01.07.425794 ◽

2021 ◽

Author(s):

David Chisanga ◽

Yang Liao ◽

Wei Shi

Keyword(s):

Gene Expression ◽

Gene Annotation ◽

Expression Data ◽

Rna Seq ◽

Microarray Expression Data ◽

Refseq Annotation ◽

Sequencing Quality ◽

Gene Expression Quantification ◽

Microarray Expression ◽

Expression Quantification

RNA sequencing is currently the method of choice for genome-wide profiling of gene expression. A popular approach to quantify expression levels of genes from RNA-seq data is to map reads to a reference genome and then count mapped reads to each gene. Gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. There are several major sources of gene annotations that can be used for quantification, such as Ensembl and RefSeq databases. However, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an RNA-seq analysis. In this paper, we present results from our comparison of Ensembl and RefSeq human annotations on their impact on gene expression quantification using a benchmark RNA-seq dataset generated by the SEquencing Quality Control (SEQC) consortium. We show that the use of RefSeq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known titration ratios of gene expression and microarray expression data. We also found that the recent expansion of the RefSeq annotation has led to a decrease in its annotation accuracy. Finally, we demonstrated that the RNA-seq quantification differences observed between different annotations were not affected by the use of different normalization methods.

A strong promoter, PMagpd, provides a tool for high gene expression in entomopathogenic fungus, Metarhizium acridum

Biotechnology Letters ◽

10.1007/s10529-011-0805-3 ◽

2011 ◽

Vol 34 (3) ◽

pp. 557-562 ◽

Cited By ~ 9

Author(s):

Yueqing Cao ◽

Run Jiao ◽

Yuxian Xia

Keyword(s):

Gene Expression ◽

Entomopathogenic Fungus ◽

Metarhizium Acridum ◽

Strong Promoter ◽

High Gene Expression ◽

High Gene