Getting the most out of RNA-seq data analysis

PeerJ ◽  
2015 ◽  
Vol 3 ◽  
pp. e1360 ◽  
Author(s):  
Tsung Fei Khang ◽  
Ching Yee Lau

Background. A common research goal in transcriptome projects is to find genes that are differentially expressed in different phenotype classes. Biologists might wish to validate such gene candidates experimentally, or use them for downstream systems biology analysis. Producing a coherent differential gene expression analysis from RNA-seq count data requires an understanding of how numerous sources of variation, such as the replicate size, the hypothesized biological effect size, and the specific method for making differential expression calls, interact. We believe an explicit demonstration of such interactions in real RNA-seq data sets is of practical interest to biologists. Results. Using two large public RNA-seq data sets, one representing a strong and the other a mild biological effect size, we simulated different replicate size scenarios, and tested the performance of several commonly used methods for calling differentially expressed genes in each of them. We found that, when biological effect size was mild, RNA-seq experiments should focus on experimental validation of differentially expressed gene candidates. Importantly, at least triplicates must be used, and the differentially expressed genes should be called using methods with high positive predictive value (PPV), such as NOISeq or GFOLD. In contrast, when biological effect size was strong, differentially expressed genes mined from unreplicated experiments using NOISeq, ASC and GFOLD had a mean PPV of between 30% and 50%, an increase of more than 30-fold compared to the cases of mild biological effect size. Among methods with good PPV performance, having triplicates or more substantially improved mean PPV to over 90% for GFOLD, 60% for DESeq2, 50% for NOISeq, and 30% for edgeR. At a replicate size of six, we found DESeq2 and edgeR to be reasonable methods for calling differentially expressed genes for systems level analysis, as their trade-off between PPV and sensitivity was superior to that of the other methods. Conclusion. When biological effect size is weak, systems level investigation is not possible using RNA-seq data, and no meaningful result can be obtained in unreplicated experiments. Nonetheless, NOISeq or GFOLD may yield limited numbers of gene candidates with good validation potential when triplicates or more are available. When biological effect size is strong, NOISeq and GFOLD are effective tools for detecting differentially expressed genes in unreplicated RNA-seq experiments for qPCR validation. When triplicates or more are available, GFOLD is a sharp tool for identifying high-confidence differentially expressed genes for targeted qPCR validation; for downstream systems level analysis, combined results from DESeq2 and edgeR are useful.
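
The abstract above compares methods by positive predictive value (PPV) and sensitivity. As a minimal sketch of how these two quantities are usually computed, the Python snippet below scores a set of called genes against a reference set of truly differentially expressed genes. The function name and the gene identifiers are hypothetical and chosen only for illustration; the snippet is not the authors' evaluation code and does not reproduce their reference sets.

```python
def ppv_and_sensitivity(called, truth):
    """Return (PPV, sensitivity) for a set of called genes.

    called: genes a method reports as differentially expressed
    truth:  genes treated as truly differentially expressed (reference set)
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    # PPV: fraction of called genes that are in the reference set
    ppv = true_positives / len(called) if called else float("nan")
    # Sensitivity: fraction of reference genes that were called
    sensitivity = true_positives / len(truth) if truth else float("nan")
    return ppv, sensitivity


# Hypothetical example: 4 genes called, 6 genes in the reference set, 2 shared
called = {"geneA", "geneB", "geneC", "geneD"}
truth = {"geneA", "geneB", "geneE", "geneF", "geneG", "geneH"}
ppv, sens = ppv_and_sensitivity(called, truth)
print(f"PPV = {ppv:.2f}, sensitivity = {sens:.2f}")
```

A method that calls few genes can reach a high PPV while having low sensitivity, which is the trade-off the abstract refers to when it distinguishes validation work from systems level analysis.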

2015 ◽  
Author(s):  
Tsung Fei Khang ◽  
Ching Yee Lau

Background: A common research goal in transcriptome projects is to find genes that are differentially expressed in different phenotype classes. Biologists might wish to validate such gene candidates experimentally or use them for downstream systems biology analysis. Producing a coherent differential expression analysis from RNA-seq count data requires an understanding of how numerous sources of variation, such as the replicate size, the hypothesized biological effect, and the specific method for making differential expression calls, interact. We believe an explicit demonstration of such interactions in real RNA-seq data sets is of practical interest to the biologist. Results: Using two large public RNA-seq data sets, one representing a strong and the other a mild biological response, we simulated different replicate size scenarios and tested the performance of several commonly used methods for calling differentially expressed genes in each of them. Our results suggest that if the biological response of interest in the different phenotype classes is expected to be mild, then RNA-seq experiments should focus on validation of differentially expressed gene candidates. At least triplicates must be used, and the differentially expressed genes should be called using methods with high positive predictive value (PPV) such as NOISeq or GFOLD. In contrast, for strong biological response, differentially expressed genes mined from unreplicated experiments using NOISeq, ASC and GFOLD had a mean PPV of between 30% and 50%, an increase of more than 30-fold compared to the case of mild biological response. Among methods with good PPV performance, having triplicates or more substantially improved mean PPV to over 90% for GFOLD, 60% for DESeq2, 50% for NOISeq, and 30% for edgeR. We found DESeq2 to be the most reasonable method to call differentially expressed genes for systems level analysis as it showed the best trade-off between PPV and sensitivity (mean PPV and mean sensitivity ∼65% at a replicate size of six). Conclusion: When biological effect size is strong, NOISeq and GFOLD are effective tools for detecting differentially expressed genes in unreplicated RNA-seq experiments for validation work. Having triplicates or more enables DESeq2 to detect sufficiently large numbers of reliable gene candidates for downstream systems level analysis. When biological effect size is weak, systems level investigation is not possible, and no meaningful result can be obtained in unreplicated experiments. Nonetheless, NOISeq or GFOLD may yield limited numbers of candidates with good validation potential when triplicates or more are available.


Plants ◽  
2021 ◽  
Vol 10 (5) ◽  
pp. 1011
Author(s):  
Junping Xu ◽  
Chang Ho Ahn ◽  
Ju Young Shin ◽  
Pil Man Park ◽  
Hye Ryun An ◽  
...  

Toluene is an industrial raw material and solvent that is found in many everyday products. The concentration of toluene vapor is one of the most important measurements for evaluating air quality. The toluene scavenging ability of different plants has been evaluated, but the mechanism of plant response to toluene is only partially understood. In this study, we performed RNA sequencing (RNA-seq) analysis to detect differential gene expression in toluene-treated and untreated leaves of Ardisia pusilla. A total of 88,444 unigenes were identified by RNA-seq analysis, of which 49,623 were successfully annotated and 4101 were differentially expressed. Gene ontology analysis revealed several subcategories of genes related to toluene response, including cell part, cellular process, organelle, and metabolic processes. We mapped the main metabolic pathways of genes related to toluene response and found that the differentially expressed genes were mainly involved in glycolysis/gluconeogenesis, starch and sucrose metabolism, glycerophospholipid metabolism, carotenoid biosynthesis, phenylpropanoid biosynthesis, and flavonoid biosynthesis. In addition, 53 transcription factors belonging to 13 transcription factor families were identified. We verified 10 differentially expressed genes related to metabolic pathways using quantitative real-time PCR and found that the RNA-seq results were positively correlated with the PCR results, indicating that the transcriptome data were reliable. This study provides insights into the metabolic pathways involved in toluene response in plants.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Guoli Zhang ◽  
Zengqiang Zhao ◽  
Panpan Ma ◽  
Yanying Qu ◽  
Guoqing Sun ◽  
...  

Abstract: Worldwide, Verticillium wilt is among the most harmful diseases in cotton production, causing substantial reductions in yield. While this disease has been extensively researched at the molecular level of the pathogen, the molecular basis of the host response to V. dahliae is yet to be thoroughly investigated. In this study, RNA-seq analysis was carried out on two Gossypium hirsutum L. cultivars infected with V. dahliae, Xinluzao-36 (susceptible) and Zhongzhimian-2 (disease resistant), at 0 h, 24 h, 72 h and 120 h. Statistical analysis revealed that V. dahliae infection elicited differential gene expression responses in the two cotton varieties, but more intensely in the susceptible cultivar than in the resistant cultivar. Data analysis revealed 4241 differentially expressed genes (DEGs) in the LT variety across the three treatment timepoints, and 7657 DEGs in the Vd592 variety across the same timepoints. Six genes were randomly selected for qPCR validation of the RNA-Seq data. Numerous genes involved in disease resistance and defense mechanisms were identified. Further, the RNA-Seq dataset was used to construct a weighted gene co-expression network, and 11 hub genes were identified that encode proteins associated with lignin and immune response, auxin response factor, cell wall and vascular development, microtubule, ascorbate transporter, serine/threonine kinase, and immunity and drought. This research will advance knowledge of pathogen-host interactions and help identify key genes involved in G. hirsutum L. resistance to V. dahliae infection.


Viruses ◽  
2021 ◽  
Vol 13 (2) ◽  
pp. 244 ◽  
Author(s):  
Antonio Victor Campos Coelho ◽  
Rossella Gratton ◽  
João Paulo Britto de Melo ◽  
José Leandro Andrade-Santos ◽  
Rafael Lima Guimarães ◽  
...  

HIV-1 infection elicits complex dynamics in the expression of various host genes. High-throughput sequencing has added a substantial amount of information regarding HIV-1 infection and pathogenesis. RNA sequencing (RNA-Seq) is currently the tool of choice to investigate gene expression in a wide range of experimental settings. This study aims to perform a meta-analysis of RNA-Seq expression profiles in samples of HIV-1 infected CD4+ T cells compared to uninfected cells, to identify genes that are consistently differentially expressed in the context of HIV-1 infection. We selected two studies (22 samples: 15 experimentally infected and 7 mock-infected). We found 208 differentially expressed genes in infected cells when compared to uninfected/mock-infected cells. This result had moderate overlap with previous studies of HIV-1 infection transcriptomics, but we identified 64 genes already known to interact with HIV-1 according to the HIV-1 Human Interaction Database. A gene ontology (GO) analysis revealed enrichment of several pathways involved in immune response, cell adhesion, cell migration, inflammation, apoptosis, and Wnt, Notch and ERK/MAPK signaling.


2019 ◽  
Vol 32 (5) ◽  
pp. 515-526 ◽  
Author(s):  
William E. Fry ◽  
Sean P. Patev ◽  
Kevin L. Myers ◽  
Kan Bao ◽  
Zhangjun Fei

Sporangia of Phytophthora infestans from pure cultures on agar plates are typically used in lab studies, whereas sporangia from leaflet lesions drive natural infections and epidemics. Multiple assays were performed to determine if sporangia from these two sources are equivalent. Sporangia from plate cultures showed much lower rates of indirect germination and produced much less disease in field and moist-chamber tests. This difference in aggressiveness was observed whether the sporangia had been previously incubated at 4°C (to induce indirect germination) or at 21°C (to prevent indirect germination). Furthermore, lesions caused by sporangia from plates produced much less sporulation. RNA-Seq analysis revealed that thousands of the >17,000 P. infestans genes with a RPKM (reads per kilobase of exon model per million mapped reads) >1 were differentially expressed in sporangia obtained from plate cultures of two independent field isolates compared with sporangia of those isolates from leaflet lesions. Among the significant differentially expressed genes (DEGs), putative RxLR effectors were overrepresented, with almost half of the 355 effectors with RPKM >1 being up- or downregulated. DEGs of both isolates include nine flagellar-associated genes, and all were down-regulated in plate sporangia. Ten elicitin genes were also detected as DEGs in both isolates, and nine (including INF1) were up-regulated in plate sporangia. These results corroborate previous observations that sporangia produced from plates and leaflets sometimes yield different experimental results and suggest hypotheses for potential mechanisms. We caution that use of plate sporangia in assays may not always produce results reflective of natural infections and epidemics.
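
The abstract above filters genes by RPKM (reads per kilobase of exon model per million mapped reads). A minimal sketch of that normalization, following the expansion of the acronym, is given below. The function name and the example numbers are hypothetical and are not taken from the study.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    RPKM = read_count / (exon_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (exon_length_bp * total_mapped_reads)
    """
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp exon model,
# in a library with 25 million mapped reads
value = rpkm(500, 2_000, 25_000_000)
print(f"RPKM = {value:.1f}")

# A threshold such as RPKM > 1, as used in the abstract, keeps this gene
passes_filter = value > 1
```

Dividing by both exon length and library size makes values comparable across genes of different lengths and samples of different sequencing depths, which is why a fixed cut-off such as RPKM > 1 can be applied across the whole gene set.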


2021 ◽  
Author(s):  
Chengang Guo ◽  
Zhimin Wei ◽  
Wei Lyu ◽  
Yanlou Geng

Abstract Quinoa saponins have complex and diverse physiological activities. However, the key regulatory genes for quinoa saponin metabolism are not yet well studied. The purpose of this study was to explore genes closely related to quinoa saponin metabolism. Significantly differentially expressed genes in yellow quinoa were first screened using RNA-seq. Then, the key genes for saponin metabolism were selected by gene set enrichment analysis (GSEA) and principal component analysis (PCA). Finally, the specificity of the key genes was verified by hierarchical clustering. The differential analysis yielded 1654 differentially expressed genes after pseudogenes were removed. Of these, 142 were long non-coding genes and 1512 were protein-coding genes. Based on GSEA, 116 key candidate genes were found to be significantly correlated with quinoa saponin metabolism. Through PCA dimension reduction, 57 key genes were finally obtained. Hierarchical cluster analysis further demonstrated that these key genes can clearly separate the four groups of samples. The present results could provide a reference for the breeding of sweet quinoa and would be helpful for the rational utilization of quinoa saponins.


2021 ◽  
Vol 8 ◽  
Author(s):  
Kirsten E. McLoughlin ◽  
Carolina N. Correia ◽  
John A. Browne ◽  
David A. Magee ◽  
Nicolas C. Nalpas ◽  
...  

Bovine tuberculosis, caused by infection with members of the Mycobacterium tuberculosis complex, particularly Mycobacterium bovis, is a major endemic disease affecting cattle populations worldwide, despite the implementation of stringent surveillance and control programs in many countries. The development of high-throughput functional genomics technologies, including RNA sequencing, has enabled detailed analysis of the host transcriptome response to M. bovis infection, particularly at the macrophage and peripheral blood level. In the present study, we have analysed the transcriptome of bovine whole peripheral blood samples collected at −1 week pre-infection and +1, +2, +6, +10, and +12 weeks post-infection time points. Differentially expressed genes were catalogued and evaluated at each post-infection time point relative to the −1 week pre-infection time point and used for the identification of putative candidate host transcriptional biomarkers for M. bovis infection. Differentially expressed gene sets were also used for examination of cellular pathways associated with the host response to M. bovis infection, construction of de novo gene interaction networks enriched for host differentially expressed genes, and time-series analyses to identify functionally important groups of genes displaying similar patterns of expression across the infection time course. A notable outcome of these analyses was identification of a 19-gene transcriptional biosignature of infection consisting of genes increased in expression across the time course from +1 week to +12 weeks post-infection.


2021 ◽  
Author(s):  
Richard J White ◽  
Eirinn Mackay ◽  
Stephen W Wilson ◽  
Elisabeth M Busch-Nentwich

In model organisms, RNA sequencing is frequently used to assess the effect of genetic mutations on cellular and developmental processes. Typically, animals heterozygous for a mutation are crossed to produce offspring with different genotypes. Resultant embryos are grouped by genotype to compare homozygous mutant embryos to heterozygous and wild-type siblings. Genes that are differentially expressed between the groups are assumed to reveal insights into the pathways affected by the mutation. Here we show that in zebrafish, differentially expressed genes are often overrepresented on the same chromosome as the mutation due to different levels of expression of alleles from different genetic backgrounds. Using an incross of haplotype-resolved wild-type fish, we found evidence of widespread allele-specific expression, which appears as differential expression when comparing embryos homozygous for a region of the genome to their siblings. When analysing mutant transcriptomes, this means that differentially expressed genes on the same chromosome as a mutation of interest may not be caused by that mutation. Typically, the genomic location of a differentially expressed gene is not considered when interpreting its importance with respect to the phenotype. This could lead to pathways being erroneously implicated or overlooked due to the noise of spurious differentially expressed genes on the same chromosome as the mutation. These observations have implications for the interpretation of RNA-seq experiments involving outbred animals and non-inbred model organisms.

