Impact of Gene Annotation on RNA-seq Data Analysis

Advancing clinical genomics and precision medicine with GVViZ: FAIR bioinformatics platform for variable gene-disease annotation, visualization, and expression analysis

Human Genomics ◽

10.1186/s40246-021-00336-1 ◽

2021 ◽

Vol 15 (1) ◽

Author(s):

Zeeshan Ahmed ◽

Eduard Gibert Renart ◽

Saman Zeeshan ◽

XinQi Dong

Keyword(s):

Data Analysis ◽

Patient Care ◽

Expression Analysis ◽

High Throughput ◽

Gene Annotation ◽

Next Generation Sequencing Data ◽

RNA-Seq ◽

Sequencing Data ◽

Complex Disorders ◽

Transcriptomics Data

Abstract Background: Genetic disposition is considered critical for identifying subjects at high risk for disease development. Investigating disease-causing genes, together with highly and lowly expressed genes, can support finding the root causes of uncertainties in patient care. However, independent and timely analysis of high-throughput next-generation sequencing data is still a challenge for non-computational biologists and geneticists.

Results: In this manuscript, we present a findable, accessible, interactive, and reusable (FAIR) bioinformatics platform, GVViZ (visualizing genes with disease-causing variants). GVViZ is a user-friendly, cross-platform database application for RNA-seq-driven variable and complex gene-disease data annotation and expression analysis with a dynamic heat map visualization. GVViZ has the potential to find patterns across millions of features and extract actionable information, which can support the early detection of complex disorders and the development of new therapies for personalized patient care. The execution of GVViZ is based on a set of simple instructions that users without a computational background can follow to design and perform customized data analysis. It can assimilate patients' transcriptomics data with public, proprietary, and our in-house developed gene-disease databases to query, easily explore, and access information on gene annotation and classified disease phenotypes with greater visibility and customization. To test its performance and understand the clinical and scientific impact of GVViZ, we present GVViZ analysis for different chronic diseases and conditions, including Alzheimer's disease, arthritis, asthma, diabetes mellitus, heart failure, hypertension, obesity, osteoporosis, and multiple cancer disorders. The results are visualized using GVViZ and can be exported as image (PNG/TIFF) and text (CSV) files that include gene names, Ensembl (ENSG) IDs, quantified abundances, expressed transcript lengths, and annotated oncology and non-oncology diseases.

Conclusions: We emphasize that automated and interactive visualization should be an indispensable component of modern RNA-seq analysis, which is currently not the case. Experts in clinics and researchers in life sciences can use GVViZ to visualize and interpret transcriptomics data, making it a powerful tool to study the dynamics of gene expression and regulation. Furthermore, with successful deployment in clinical settings, GVViZ has the potential to enable high-throughput correlations between patient diagnoses based on clinical and transcriptomics data.
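As a rough illustration of the kind of CSV export the abstract describes (gene name, Ensembl ID, abundance, transcript length, annotated disease), the sketch below writes a small table and labels each gene by expression level. It is not GVViZ code: the column names, the `classify` thresholds, and the placeholder records are all assumptions made for the example.

```python
import csv
import io

# Placeholder records, invented for illustration only (not GVViZ output).
RECORDS = [
    {"gene": "GENE_A", "ensembl_id": "ENSG-EXAMPLE-0001", "abundance": 250.0,
     "transcript_length": 1800, "disease": "hypertension"},
    {"gene": "GENE_B", "ensembl_id": "ENSG-EXAMPLE-0002", "abundance": 0.4,
     "transcript_length": 950, "disease": "asthma"},
    {"gene": "GENE_C", "ensembl_id": "ENSG-EXAMPLE-0003", "abundance": 12.5,
     "transcript_length": 2400, "disease": "obesity"},
]

FIELDS = ["gene", "ensembl_id", "abundance", "transcript_length", "disease", "level"]


def classify(abundance, low=1.0, high=100.0):
    """Label an abundance as low, medium, or high (thresholds are assumed)."""
    if abundance >= high:
        return "high"
    if abundance <= low:
        return "low"
    return "medium"


def export_csv(records):
    """Return the records as CSV text with an added expression-level column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        writer.writerow({**rec, "level": classify(rec["abundance"])})
    return buffer.getvalue()


if __name__ == "__main__":
    print(export_csv(RECORDS))
```

In practice the thresholds would depend on the abundance unit and the dataset; fixed cut-offs are used here only to keep the sketch short.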

Expression and Co-expression Analyses of WRKY, MYB, bHLH and bZIP Transcription Factor Genes in Potato (Solanum tuberosum) Under Abiotic Stress Conditions: RNA-seq Data Analysis

Potato Research ◽

10.1007/s11540-021-09502-3 ◽

2021 ◽

Author(s):

Ertugrul Filiz ◽

Firat Kurt

Keyword(s):

Transcription Factor ◽

Abiotic Stress ◽

Solanum Tuberosum ◽

Data Analysis ◽

Stress Conditions ◽

bZIP Transcription Factor ◽

RNA-Seq ◽

Transcription Factor Genes

RcTGA1 and glucosinolate biosynthesis pathway involvement in the defence of rose against the necrotrophic fungus Botrytis cinerea

BMC Plant Biology ◽

10.1186/s12870-021-02973-z ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Penghua Gao ◽

Hao Zhang ◽

Huijun Yan ◽

Qigang Wang ◽

Bo Yan ◽

...

Keyword(s):

Disease Resistance ◽

Gene Annotation ◽

Infected Plant ◽

Grey Mould ◽

Cascade Reactions ◽

Postharvest Quality ◽

RNA-Seq ◽

Protein Activity ◽

Phenylpropanoid Biosynthesis ◽

Cluster Analysis

Abstract Background: Rose is an important economic crop in horticulture. However, its field growth and postharvest quality are negatively affected by grey mould disease caused by Botrytis cinerea, and it is unclear how rose plants defend themselves against this fungal pathogen. Here, we used transcriptomic, metabolomic and VIGS analyses to explore the mechanism of resistance to Botrytis cinerea.

Results: In this study, a protein activity analysis revealed a significant increase in defence enzyme activities in infected plants. RNA-Seq of plants infected for 0 h, 36 h, 60 h and 72 h produced a total of 54 GB of clean reads. From these data, 3990, 5995 and 8683 differentially expressed genes (DEGs) were found in CK vs. T36, CK vs. T60 and CK vs. T72, respectively (T36, T60 and T72: rose flowers inoculated with Botrytis cinerea for 36 h, 60 h and 72 h). Gene annotation and cluster analysis of the DEGs revealed a variety of defence responses to Botrytis cinerea infection, including resistance (R) proteins, MAPK cascade reactions, plant hormone signal transduction pathways, plant-pathogen interaction pathways, Ca2+ and disease resistance-related genes. qPCR verification showed the reliability of the transcriptome data. The PTRV2-RcTGA1-infected plant material showed increased susceptibility of rose to Botrytis cinerea. A total of 635 metabolites were detected in all samples, which could be divided into 29 groups. Metabonomic data showed that 59, 78 and 74 DEMs were obtained for T36, T60 and T72, respectively, compared to CK. A variety of secondary metabolites related to biological disease resistance, including tannins, amino acids and derivatives, and alkaloids, among others, were significantly increased and enriched in phenylpropanoid biosynthesis, glucosinolates and other disease resistance pathways. This study provides a theoretical basis for breeding new cultivars that are resistant to Botrytis cinerea.

Conclusion: Fifty-four GB of clean reads were generated through RNA-Seq. R proteins, ROS signalling, Ca2+ signalling, MAPK signalling, and SA signalling were activated in the Old Blush response to Botrytis cinerea. RcTGA1 positively regulates rose resistance to Botrytis cinerea. A total of 635 metabolites were detected in all samples. DEMs were enriched in phenylpropanoid biosynthesis, glucosinolates and other disease resistance pathways.

A mutation in LacDWARF1 results in a GA-deficient dwarf phenotype in sponge gourd (Luffa acutangula)

Theoretical and Applied Genetics ◽

10.1007/s00122-021-03938-4 ◽

2021 ◽

Author(s):

Gangjun Zhao ◽

Caixia Luo ◽

Jianning Luo ◽

Junxing Li ◽

Hao Gong ◽

...

Keyword(s):

Gene Annotation ◽

Recessive Gene ◽

Genomic Region ◽

Dwarf Mutant ◽

RNA-Seq ◽

Dwarf Phenotype ◽

Sponge Gourd ◽

Response To Stress ◽

Luffa Acutangula ◽

Generation Sequencing

Abstract Key message: A dwarfism gene, LacDWARF1, was mapped by combined BSA-Seq and comparative genomics analyses to a 65.4 kb physical genomic region on chromosome 5.

Dwarf architecture is one of the most important traits utilized in Cucurbitaceae breeding because it saves labor and increases the harvest index. To our knowledge, there has been no prior research about dwarfism in the sponge gourd. This study reports the first dwarf mutant, WJ209, with a decrease in cell size and internodes. A genetic analysis revealed that the mutant phenotype was controlled by a single recessive gene, which is designated Lacdwarf1 (Lacd1). By combining bulked segregant analysis with next-generation sequencing, we quickly mapped Lacd1 to a 65.4 kb region on chromosome 5 using an F2 segregating population with InDel and SNP polymorphism markers. Gene annotation revealed that Lac05g019500 encodes a gibberellin 3β-hydroxylase (GA3ox) and is the most likely candidate gene for Lacd1. DNA sequence analysis showed that there is an approximately 4 kb insertion in the first intron of Lac05g019500 in WJ209. Lac05g019500 is transcribed incorrectly in the dwarf mutant owing to the presence of the insertion. Moreover, the bioactive GAs decreased significantly in WJ209, and the dwarf phenotype could be restored by exogenous GA3 treatment, indicating that WJ209 is a GA-deficient mutant. All these results support the conclusion that Lac05g019500 is the Lacd1 gene. In addition, RNA-Seq revealed that many genes, including those related to plant hormones, cellular process, cell wall, membrane and response to stress, were significantly altered in WJ209 compared with the wild type. This study will aid in the use of molecular marker-assisted breeding in the dwarf sponge gourd.

Impact of RNA-seq data analysis algorithms on gene expression estimation and downstream prediction

Scientific Reports ◽

10.1038/s41598-020-74567-y ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Li Tong ◽

◽

Po-Yen Wu ◽

John H. Phan ◽

Hamid R. Hassazadeh ◽

...

Keyword(s):

Gene Expression ◽

Data Analysis ◽

Disease Outcome ◽

RNA-Seq ◽

Next Generation Sequencing Technology ◽

Normalization Methods ◽

The US ◽

Sequencing Quality ◽

Improved Accuracy ◽

The Impact

Abstract To use next-generation sequencing technology such as RNA-seq for medical and health applications, choosing proper analysis methods for biomarker identification remains a critical challenge for most users. The US Food and Drug Administration (FDA) has led the Sequencing Quality Control (SEQC) project to conduct a comprehensive investigation of 278 representative RNA-seq data analysis pipelines consisting of 13 sequence mapping, three quantification, and seven normalization methods. In this article, we focused on the joint effects of RNA-seq pipeline components on gene expression estimation as well as on the downstream prediction of disease outcomes. First, we developed and applied three metrics (i.e., accuracy, precision, and reliability) to quantitatively evaluate each pipeline's performance on gene expression estimation. We then investigated the correlation between the proposed metrics and the downstream prediction performance using two real-world cancer datasets (i.e., the SEQC neuroblastoma dataset and the NIH/NCI TCGA lung adenocarcinoma dataset). We found that RNA-seq pipeline components jointly and significantly impacted the accuracy of gene expression estimation, and that this impact extended to the downstream prediction of these cancer outcomes. Specifically, RNA-seq pipelines that produced more accurate, precise, and reliable gene expression estimation tended to perform better in the prediction of disease outcome. Finally, we provided scenarios as guidelines for using these three metrics to select sensible RNA-seq pipelines that improve the accuracy, precision, and reliability of gene expression estimation, which in turn improves downstream gene expression-based prediction of disease outcome.
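The abstract mentions correlating per-pipeline quality metrics with downstream prediction performance but does not say which correlation measure was used. The sketch below uses Spearman rank correlation as a stand-in, implemented with the standard library only. The pipeline scores and AUC values are invented placeholders, and the rank function assumes there are no tied values.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-pipeline values, for illustration only.
accuracy_metric = [0.61, 0.72, 0.80, 0.93]
prediction_auc = [0.64, 0.70, 0.69, 0.81]

if __name__ == "__main__":
    print(round(spearman(accuracy_metric, prediction_auc), 3))
```

A rank-based measure is chosen here only because it does not assume a linear relationship between a metric and prediction performance; the original study may have used a different statistic.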

BioJupies: Automated Generation of Interactive Notebooks for RNA-Seq Data Analysis in the Cloud

Cell Systems ◽

10.1016/j.cels.2018.10.007 ◽

2018 ◽

Vol 7 (5) ◽

pp. 556-561.e3 ◽

Cited By ~ 61

Author(s):

Denis Torre ◽

Alexander Lachmann ◽

Avi Ma’ayan

Keyword(s):

Data Analysis ◽

RNA-Seq ◽

Automated Generation

Maximizing prediction of orphan genes in assembled genomes

10.1101/2019.12.17.880294 ◽

2019 ◽

Cited By ~ 2

Author(s):

Arun Seetharam ◽

Urminder Singh ◽

Jing Li ◽

Priyanka Bhandary ◽

Zeb Arendsee ◽

...

Keyword(s):

Sequence Homology ◽

Evolutionary History ◽

Direct Evidence ◽

Gene Annotation ◽

RNA-Seq ◽

Orphan Genes ◽

Orphan Gene ◽

New Genes ◽

Conserved Genes ◽

Rapid Emergence

Abstract The evolutionarily rapid emergence of new genes gives rise to "orphan genes" that share no sequence homology to genes in closely related genomes. These genes provide organisms with a reservoir of genetic elements to quickly respond to changing selection pressures. Gene annotation pipelines that combine ab initio machine learning with sequence homology-based searches are efficient in identifying basal genes with a long evolutionary history. However, their ability to identify orphan genes and other young genes has not been systematically evaluated. Here, we classify the phylostrata of curated Arabidopsis thaliana genes and use these to assess the ability of two of the most prevalent annotation pipelines, MAKER and BRAKER, to predict orphans and other young genes. MAKER predictions are highly dependent on the RNA-Seq evidence, predicting between 11% and 60% of the orphan genes and 95% to 98% of the basal genes in the annotated genome of Arabidopsis. In contrast, BRAKER consistently predicts 33% of orphan genes and 98% of basal genes. A less commonly used method to identify genes is to align RNA-Seq data directly to the genome sequence. We present a Findable, Accessible, Interoperable and Reusable (FAIR) approach, called BIND, that mitigates the under-prediction of orphan genes. BIND combines BRAKER predictions with direct evidence-based inference of transcripts based on RNA-Seq alignments to the genome. BIND increases the number and accuracy of orphan gene predictions, identifying 68% of Araport11-annotated orphan genes and 99% of the conserved genes.
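The idea of combining two prediction sources and measuring how many annotated genes are recovered can be shown with a toy sketch. This is a deliberate simplification and not the BIND implementation: real pipelines merge gene models by genomic coordinates, whereas the sketch treats each prediction as a matched gene identifier. All identifiers and set contents are hypothetical.

```python
def sensitivity(predicted, reference):
    """Fraction of reference genes that appear in the predicted set."""
    if not reference:
        raise ValueError("reference set must not be empty")
    return len(predicted & reference) / len(reference)


def combine_predictions(ab_initio, direct_inference):
    """Union of two prediction sets (a simplification of model merging)."""
    return ab_initio | direct_inference


# Hypothetical gene identifiers, for illustration only.
reference_orphans = {"orphan1", "orphan2", "orphan3", "orphan4"}
ab_initio_calls = {"orphan1", "basal1", "basal2"}
direct_inference_calls = {"orphan2", "orphan3", "basal1"}

if __name__ == "__main__":
    combined = combine_predictions(ab_initio_calls, direct_inference_calls)
    print(sensitivity(ab_initio_calls, reference_orphans))
    print(sensitivity(combined, reference_orphans))
```

In this toy example the combined set recovers more reference orphans than the ab initio calls alone, which mirrors the direction of the improvement reported in the abstract but says nothing about its size.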

RNA-Seq Data Analysis (Bowtie-TopHat-Cufflinks) v3 (protocols.io.x9qfr5w)

protocols.io ◽

10.17504/protocols.io.x9qfr5w ◽

2019 ◽

Author(s):

Kiichi Hirota

Keyword(s):

Data Analysis ◽

RNA-Seq

TranscriptomeReconstructoR, A Data-Driven Annotation of Complex Transcriptomes

10.21203/rs.3.rs-131404/v1 ◽

2020 ◽

Author(s):

Maxim Ivanov ◽

Albin Sandelin ◽

Sebastian Marquardt

Keyword(s):

De Novo ◽

Gene Annotation ◽

R Package ◽

Sequence Information ◽

RNA-Seq ◽

Sequencing Data ◽

Gene Model ◽

Preparation Methods ◽

Downstream Analysis

Abstract Background: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing volume of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data.

Results: We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: i) full-length RNA-seq for detection of splicing patterns and ii) high-throughput 5' and 3' tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. We reconstructed de novo the transcriptional landscape of wild type Arabidopsis thaliana seedlings as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A. thaliana genome research.

Conclusions: Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline only requires prior knowledge of the reference genomic DNA sequence, but not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.

A Survey of Bioinformatics-Based Tools in RNA-Sequencing (RNA-Seq) Data Analysis

Translational Bioinformatics and Its Application - Translational Medicine Research ◽

10.1007/978-94-024-1045-7_10 ◽

2017 ◽

pp. 223-248 ◽

Cited By ~ 1

Author(s):

Pallavi Gaur ◽

Anoop Chaturvedi

Keyword(s):

Data Analysis ◽

RNA Sequencing ◽

RNA-Seq
