High-quality rice RNA-seq-based co-expression network for predicting gene function and regulation

Mapping Intimacies ◽

10.1101/138040 ◽

2017 ◽

Cited By ~ 1

Author(s):

Hua Yu ◽

Bingke Jiao ◽

Chengzhi Liang

Keyword(s):

Large Scale ◽

Network Inference ◽

Agronomic Traits ◽

Enrichment Analysis ◽

Circular Rna ◽

Rna Seq ◽

Sequencing Data ◽

High Quality ◽

Crop Species ◽

Functional Link

Abstract: Inferring genome-scale gene co-expression networks is important for understanding the genetic architecture underlying complex and varied biological phenotypes. The recent availability of large-scale RNA-seq data provides great potential for co-expression network inference. In this study, we present, for the first time, a novel heterogeneous ensemble pipeline that integrates three frequently used inference methods to build a high-quality RNA-seq-based gene co-expression network (GCN) in rice, an important monocot species. The quality of the network obtained by the proposed method was first evaluated and verified against curated positive and negative gene functional link datasets, on which it clearly outperformed each single method. Second, the network's capability for associating unknown genes with biological functions and agronomic traits was shown by enrichment analysis and case studies. In particular, we demonstrated the potential application of the method to predicting the biological roles of long non-coding RNA (lncRNA) and circular RNA (circRNA) genes. Our results provide a valuable data source for selecting candidate genes for further experimental validation in rice genetics research and breeding. To enhance identification of novel genes regulating important biological processes and agronomic traits in rice and other crop species, we released the source code for constructing high-quality RNA-seq-based GCNs, together with the rice RNA-seq-based GCN; both can be freely downloaded at https://github.com/czllab/NetMiner.
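
The abstract does not say which three inference methods are combined or how their outputs are merged. As a rough illustration of what a heterogeneous ensemble over edge scores can look like, the sketch below averages each edge's rank across three methods. The method names, the toy scores, and the choice of mean-rank aggregation are all assumptions made for this example; they are not taken from NetMiner.

```python
from statistics import mean


def rank_scores(scores):
    """Map each edge to its rank (1 = strongest) under one method's scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {edge: rank for rank, edge in enumerate(ordered, start=1)}


def ensemble_by_mean_rank(*method_scores):
    """Order edges by their average rank across all supplied methods."""
    edges = set(method_scores[0])
    for scores in method_scores[1:]:
        edges &= set(scores)  # keep only edges scored by every method
    ranks = [rank_scores({e: s[e] for e in edges}) for s in method_scores]
    combined = {e: mean(r[e] for r in ranks) for e in edges}
    return sorted(combined, key=combined.get)


# Toy edge scores from three hypothetical methods (higher = stronger link).
method_a = {("g1", "g2"): 0.9, ("g1", "g3"): 0.2, ("g2", "g3"): 0.5}
method_b = {("g1", "g2"): 0.7, ("g1", "g3"): 0.1, ("g2", "g3"): 0.8}
method_c = {("g1", "g2"): 0.6, ("g1", "g3"): 0.3, ("g2", "g3"): 0.4}

consensus = ensemble_by_mean_rank(method_a, method_b, method_c)
```

Working on ranks rather than raw scores sidesteps the fact that different inference methods usually report scores on incomparable scales.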

Gene Expression Profile in Similar Tissues Using Transcriptome Sequencing Data of Whole-Body Horse Skeletal Muscle

Genes ◽

10.3390/genes11111359 ◽

2020 ◽

Vol 11 (11) ◽

pp. 1359

Author(s):

Ho-Yeon Lee ◽

Jae-Yoon Kim ◽

Kyoung Hyoun Kim ◽

Seongmun Jeong ◽

Youngbum Cho ◽

...

Keyword(s):

Enrichment Analysis ◽

Gene Set Enrichment Analysis ◽

Whole Body ◽

Rna Seq ◽

Sequencing Data ◽

Exercise Adaptation ◽

Wide Range ◽

Metabolic Properties ◽

Transcriptome Expression ◽

Functional Pathway

Unlike most livestock, horses have been studied for exercise function rather than food production. As a result, the role and characteristics of their tissue landscapes are critically understudied, except for certain muscles used in exercise-related studies. In the present study, we compared RNA-Seq data from 18 Jeju horse skeletal muscles to identify differentially expressed genes (DEGs) between tissues that have similar functions and to characterize these differences. We identified DEGs between different muscles using pairwise differential expression (DE) analyses of tissue transcriptome expression data and classified the samples using the expression values of those genes. The tissues were broadly classified into two groups and their subgroups by k-means clustering. The DEGs identified in comparisons between the groups were analyzed at the functional/pathway level using gene set enrichment analysis, and at the gene level by confirming the expression of significant genes. The analysis detected differences between the groups in metabolic properties such as glycolysis, oxidative phosphorylation, and exercise adaptation. The results demonstrate that the biochemical and anatomical features of a wide range of muscle tissues in horses can be determined through transcriptome expression analysis, and provide proof-of-concept data showing that RNA-Seq analysis can be used to classify and study in depth the differences between tissues with similar properties.

SSCC: a novel computational framework for rapid and accurate clustering large single cell RNA-seq data

10.1101/344242 ◽

2018 ◽

Cited By ~ 2

Author(s):

Xianwen Ren ◽

Liangtao Zheng ◽

Zemin Zhang

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Large Scale ◽

Random Projection ◽

Rna Seq ◽

Sequencing Data ◽

Computational Framework ◽

Human Blood Cells ◽

Single Cell Rna Sequencing ◽

Data Volume

Abstract: Clustering is a prevalent means of analyzing single-cell RNA sequencing data, but the rapidly expanding data volume can make this process computationally challenging. New methods for both accurate and efficient clustering are urgently needed. Here we propose a new clustering framework based on random projection and feature construction for large-scale single-cell RNA sequencing data, which greatly improves clustering accuracy, robustness, and computational efficiency for various state-of-the-art algorithms benchmarked on multiple real datasets. On a dataset with 68,578 human blood cells, our method reached a 20% improvement in clustering accuracy and a 50-fold acceleration while using only 66% of the memory required by the widely used software package SC3. Compared to k-means, the accuracy improvement can reach 3-fold, depending on the dataset. An R implementation of the framework is available from https://github.com/Japrin/sscClust.
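
The abstract names random projection as one ingredient but gives no implementation detail. The sketch below shows only the generic idea: multiplying an expression matrix by a random Gaussian matrix maps each cell to far fewer dimensions while roughly preserving pairwise distances, so a downstream clustering step has less data to process. The function name, the toy matrix, and the chosen dimensions are assumptions for illustration; this is not the sscClust code.

```python
import math
import random


def random_projection(rows, out_dim, seed=0):
    """Project each row onto out_dim random Gaussian directions."""
    rng = random.Random(seed)
    in_dim = len(rows[0])
    # Entries are N(0, 1) scaled by 1/sqrt(out_dim), so squared distances
    # between rows are preserved in expectation.
    matrix = [
        [rng.gauss(0.0, 1.0) / math.sqrt(out_dim) for _ in range(out_dim)]
        for _ in range(in_dim)
    ]
    return [
        [sum(x * matrix[i][j] for i, x in enumerate(row)) for j in range(out_dim)]
        for row in rows
    ]


# Toy "expression matrix": 6 cells by 50 genes of random values.
data_rng = random.Random(1)
cells = [[data_rng.random() for _ in range(50)] for _ in range(6)]

projected = random_projection(cells, out_dim=10)
```

Any clustering algorithm can then be run on `projected` in place of `cells`; how the published framework combines this step with feature construction is not described in the abstract.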

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283 ◽

2018 ◽

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a wide range of applications, from basic science studies to agricultural and clinical settings. In the past 10 years or so, much has been learned about the characteristics of RNA-seq datasets as well as the performance of the myriad methods developed. In this review, we give an overview of developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

Circall: fast and accurate methodology for discovery of circular RNAs from paired-end RNA-sequencing data

BMC Bioinformatics ◽

10.1186/s12859-021-04418-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Dat Thanh Nguyen ◽

Quang Thinh Trac ◽

Thi-Hau Nguyen ◽

Ha-Nam Nguyen ◽

Nir Ohad ◽

...

Keyword(s):

Rna Sequencing ◽

Simulated Data ◽

High Sensitivity ◽

Circular Rna ◽

Computational Time ◽

Circular Rnas ◽

Rna Seq ◽

Sequencing Data ◽

Mapping Algorithm ◽

False Discovery Rate Method

Abstract. Background: Circular RNA (circRNA) is an emerging class of RNA molecules attracting researchers due to its potential for serving as markers for diagnosis, prognosis, or therapeutic targets of cancer, cardiovascular, and autoimmune diseases. Current methods for detection of circRNA from RNA sequencing (RNA-seq) focus mostly on improving mapping quality of reads supporting the back-splicing junction (BSJ) of a circRNA to eliminate false positives (FPs). We show that mapping information alone often cannot predict if a BSJ-supporting read is derived from a true circRNA or not, thus increasing the rate of FP circRNAs. Results: We have developed Circall, a novel circRNA detection method from RNA-seq. Circall controls the FPs using a robust multidimensional local false discovery rate method based on the length and expression of circRNAs. It is computationally highly efficient by using a quasi-mapping algorithm for fast and accurate RNA read alignments. We applied Circall on two simulated datasets and three experimental datasets of human cell-lines. The results show that Circall achieves high sensitivity and precision in the simulated data. In the experimental datasets it performs well against current leading methods. Circall is also substantially faster than the other methods, particularly for large datasets. Conclusions: With those better performances in the detection of circRNAs and in computational time, Circall facilitates the analyses of circRNAs in large numbers of samples. Circall is implemented in C++ and R, and available for use at https://www.meb.ki.se/sites/biostatwiki/circall and https://github.com/datngu/Circall.

Identification and characterization of circRNAs in the deposition of intramuscular fat in Aohan fine-wool sheep

10.21203/rs.3.rs-81001/v1 ◽

2020 ◽

Author(s):

Le Zhao ◽

Nan Liu ◽

Fuhui Han ◽

Lisheng Zhou ◽

Lirong Liu ◽

...

Keyword(s):

Interaction Analysis ◽

Age Groups ◽

Wnt Signaling Pathway ◽

Enrichment Analysis ◽

Circular Rna ◽

Intramuscular Fat ◽

Sheep Breed ◽

Differentially Expressed ◽

Sphingolipid Metabolism ◽

Rna Seq

Abstract. Background: Aohan fine-wool sheep (AFWS) is a high-quality fine-wool sheep breed that supplies both wool and meat. The quality of its meat is affected by many factors. Research is needed on the molecular mechanism of intramuscular fat (IMF) growth, which greatly improves mutton quality. The widely expressed non-coding RNA is used in roles such as competitive endogenous RNAs (ceRNAs), including microRNAs (miRNAs). Although circular RNA (circRNA) has been studied in many fields, little research has been devoted to its role in IMF in sheep. We used RNA-Seq to analyze tissues associated with IMF in 2-month-old and 12-month-old AFWS rams to understand the role of circRNA in the growth and development of sheep IMF. Results: A total of 11,565 candidate circRNAs were identified, of which 104 were differentially expressed between the two age groups. We analyzed these differentially expressed circRNAs. Enrichment analysis was performed using Gene Ontology and the Kyoto Encyclopedia of Genes and Genomes. The enriched pathways included lipid transport (GO:0006869), negative regulation of canonical Wnt signaling pathway (GO:0090090), fat digestion and absorption (ko04975), and sphingolipid metabolism (ko00600). We used the TargetScan and miRanda software programs for interaction analysis, and a network diagram was created. Six randomly selected circRNAs were used to verify the RNA-Seq results by quantitative real-time PCR. Conclusion: This study provides more information on circRNA regulation in AFWS and is a useful resource for further research on this sheep breed.

Quality Assessment of Domesticated Animal Genome Assemblies

Bioinformatics and Biology Insights ◽

10.4137/bbi.s29333 ◽

2015 ◽

Vol 9S4 ◽

pp. BBI.S29333 ◽

Cited By ~ 3

Author(s):

Stefan E. Seemann ◽

Christian Anthon ◽

Oana Palasca ◽

Jan Gorodkin

Keyword(s):

High Throughput Sequencing ◽

Genomic Sequence ◽

Rna Seq ◽

Sequencing Data ◽

Assembly Quality ◽

High Quality ◽

Rnaseq Data ◽

Genome Assemblies ◽

Animal Genomes

The era of high-throughput sequencing has made it relatively simple to sequence genomes and transcriptomes of individuals from many species. In order to analyze the resulting sequencing data, high-quality reference genome assemblies are required. However, this is still a major challenge, and many domesticated animal genomes still need to be sequenced deeper in order to produce high-quality assemblies. Meanwhile, ironically, the amount of RNA-seq and other next-generation data produced frequently far exceeds that of the genomic sequence. Furthermore, basic comparative analysis is often affected by the lack of genomic sequence. Herein, we quantify the quality of the genome assemblies of 20 domesticated animals and related species by assessing a range of measurable parameters, and we show that there is a positive correlation between the fraction of mappable reads from RNA-seq data and genome assembly quality. We rank the genomes by their assembly quality and discuss the implications for genotype analyses.

Profiling Analysis of Circular RNA and mRNA in Human Temporal Lobe Epilepsy with Hippocampal Sclerosis ILAE Type 1

10.21203/rs.3.rs-331427/v2 ◽

2021 ◽

Author(s):

Yifei Gu ◽

Hongmei Wu ◽

Tianyu Wang ◽

Shengkun Yu ◽

Zhibin Han ◽

...

Keyword(s):

Temporal Lobe Epilepsy ◽

Temporal Lobe ◽

Hippocampal Sclerosis ◽

Enrichment Analysis ◽

Chloride Ion ◽

Circular Rna ◽

Sequencing Data ◽

Vesicle Membrane ◽

Transcriptional Changes

Abstract: Hippocampal sclerosis (HS) is the most common surgical pathology associated with temporal lobe epilepsy (TLE). However, the cause of TLE with or without HS remains unknown. Our current study aimed to illustrate the essential molecular mechanism that is potentially involved in the pathogenesis of TLE-HS and to shed light on the transcriptional changes associated with hippocampal sclerosis. Compared to the no-HS group, 341 mRNA transcripts and 131 circRNA transcripts were differentially expressed in the ILAE type 1 group. The raw sequencing data have been deposited in the Sequence Read Archive (SRA) database under accession number PRJNA699348. Gene Ontology analysis demonstrated that the dysregulated genes were associated with the biological processes of vesicle-mediated transport. Enrichment analysis demonstrated that dysregulated genes were involved mainly in the MAPK signaling pathway. Subsequently, a total of 441 known or predicted interactions were formed among DEGs, and the most important module was detected in the PPI network using the MCODE plug-in. Four main functional modules were enriched: ER to Golgi transport vesicle membrane, basal transcription factors, GABA-gated chloride ion channel activity, and CENP-A containing nucleosome assembly. A circRNA-mRNA co-expression network was constructed including five circRNAs (hsa_circ_0025349, hsa_circ_0002405, hsa_circ_0004805, hsa_circ_0032254, and hsa_circ_0032875) and three mRNAs (FYN, SELENBP1, and GRIPAP1) based on the normalized mRNA signal intensities. This is the first report of the circRNA and mRNA expression profiles of surgically resected hippocampal tissues from TLE patients with ILAE type 1 and no-HS, and these results may provide new insight into the transcriptional changes associated with this pathology.

Improving the diagnostic yield of exome-sequencing, by predicting gene-phenotype associations using large-scale gene expression analysis

10.1101/375766 ◽

2018 ◽

Cited By ~ 4

Author(s):

Patrick Deelen ◽

Sipko van Dam ◽

Johanna C. Herkert ◽

Juha M. Karjalainen ◽

Harm Brugge ◽

...

Keyword(s):

Gene Expression ◽

Large Scale ◽

Gene Expression Analysis ◽

Diagnostic Yield ◽

Genetic Diagnosis ◽

Added Value ◽

Disease Genes ◽

Rna Seq ◽

Sequencing Data ◽

Causative Gene

Abstract: Clinical interpretation of exome and genome sequencing data remains challenging and time consuming, with many variants with unknown effects found in genes with unknown functions. Automated prioritization of these variants can improve the speed of current diagnostics and identify previously unknown disease genes. Here, we used 31,499 RNA-seq samples to predict the phenotypic consequences of variants in genes. We developed GeneNetwork Assisted Diagnostic Optimization (GADO), a tool that uses these predictions in combination with a patient’s phenotype, denoted using HPO terms, to prioritize identified variants and ease interpretation. GADO is unique because it does not rely on existing knowledge of a gene and can therefore prioritize variants missed by tools that rely on existing annotations or pathway membership. In a validation trial on patients with a known genetic diagnosis, GADO prioritized the causative gene within the top 3 for 41% of the cases. Applying GADO to a cohort of 38 patients without a genetic diagnosis yielded new candidate genes for seven cases. Our results highlight the added value of GADO (www.genenetwork.nl) for increasing diagnostic yield and for implicating previously unknown disease-causing genes.

Reducing INDEL calling errors in whole-genome and exome sequencing data

10.1101/006148 ◽

2014 ◽

Cited By ~ 2

Author(s):

Han Fang ◽

Yiyang Wu ◽

Giuseppe Narzisi ◽

Jason A. O'Rawe ◽

Laura T. Jimenez Barrón ◽

...

Keyword(s):

Exome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Published Data ◽

Whole Genome ◽

Sequencing Data ◽

High Quality ◽

Indel Detection ◽

Validation Experiment ◽

Large Indels

Background: INDELs, especially those disrupting protein-coding regions of the genome, have been strongly associated with human diseases. However, there are still many errors with INDEL variant calling, driven by library preparation, sequencing biases, and algorithm artifacts. Methods: We characterized whole genome sequencing (WGS), whole exome sequencing (WES), and PCR-free sequencing data from the same samples to investigate the sources of INDEL errors. We also developed a classification scheme based on the coverage and composition to rank high and low quality INDEL calls. We performed a large-scale validation experiment on 600 loci, and find high-quality INDELs to have a substantially lower error rate than low quality INDELs (7% vs. 51%). Results: Simulation and experimental data show that assembly based callers are significantly more sensitive and robust for detecting large INDELs (>5bp) than alignment based callers, consistent with published data. The concordance of INDEL detection between WGS and WES is low (52%), and WGS data uniquely identifies 10.8-fold more high-quality INDELs. The validation rate for WGS-specific INDELs is also much higher than that for WES-specific INDELs (85% vs. 54%), and WES misses many large INDELs. In addition, the concordance for INDEL detection between standard WGS and PCR-free sequencing is 71%, and standard WGS data uniquely identifies 6.3-fold more low-quality INDELs. Furthermore, accurate detection with Scalpel of heterozygous INDELs requires 1.2-fold higher coverage than that for homozygous INDELs. Lastly, homopolymer A/T INDELs are a major source of low-quality INDEL calls, and they are highly enriched in the WES data. Conclusions: Overall, we show that accuracy of INDEL detection with WGS is much greater than WES even in the targeted region. We calculated that 60X WGS depth of coverage from the HiSeq platform is needed to recover 95% of INDELs detected by Scalpel. While this is higher than current sequencing practice, the deeper coverage may save total project costs because of the greater accuracy and sensitivity. Finally, we investigate sources of INDEL errors (e.g. capture deficiency, PCR amplification, homopolymers) with various data that will serve as a guideline to effectively reduce INDEL errors in genome sequencing.

Optimal balancing of clinical factors in large scale clinical RNA-Seq studies

10.1101/2021.06.30.450639 ◽

2021 ◽

Author(s):

Austin W.T. CHIANG ◽

Vahid H Gazestani ◽

Mia G. Altieri ◽

Eric Courchesne ◽

Nathan E. Lewis

Keyword(s):

Large Scale ◽

Sample Selection ◽

Empirical Support ◽

Internal Validity ◽

Enrichment Analysis ◽

Superior Performance ◽

Rna Seq ◽

Clinical Factors ◽

Batch Correction ◽

Post Hoc

Omics technologies are ubiquitous in biomedical research. However, improper sample selection is an often-overlooked complication with large omics studies, resulting in confounding effects that can disrupt the internal validity of a study and lead to false conclusions. Here, we present a method called BalanceIT, which uses a genetic algorithm to identify an optimal set of samples with balanced clinical factors for large-scale omics experiments. We apply our approach to two large RNA-Seq studies in autism (1) to find a post-hoc balanced sample set among an imbalanced study, and (2) to design an optimal study that allows for efficient batch correction. Our approach leads to near-perfect estimates of differential gene expression, superior performance of pathway-level enrichment analysis, and consistent network dysregulation patterns of autism symptom severity. These results provide empirical support for the importance of balanced experimental design, and BalanceIT will be invaluable for large-scale study design and batch effect correction.
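
The abstract says BalanceIT uses a genetic algorithm to pick a balanced sample set but does not describe its encoding or objective. The sketch below is a deliberately reduced stand-in: an evolutionary search that uses mutation and elitist selection only (no crossover) to choose equal-sized case and control subsets whose mean age differs as little as possible. The single factor (age), the synthetic cohort, and every function name and parameter here are assumptions for illustration and do not reflect the actual BalanceIT implementation.

```python
import random
from statistics import mean


def imbalance(selection, ages):
    """Absolute difference in mean age between selected cases and controls."""
    cases, controls = selection
    return abs(mean(ages[i] for i in cases) - mean(ages[i] for i in controls))


def mutate(selection, case_pool, control_pool, rng):
    """Swap one selected sample for an unselected one from the same group."""
    cases, controls = list(selection[0]), list(selection[1])
    chosen, pool = (cases, case_pool) if rng.random() < 0.5 else (controls, control_pool)
    unused = [i for i in pool if i not in chosen]
    if unused:
        chosen[rng.randrange(len(chosen))] = rng.choice(unused)
    return (tuple(cases), tuple(controls))


def evolve(ages, case_pool, control_pool, per_group,
           generations=200, pop_size=20, seed=0):
    """Search for a selection with minimal age imbalance between groups."""
    rng = random.Random(seed)
    population = [
        (tuple(rng.sample(case_pool, per_group)),
         tuple(rng.sample(control_pool, per_group)))
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        population.sort(key=lambda s: imbalance(s, ages))
        survivors = population[: pop_size // 2]  # elitism: best half kept as-is
        children = [
            mutate(rng.choice(survivors), case_pool, control_pool, rng)
            for _ in survivors
        ]
        population = survivors + children
    return min(population, key=lambda s: imbalance(s, ages))


# Synthetic cohort: 30 cases (indices 0-29) that skew older than 30 controls.
data_rng = random.Random(42)
ages = ([data_rng.uniform(2, 6) for _ in range(30)]
        + [data_rng.uniform(1, 4) for _ in range(30)])
case_pool = list(range(0, 30))
control_pool = list(range(30, 60))

best = evolve(ages, case_pool, control_pool, per_group=10)
```

Because the best half of each generation is carried over unchanged, the imbalance of the best selection never gets worse from one generation to the next. A realistic objective would score several clinical factors at once rather than a single one.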
