Framework for reanalysis of publicly available Affymetrix® GeneChip® data sets based on functional regions of interest

2017 ◽  
Author(s):  
Ernur Saka ◽  
Benjamin J. Harrison ◽  
Kirk West ◽  
Jeffrey C. Petruska ◽  
Eric C. Rouchka

Abstract Background: Since the introduction of microarrays in 1995, researchers worldwide have used both commercial and custom-designed microarrays to understand differential expression of transcribed genes. Public databases such as ArrayExpress and the Gene Expression Omnibus (GEO) have made millions of samples readily available. One main drawback of microarray data analysis is the selection of probes to represent a specific transcript of interest, particularly because transcript-specific knowledge (notably alternative splicing) is dynamic in nature. Results: We therefore developed a framework for reannotating and reassigning probe groups for Affymetrix® GeneChip® technology based on functional regions of interest. This framework addresses three issues of Affymetrix® GeneChip® data analyses: removing nonspecific probes, updating probe target mapping based on the latest genome knowledge, and grouping probes into gene, transcript and region-based (UTR, individual exon, CDS) probe sets. Updated gene and transcript probe sets provide more specific analysis results based on current genomic and transcriptomic knowledge. The framework selects unique probes, aligns them to gene annotations and generates a custom Chip Description File (CDF). The analysis reveals that only 87% of the Affymetrix® GeneChip® HG-U133 Plus 2 probes uniquely align to the current hg38 human assembly without mismatches. We also tested the new mappings on publicly available data series, using rat and human data from GSE48611 and GSE72551 obtained from GEO, and illustrate that functional grouping allows for the subtle detection of regions of interest likely to have phenotypical consequences. Conclusion: Through reanalysis of the publicly available data series GSE48611 and GSE72551, we profiled the contribution of UTR and CDS regions to gene expression levels globally. The comparison between region-based and gene-based results indicated that the expressed genes detected by gene-based and region-based CDFs are highly consistent, and that region-based results allow us to detect changes in transcript formation.
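The probe selection and grouping steps described in this abstract can be pictured with a minimal Python sketch. It is not the authors' implementation: the tuple layout of `alignments`, the `region_of` lookup, and both function names are assumptions made for illustration, and "unique" is read here as "exactly one alignment with zero mismatches".

```python
from collections import defaultdict


def unique_probes(alignments):
    """Keep probes that have exactly one perfect-match alignment.

    `alignments` is an iterable of (probe_id, locus, mismatches) tuples
    (hypothetical layout). Alignments with mismatches are ignored.
    """
    perfect_hits = defaultdict(list)
    for probe_id, locus, mismatches in alignments:
        if mismatches == 0:
            perfect_hits[probe_id].append(locus)
    return {p: loci[0] for p, loci in perfect_hits.items() if len(loci) == 1}


def group_by_region(probe_loci, region_of):
    """Group unique probes into region-based probe sets.

    `region_of` maps a locus to a region label such as "GENE1:CDS"
    (hypothetical annotation lookup). Unannotated probes are dropped.
    """
    probe_sets = defaultdict(list)
    for probe_id, locus in probe_loci.items():
        region = region_of.get(locus)
        if region is not None:
            probe_sets[region].append(probe_id)
    return {region: sorted(probes) for region, probes in probe_sets.items()}
```

A real pipeline would then write the resulting probe sets to a custom CDF; that step is format-specific and is left out of this sketch.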

2014 ◽  
Vol 2014 ◽  
pp. 1-8
Author(s):  
Tzu-Hao Chang ◽  
Shih-Lin Wu ◽  
Wei-Jen Wang ◽  
Jorng-Tzong Horng ◽  
Cheng-Wei Chang

Microarrays are widely used to assess gene expressions. Most microarray studies focus primarily on identifying differential gene expressions between conditions (e.g., cancer versus normal cells), for discovering the major factors that cause diseases. Because previous studies have not identified the correlations of differential gene expression between conditions, crucial but abnormal regulations that cause diseases might have been disregarded. This paper proposes an approach for discovering the condition-specific correlations of gene expressions within biological pathways. Because analyzing gene expression correlations is time consuming, an Apache Hadoop cloud computing platform was implemented. Three microarray data sets of breast cancer were collected from the Gene Expression Omnibus, and pathway information from the Kyoto Encyclopedia of Genes and Genomes was applied for discovering meaningful biological correlations. The results showed that adopting the Hadoop platform considerably decreased the computation time. Several correlations of differential gene expressions were discovered between the relapse and nonrelapse breast cancer samples, and most of them were involved in cancer regulation and cancer-related pathways. The results showed that breast cancer recurrence might be highly associated with the abnormal regulations of these gene pairs, rather than with their individual expression levels. The proposed method was computationally efficient and reliable, and stable results were obtained when different data sets were used. The proposed method is effective in identifying meaningful biological regulation patterns between conditions.
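The idea of a condition-specific correlation can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain Pearson correlation and a simple difference between conditions, which may not be the statistic used in the paper, and the function names and the dictionary layout (gene symbol to list of expression values) are hypothetical. The Hadoop parallelisation is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_shift(expr_a, expr_b, gene1, gene2):
    """Difference in the gene1/gene2 correlation between two conditions.

    `expr_a` and `expr_b` map gene symbols to expression values measured
    in condition A and condition B respectively (hypothetical layout).
    """
    r_a = pearson(expr_a[gene1], expr_a[gene2])
    r_b = pearson(expr_b[gene1], expr_b[gene2])
    return r_a - r_b
```

A gene pair whose individual expression levels are similar in both conditions can still show a large shift, which is the kind of signal the abstract argues is missed by per-gene differential expression.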


2017 ◽  
Author(s):  
Djordje Djordjevic ◽  
Joshua Y. S. Tang ◽  
Yun Xin Chen ◽  
Shu Lun Shannon Kwan ◽  
Raymond W. K. Ling ◽  
...  

Abstract There exist over 2.5 million publicly available gene expression samples across 101,000 data series in NCBI’s Gene Expression Omnibus (GEO) database. Because GEO’s free-text metadata does not use standardised ontology terms to annotate the experimental type and sample type, this database remains difficult to harness computationally without significant manual intervention. In this work, we present an interactive R/Shiny tool called GEOracle that utilises text mining and machine learning techniques to automatically identify perturbation experiments, group treatment and control samples, and perform differential expression analysis. We present applications of GEOracle to discover conserved signalling pathway target genes and identify an organ-specific gene regulatory network. GEOracle is effective in discovering perturbation gene targets in GEO by harnessing its free-text metadata. Its effectiveness and applicability have been demonstrated by cross-validation and two real-life case studies. It opens up new avenues to unlock the gene regulatory information embedded inside large biological databases such as GEO. GEOracle is available at https://github.com/VCCRI/GEOracle.


2020 ◽  
Author(s):  
Zichen Jiao ◽  
Ao Yu ◽  
Xiaofeng He ◽  
Yulong Xuan ◽  
He Zhang ◽  
...  

Abstract Objective: MiRNAs are considered crucial for the initiation and development of NSCLC, and many have been identified in NSCLC. However, the role of miR-126 in NSCLC has not been fully explained. Methods: miR-126 expression in NSCLC was evaluated by analyzing common data sets in the Gene Expression Omnibus (GEO) database and reviewing previously published papers. Three mRNA data sets from GEO, GSE18842, GSE19804 and GSE101929, were used to identify the differentially expressed genes (DEGs). We predicted the target genes of hsa-miR-126-5p using TargetScan and analyzed the overlap between the target genes of miR-126 and the DEGs in NSCLC. Subsequently, we analyzed Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. We used STRING and Cytoscape to construct a protein-protein interaction (PPI) network, and analyzed the influence of hub genes on the prognosis of NSCLC. Results: A common pattern of miR-126 downregulation in NSCLC was identified in the literature review. A total of 187 DEGs were identified that were both NSCLC-related and miR-126-related. Many DEGs are extensively enriched in cell membranes, signal receptor binding, and biological regulation. Among the 10 main hub genes identified by PPI analysis, 4 (NCAP-G, MELK, KIAA0101, TPX2) were clearly associated with poor prognosis in NSCLC patients: when these genes were highly expressed, the survival rate of NSCLC patients was low. Furthermore, through network analysis we identified potential miR-126-related genes that may be involved in NSCLC, such as TPX2, HMMR, and ANLN. Conclusion: This study suggests that miR-126 is crucial for the biological processes of NSCLC.
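The overlap step in the Methods (predicted miRNA targets intersected with differentially expressed genes) amounts to a set intersection. The sketch below assumes both inputs are plain lists of gene symbols already exported from the upstream tools; the function name is hypothetical and no TargetScan or GEO API is called.

```python
def overlapping_genes(predicted_targets, degs):
    """Return genes that are both predicted targets and DEGs, sorted by symbol.

    Both arguments are iterables of gene symbols (assumed to use the
    same naming convention and letter case).
    """
    return sorted(set(predicted_targets) & set(degs))
```

In practice, symbols from different sources often need harmonising (aliases, letter case) before the intersection is meaningful.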


2020 ◽  
Vol 9 (25) ◽  
Author(s):  
Kevin S. Myers ◽  
Michael Place ◽  
Daniel R. Noguera ◽  
Timothy J. Donohue

ABSTRACT We introduce COnTORT (COmprehensive Transcriptomic ORganizational Tool), a publicly available program that retrieves all available gene expression data and associated metadata for an organism from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database. The data are compiled into text files that can be used for downstream bioinformatic applications.


2020 ◽  
Vol 4 (20) ◽  
pp. 5322-5335
Author(s):  
Ali Nehme ◽  
Hassan Dakik ◽  
Frédéric Picou ◽  
Meyling Cheok ◽  
Claude Preudhomme ◽  
...  

Abstract Advances in transcriptomics have improved our understanding of leukemic development and helped to enhance the stratification of patients. The tendency of transcriptomic studies to combine AML samples, regardless of cytogenetic abnormalities, could lead to bias in differential gene expression analysis because of the differential representation of AML subgroups. Hence, we performed a horizontal meta-analysis that integrated transcriptomic data on AML from multiple studies, to enrich the less frequent cytogenetic subgroups and to uncover common genes involved in the development of AML and response to therapy. A total of 28 Affymetrix microarray data sets containing 3940 AML samples were downloaded from the Gene Expression Omnibus database. After stringent quality control, transcriptomic data on 1534 samples from 11 data sets, covering 10 AML cytogenetically defined subgroups, were retained and merged with the data on 198 healthy bone marrow samples. Differentially expressed genes between each cytogenetic subgroup and normal samples were extracted, enabling the unbiased identification of 330 commonly deregulated genes (CODEGs), which showed enriched profiles of myeloid differentiation, leukemic stem cell status, and relapse. Most of these genes were downregulated, in accordance with DNA hypermethylation. CODEGs were then used to create a prognostic score based on the weighted sum of expression of 22 core genes (CODEG22). The score was validated with microarray data of 5 independent cohorts and by quantitative real time-polymerase chain reaction in a cohort of 142 samples. CODEG22-based stratification of patients, globally and into subpopulations of cytologically healthy and elderly individuals, may complement the European LeukemiaNet classification, for a more accurate prediction of AML outcomes.
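A prognostic score defined as a weighted sum of gene expression values, as described for CODEG22, has the general form sketched below. The gene names and weights passed in are placeholders: the abstract does not list the 22 core genes or their coefficients, so nothing here reproduces the published score.

```python
def weighted_expression_score(expression, weights):
    """Weighted sum of expression values over the genes in `weights`.

    `expression` maps gene symbol -> expression value for one sample;
    `weights` maps gene symbol -> coefficient (hypothetical values).
    Raises KeyError if a weighted gene is missing from the sample.
    """
    missing = [gene for gene in weights if gene not in expression]
    if missing:
        raise KeyError(f"expression missing for: {', '.join(sorted(missing))}")
    return sum(weight * expression[gene] for gene, weight in weights.items())
```

Patients would then be stratified by comparing the score with a cut-off; how that cut-off is chosen is not stated in the abstract and is not sketched here.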


4open ◽  
2018 ◽  
Vol 1 ◽  
pp. 4
Author(s):  
Bibhu Prasad Parida ◽  
Biswapriya Biswavas Misra ◽  
Amarendra Narayan Misra

Introduction: Aging is a complex biological process that brings about a gradual decline of physiological and metabolic machineries as a result of maturity. Also, aging is irreversible and leads ultimately to death in biological organisms. Methods: We intend to characterize aging at the gene expression level using publicly available human gene expression arrays obtained from the Gene Expression Omnibus (GEO) and ArrayExpress. Candidate genes were identified by rigorous screening using filtered data sets, i.e., GSE11882, GSE47881, and GSE32719. Using the Aroma and Limma packages, we selected the top 200 genes showing up- and down-regulation (p < 0.05 and fold change > 2.5), out of which 185 were chosen for further comparative analysis. Results: This investigation enabled identification of candidate genes involved in aging that are associated with several signaling cascades demonstrating strong correlation with ATP binding and protease functions. Conclusion: A majority of the proteins encoded by these genes function extracellularly, and they also provide insights into the immunopathological basis of aging.
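The selection rule in the Methods (p < 0.05, fold change > 2.5, top 200 genes) can be illustrated with a short Python sketch. It makes two assumptions that the abstract does not spell out: fold change is a positive ratio, and a down-regulated gene qualifies when the inverse ratio exceeds the threshold. The function name and tuple layout are hypothetical, and this is not the Aroma/Limma workflow itself.

```python
def select_top_genes(records, p_cutoff=0.05, fc_cutoff=2.5, top_n=200):
    """Filter by p-value and fold change, then keep the strongest changes.

    `records` is an iterable of (gene, fold_change, p_value) tuples, with
    fold_change a positive ratio (assumption). The magnitude of change is
    max(fc, 1/fc), so up- and down-regulation are treated symmetrically.
    """
    passing = []
    for gene, fold_change, p_value in records:
        magnitude = max(fold_change, 1.0 / fold_change)
        if p_value < p_cutoff and magnitude > fc_cutoff:
            passing.append((magnitude, gene))
    # Largest magnitude first; ties broken alphabetically by gene symbol.
    passing.sort(key=lambda item: (-item[0], item[1]))
    return [gene for _, gene in passing[:top_n]]
```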


2021 ◽  
Author(s):  
Bincheng Ren ◽  
Kaini He ◽  
Miao Yuan ◽  
Yu Wang ◽  
Yuanyuan Tie ◽  
...  

Abstract Background: The pathogenic mechanism and development of diabetic cardiomyopathy (DCM) have been generally explained, and it is clear that microRNAs (miRNAs), mRNAs and transcription factors (TFs) participate in the DCM disease process. Yet the hub targets of disease progression are not clear. Methods: To address this, we downloaded data sets from the Gene Expression Omnibus (GEO) database (GSE44179 and GSE4745). The target mRNAs of the miRNAs were downloaded from the TargetScan, miRDB and microT-CDS databases. Gene Ontology (GO) enrichment of miRNAs and mRNAs was analysed in DAVID. RStudio software was used to visualize the results of screened targets and GO enrichment. Cytoscape software was used to visualize the miRNA-mRNA-TF interaction network and calculate the hub targets. Results: We filtered eight miRNAs, nine mRNAs and ten TFs by bioinformatics analysis, and constructed a miRNA-mRNA-TF network. The ten nodes with the highest degree in the network are rno-miR-7a, Hnf4a, rno-miR-17, rno-miR-21, rno-miR-122, rno-miR-200c, Med1, Mlxipl, SP1 and rno-miR-34a, which were closely related to the process of DCM. Conclusion: This study revealed that rno-miR-7a, Hnf4a, rno-miR-17 and rno-miR-21 may play a vital role in the progression of diabetic cardiomyopathy.


2019 ◽  
Author(s):  
Gregory R. Gershkowitz ◽  
Zachary B. Abrams ◽  
Caitlin E. Coombes ◽  
Kevin R. Coombes

Abstract Background: Researchers commonly use online tools such as ToppGene to conduct enrichment analyses on gene expression data. This process does not easily allow multiple gene data sets to be analyzed and compared at once. ToppGene requires the user to manually enter gene symbols or other gene identifiers into a text box and to manually sift through forms with many adjustable parameters in order to obtain a downloadable text file of results. This process makes the analysis of multiple sets of genes tedious, time-consuming, and error-prone. To address this problem, we developed Malachite, a Python package that enables researchers to perform gene enrichment analyses on multiple gene lists and concatenate the resulting enrichment statistics. In this way, Malachite enables meta-enrichment analyses across multiple data sets. Results: To illustrate its use, we applied Malachite to three data sets from the Gene Expression Omnibus comparing gene expression in the large airways of smokers and non-smokers. Biological processes enriched in all three data sets were related to xenobiotic stimulus; molecular functions typically involved nicotinamide adenine dinucleotide phosphate (NADP) activity. Conclusion: Malachite enables researchers to automate gene enrichment meta-analyses using ToppGene. Malachite also enhances ToppGene’s gene set analysis of drug-gene relationships by further filtering for FDA-approved drugs.
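The concatenation of enrichment statistics across gene lists can be pictured with the sketch below. It does not use Malachite's or ToppGene's actual interface, which the abstract does not describe; it assumes the per-data-set enrichment results are already available as lists of (term, p-value) pairs, and both function names are hypothetical.

```python
def combine_enrichment(results_by_dataset):
    """Stack per-data-set enrichment rows into one long table.

    `results_by_dataset` maps a data set label to a list of
    (term, p_value) pairs (hypothetical, pre-computed results).
    """
    combined = []
    for dataset, rows in results_by_dataset.items():
        for term, p_value in rows:
            combined.append({"dataset": dataset, "term": term, "p_value": p_value})
    return combined


def terms_in_all(combined, datasets):
    """Return the terms that appear in every data set listed in `datasets`."""
    wanted = set(datasets)
    seen = {}
    for row in combined:
        seen.setdefault(row["term"], set()).add(row["dataset"])
    return sorted(term for term, found in seen.items() if wanted <= found)
```

Passing the data set labels explicitly to `terms_in_all` means a data set with no enriched terms still counts, so no term is reported as shared in that case.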


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Matthew N. Bernstein ◽  
Zijian Ni ◽  
Michael Collins ◽  
Mark E. Burkard ◽  
Christina Kendziorski ◽  
...  

Abstract Background Single-cell RNA-seq (scRNA-seq) enables the profiling of genome-wide gene expression at the single-cell level and in so doing facilitates insight into and information about cellular heterogeneity within a tissue. This is especially important in cancer, where tumor and tumor microenvironment heterogeneity directly impact development, maintenance, and progression of disease. While publicly available scRNA-seq cancer data sets offer unprecedented opportunity to better understand the mechanisms underlying tumor progression, metastasis, drug resistance, and immune evasion, much of the available information has been underutilized, in part, due to the lack of tools available for aggregating and analysing these data. Results We present CHARacterizing Tumor Subpopulations (CHARTS), a web application for exploring publicly available scRNA-seq cancer data sets in the NCBI’s Gene Expression Omnibus. More specifically, CHARTS enables the exploration of individual gene expression, cell type, malignancy-status, differentially expressed genes, and gene set enrichment results in subpopulations of cells across tumors and data sets. Along with the web application, we also make available the backend computational pipeline that was used to produce the analyses that are available for exploration in the web application. Conclusion CHARTS is an easy to use, comprehensive platform for exploring single-cell subpopulations within tumors across the ever-growing collection of public scRNA-seq cancer data sets. CHARTS is freely available at charts.morgridge.org.

