scholarly journals Analyses of cancer data in the Genomic Data Commons Data Portal with new functionalities in the TCGAbiolinks R/Bioconductor package

2018 ◽  
Author(s):  
Mohamed Mounir ◽  
Tiago C. Silva ◽  
Marta Lucchetta ◽  
Catharina Olsen ◽  
Gianluca Bontempi ◽  
...  

ABSTRACTThe advent of Next Generation Sequencing (NGS) technologies has opened new perspectives in deciphering the genetic mechanisms underlying complex diseases. Nowadays, the amount of genomic data is massive and substantial efforts and new tools are required to unveil the information hidden in the data.The Genomic Data Commons (GDC) Data Portal is a large data collection platform that includes different genomic studies included the ones from The Cancer Genome Atlas (TCGA) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiatives, accounting for more than 40 tumor types originating from nearly 30000 patients. Such platforms, although very attractive, must make sure the stored data are easily accessible and adequately harmonized. Moreover, they have the primary focus on the data storage in a unique place, and they do not provide a comprehensive toolkit for analyses and interpretation of the data. To fulfill this urgent need, comprehensive but easily accessible computational methods for integrative analyses of genomic data without renouncing a robust statistical and theoretical framework are needed. In this context, the R/Bioconductor package TCGAbiolinks was developed, offering a variety of bioinformatics functionalities. Here we introduce new features and enhancements of TCGAbiolinks in terms of i) more accurate and flexible pipelines for differential expression analyses, ii) different methods for tumor purity estimation and filtering, iii) integration of normal samples from the Genotype-Tissue-Expression (GTEx) platform iv) support for other genomics datasets, here exemplified by the TARGET data.Evidence has shown that accounting for tumor purity is essential in the study of tumorigenesis, as these factors promote confounding behavior regarding differential expression analysis. Henceforth, we implemented these filtering procedures in TCGAbiolinks. Moreover, a limitation of some of the TCGA datasets is the unavailability or paucity of corresponding normal samples. We thus integrated into TCGAbiolinks the possibility to use normal samples from the Genotype-Tissue Expression (GTEx) project, which is another large-scale repository cataloging gene expression from healthy individuals. The new functionalities are available in the TCGABiolinks v 2.8 and higher released in Bioconductor version 3.7.

2021 ◽  
Author(s):  
Florian Jeanneret ◽  
Stephane Gazut

The advent of high-throughput techniques has greatly enhanced biological discovery. Last years, analysis of multi-omics data has taken the front seat to improve physiological understanding. Handling functional enrichment results from various biological data raises practical questions. We propose an integrative workflow to better interpret biological process insights in a multi-omics approach applied to breast cancer data from The Cancer Genome Atlas (TCGA) related to Invasive Ductal Carcinoma (IDC) and Invasive Lobular Carcinoma (ILC). Pathway enrichment by Over Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA) has been conducted with both features information from differential expression analysis or selected features from multi-block sPLS-DA methods. Then, comprehensive comparisons of enrichment results have been carried out by looking at classical enrichment analysis, probabilities pooling by Stouffer's Z scores method and pathways clustering in biological themes. Our work shows that ORA enrichment with selected sPLS-DA features and pathways probabilities pooling by Stouffer's method lead to enrichment maps highly associated to physiological knowledge of IDC or ILC phenotypes, better than ORA and GSEA with differential expression driven features.


Author(s):  
Tiara Bunga Mayang Permata ◽  
Sri Mutya Sekarutami ◽  
Endang Nuryadi ◽  
Angela Giselvania ◽  
Soehartati Gondhowiardjo

In the current big data era, massive genomic cancer data are available for open access from anywhere in the world. They are obtained from popular platforms, such as The Cancer Genome Atlas, which provides genetic information from clinical samples, and Cancer Cell Line Encyclopedia, which offers genomic data of cancer cell lines. For convenient analysis, user-friendly tools, such as the Tumor Immune Estimation Resource (TIMER), which can be used to analyze tumor-infiltrating immune cells comprehensively, are also emerging. In clinical practice, clinical sequencing has been recommended for patients with cancer in many countries. Despite its many challenges, it enables the application of precision medicine, especially in medical oncology. In this review, several efforts devoted to accomplishing precision oncology and applying big data for use in Indonesia are discussed. Utilizing open access genomic data in writing research articles is also described.


2015 ◽  
Author(s):  
Leonardo Collado-Torres ◽  
Abhinav Nellore ◽  
Alyssa C. Frazee ◽  
Christopher Wilks ◽  
Michael I. Love ◽  
...  

AbstractBackgroundDifferential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder that seeks to identify contiguous regions of the genome showing differential expression signal at single base resolution without relying on existing annotation or potentially inaccurate transcript assembly.ResultsWe present the derfinder software that improves our annotation-agnostic approach to RNA-seq analysis by: (1) implementing a computationally efficient bump-hunting approach to identify DERs which permits genome-scale analyses in a large number of samples, (2) introducing a flexible statistical modeling framework, including multi-group and time-course analyses and (3) introducing a new set of data visualizations for expressed region analysis. We apply this approach to public RNA-seq data from the Genotype-Tissue Expression (GTEx) project and BrainSpan project to show that derfinder permits the analysis of hundreds of samples at base resolution in R, identifies expression outside of known gene boundaries and can be used to visualize expressed regions at base-resolution. In simulations our base resolution approaches enable discovery in the presence of incomplete annotation and is nearly as powerful as feature-level methods when the annotation is complete.Conclusionsderfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis.The package is available from Bioconductor at www.bioconductor.org/packages/derfinder.


2019 ◽  
Author(s):  
Zhenyu Zhang ◽  
Kyle Hernandez ◽  
Jeremiah Savage ◽  
Shenglai Li ◽  
Dan Miller ◽  
...  

AbstractThe goal of the National Cancer Institute (NCI) Genomic Data Commons (GDC) is to provide the cancer research community with a data repository of uniformly processed genomic and associated clinical data that enables data sharing and collaborative analysis in the support of precision medicine. The initial GDC dataset include genomic, epigenomic, proteomic, clinical and other data from the NCI TCGA and TARGET programs. Data production for the GDC started in June, 2015 using an OpenStack-based private cloud. By June of 2016, the GDC had analyzed more than 50,000 raw sequencing data inputs, as well as multiple other data types. Using the latest human genome reference build GRCh38, the GDC generated a variety of data types from aligned reads to somatic mutations, gene expression, miRNA expression, DNA methylation status, and copy number variation. In this paper, we describe the pipelines and workflows used to process and harmonize the data in the GDC. The generated data, as well as the original input files from TCGA and TARGET, are available for download and exploratory analysis at the GDC Data Portal and Legacy Archive (https://gdc.cancer.gov/).


2019 ◽  
Author(s):  
Bowen Liu ◽  
Xiaofei Yang ◽  
Tingjie Wang ◽  
Jiadong Lin ◽  
Yongyong Kang ◽  
...  

Abstract Motivation Tumor purity is a fundamental property of each cancer sample and affects downstream investigations. Current tumor purity estimation methods either require matched normal sample or report moderately high tumor purity even on normal samples. It is critical to develop a novel computational approach to estimate tumor purity with sufficient precision based on tumor-only sample. Results In this study, we developed MEpurity, a beta mixture model-based algorithm, to estimate the tumor purity based on tumor-only Illumina Infinium 450k methylation microarray data. We applied MEpurity to both The Cancer Genome Atlas (TCGA) cancer data and cancer cell line data, demonstrating that MEpurity reports low tumor purity on normal samples and comparable results on tumor samples with other state-of-art methods. Availability and implementation MEpurity is a C++ program which is available at https://github.com/xjtu-omics/MEpurity. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Melvyn Yap ◽  
Rebecca L. Johnston ◽  
Helena Foley ◽  
Samual MacDonald ◽  
Olga Kondrashova ◽  
...  

AbstractFor complex machine learning (ML) algorithms to gain widespread acceptance in decision making, we must be able to identify the features driving the predictions. Explainability models allow transparency of ML algorithms, however their reliability within high-dimensional data is unclear. To test the reliability of the explainability model SHapley Additive exPlanations (SHAP), we developed a convolutional neural network to predict tissue classification from Genotype-Tissue Expression (GTEx) RNA-seq data representing 16,651 samples from 47 tissues. Our classifier achieved an average F1 score of 96.1% on held-out GTEx samples. Using SHAP values, we identified the 2423 most discriminatory genes, of which 98.6% were also identified by differential expression analysis across all tissues. The SHAP genes reflected expected biological processes involved in tissue differentiation and function. Moreover, SHAP genes clustered tissue types with superior performance when compared to all genes, genes detected by differential expression analysis, or random genes. We demonstrate the utility and reliability of SHAP to explain a deep learning model and highlight the strengths of applying ML to transcriptome data.


2018 ◽  
Author(s):  
Paul F. Harrison ◽  
Andrew D. Pattison ◽  
David R. Powell ◽  
Traude H. Beilharz

AbstractBackgroundA differential gene expression analysis may produce a set of significantly differentially expressed genes that is too large to easily investigate, so that a means of ranking genes by their biological interest level is desirable. The life-sciences have grappled with the abuse of p-values to rank genes for this purpose. As an alternative, a lower confidence bound on the magnitude of Log Fold Change (LFC) could be used to rank genes, but it has been unclear how to reconcile this with the need to perform False Discovery Rate (FDR) correction. The TREAT test of McCarthy and Smyth is a step in this direction, finding genes significantly exceeding a specified LFC threshold. Here we describe the use of test inversion on TREAT to present genes ranked by a confidence bound on the LFC, while still controlling FDR.ResultsTesting the Topconfects R package with simulated gene expression data shows the method outperforming current statistical approaches across a wide range of experiment sizes in the identification of genes with largest LFCs. Applying the method to a TCGA breast cancer data-set shows the method ranks some genes with large LFC higher than would traditional ranking by p-value. Importantly these two ranking methods lead to a different biological emphasis, in terms both of specific highly ranked genes and gene-set enrichment.ConclusionsThe choice of ranking method in differential expression analysis can affect the biological interpretation. The common default of ranking by p-value is implicitly by an effect size in which each gene is standardized to its own variability, rather than comparing genes on a common scale, which may not be appropriate. The Topconfects approach of presenting genes ranked by confident LFC effect size is a variation on the TREAT method with improved usability, removing the need to fine-tune a threshold parameter and removing the temptation to abuse p-values as a de-facto effect size.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Zhenyu Zhang ◽  
Kyle Hernandez ◽  
Jeremiah Savage ◽  
Shenglai Li ◽  
Dan Miller ◽  
...  

AbstractThe goal of the National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is to provide the cancer research community with a data repository of uniformly processed genomic and associated clinical data that enables data sharing and collaborative analysis in the support of precision medicine. The initial GDC dataset include genomic, epigenomic, proteomic, clinical and other data from the NCI TCGA and TARGET programs. Data production for the GDC started in June, 2015 using an OpenStack-based private cloud. By June of 2016, the GDC had analyzed more than 50,000 raw sequencing data inputs, as well as multiple other data types. Using the latest human genome reference build GRCh38, the GDC generated a variety of data types from aligned reads to somatic mutations, gene expression, miRNA expression, DNA methylation status, and copy number variation. In this paper, we describe the pipelines and workflows used to process and harmonize the data in the GDC. The generated data, as well as the original input files from TCGA and TARGET, are available for download and exploratory analysis at the GDC Data Portal and Legacy Archive (https://gdc.cancer.gov/).


Sign in / Sign up

Export Citation Format

Share Document