Epiviz File Server: Query, Transform and Interactively Explore Data from Indexed Genomic Files

10.1101/865295 ◽

2019 ◽

Author(s):

Jayaram Kancherla ◽

Yifan Yang ◽

Hyeyun Chae ◽

Hector Corrada Bravo

Keyword(s):

Data Analysis ◽

Genomic Data ◽

Public Access ◽

The Cancer Genome Atlas ◽

File Server ◽

Data Repositories ◽

In Situ Data ◽

Dna Elements ◽

Cancer Genome Atlas ◽

Abstract Interface

Abstract. Genomic data repositories such as The Cancer Genome Atlas (TCGA), the Encyclopedia of DNA Elements (ENCODE), and Bioconductor’s AnnotationHub and ExperimentHub provide public access to large amounts of genomic data as flat files. Researchers often download a subset of data files from these repositories to perform their data analysis. As these data repositories become larger, researchers often face bottlenecks in their exploratory data analysis. Based on the concepts of a NoDB paradigm, we developed epivizFileServer, a Python library that implements an in-situ data query system for local or remotely hosted indexed genomic files, not only for visualization but also for data manipulation. The File Server library decouples data from analysis workflows and provides an abstract interface to define computations independent of the location, format or structure of the file.

Epiviz File Server: Query, transform and interactively explore data from indexed genomic files

Bioinformatics ◽

10.1093/bioinformatics/btaa591 ◽

2020 ◽

Vol 36 (18) ◽

pp. 4682-4690 ◽

Cited By ~ 1

Author(s):

Jayaram Kancherla ◽

Yifan Yang ◽

Hyeyun Chae ◽

Hector Corrada Bravo

Keyword(s):

Genomic Data ◽

Data Retrieval ◽

Public Access ◽

The Cancer Genome Atlas ◽

File Server ◽

Data Repositories ◽

In Situ Data ◽

Dna Elements ◽

Data Files ◽

Abstract Interface

Abstract. Motivation: Genomic data repositories such as The Cancer Genome Atlas, the Encyclopedia of DNA Elements, and Bioconductor’s AnnotationHub and ExperimentHub provide public access to large amounts of genomic data as flat files. Researchers often download a subset of data files from these repositories to perform exploratory data analysis. We developed Epiviz File Server, a Python library that implements an in situ data query system for local or remotely hosted indexed genomic files, not only for visualization but also for data transformation. The File Server library decouples data retrieval and transformation from specific visualization and analysis tools and provides an abstract interface to define computations independent of the location, format or structure of the file. We demonstrate the File Server in two use cases: (i) integration with Galaxy workflows and (ii) using Epiviz to create a custom genome browser from the Epigenome Roadmap dataset. Availability and implementation: Epiviz File Server is open source and is available on GitHub at http://github.com/epiviz/epivizFileServer. The documentation for the File Server library is available at http://epivizfileserver.rtfd.io.
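
The idea of defining a computation independently of where or how a file is stored can be illustrated with a small sketch. This is not the Epiviz File Server API: the class and function names (`GenomicSource`, `InMemorySource`, `mean_signal`) and the toy records are hypothetical, chosen only to show a computation written against an abstract region-query interface rather than a concrete file format.

```python
from abc import ABC, abstractmethod
from statistics import mean
from typing import Dict, List, Optional, Tuple

Record = Tuple[int, int, float]  # (start, end, value); hypothetical record shape


class GenomicSource(ABC):
    """Hypothetical interface standing in for any indexed file, local or remote."""

    @abstractmethod
    def query(self, chrom: str, start: int, end: int) -> List[Record]:
        """Return records overlapping the half-open interval [start, end)."""


class InMemorySource(GenomicSource):
    """Toy backend; a real backend would read an index and fetch byte ranges."""

    def __init__(self, records: Dict[str, List[Record]]) -> None:
        self._records = records

    def query(self, chrom: str, start: int, end: int) -> List[Record]:
        # Keep records whose interval overlaps the requested region.
        return [r for r in self._records.get(chrom, []) if r[0] < end and r[1] > start]


def mean_signal(source: GenomicSource, chrom: str, start: int, end: int) -> Optional[float]:
    """A computation that depends only on the interface, not on the backend."""
    hits = source.query(chrom, start, end)
    return mean(value for _, _, value in hits) if hits else None


source = InMemorySource({"chr1": [(0, 100, 2.0), (100, 200, 4.0), (200, 300, 6.0)]})
print(mean_signal(source, "chr1", 50, 150))
```

Because `mean_signal` only calls `query`, swapping the in-memory backend for one that reads a remote indexed file would not change the computation itself.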

Extending TCGA queries to automatically identify analogous genomic data from dbGaP

F1000Research ◽

10.12688/f1000research.9837.1 ◽

2017 ◽

Vol 6 ◽

pp. 319

Author(s):

Erin K. Wagner ◽

Satyajeet Raje ◽

Liz Amos ◽

Jessica Kurata ◽

Abhijit S. Badve ◽

...

Keyword(s):

Genomic Data ◽

The Cancer Genome Atlas ◽

Genomic Research ◽

Reproducible Research ◽

Software Pipeline ◽

Individual Level ◽

Related Data ◽

Cancer Genome Atlas ◽

Existing Data ◽

Genome Atlas

Data sharing is critical to advancing genomic research: reusing and combining existing data reduces the demand to collect new data, and sharing promotes reproducible research. The Cancer Genome Atlas (TCGA) is a popular resource for individual-level genotype-phenotype cancer-related data. The Database of Genotypes and Phenotypes (dbGaP) contains many datasets similar to those in TCGA. We have created a software pipeline that will allow researchers to discover relevant genomic data from dbGaP, based on matching TCGA metadata. The resulting research provides an easy-to-use tool to connect these two data sources.

Genomic Common Data Model for Seamless Interoperation of Biomedical Data in Clinical Practice: Retrospective Study (Preprint)

10.2196/preprints.13249 ◽

2018 ◽

Author(s):

Seo Jeong Shin ◽

Seng Chan You ◽

Yu Rang Park ◽

Jin Roh ◽

Jang-Hee Kim ◽

...

Keyword(s):

Clinical Practice ◽

Human Genome ◽

Genomic Data ◽

Cancer Genome ◽

The Cancer Genome Atlas ◽

Common Data Model ◽

Sequencing Data ◽

School Of Medicine ◽

Cancer Genome Atlas ◽

Genome Atlas

BACKGROUND: Clinical sequencing data should be shared in order to achieve the sufficient scale and diversity required to provide strong evidence for improving patient care. A distributed research network allows researchers to share this evidence rather than the patient-level data across centers, thereby avoiding privacy issues. The Observational Medical Outcomes Partnership (OMOP) common data model (CDM) used in distributed research networks has low coverage of sequencing data and does not reflect the latest trends of precision medicine. OBJECTIVE: The aim of this study was to develop and evaluate the feasibility of a genomic CDM (G-CDM), as an extension of the OMOP-CDM, for application of genomic data in clinical practice. METHODS: Existing genomic data models and sequencing reports were reviewed to extend the OMOP-CDM to cover genomic data. The Human Genome Organisation Gene Nomenclature Committee and Human Genome Variation Society nomenclature were adopted to standardize the terminology in the model. Sequencing data of 114 and 1060 patients with lung cancer were obtained from the Ajou University School of Medicine database of Ajou University Hospital and The Cancer Genome Atlas, respectively, which were transformed to a format appropriate for the G-CDM. The data were compared with respect to gene name, variant type, and actionable mutations. RESULTS: The G-CDM was extended into four tables linked to tables of the OMOP-CDM. Upon comparison with The Cancer Genome Atlas data, a clinically actionable mutation, p.Leu858Arg, in the EGFR gene was 6.64 times more frequent in the Ajou University School of Medicine database, while the p.Gly12Xaa mutation in the KRAS gene was 2.02 times more frequent in The Cancer Genome Atlas dataset. The data-exploring tool GeneProfiler was further developed to conduct descriptive analyses automatically using the G-CDM, which provides the proportions of genes, variant types, and actionable mutations. GeneProfiler also allows for querying the specific gene name and Human Genome Variation Society nomenclature to calculate the proportion of patients with a given mutation. CONCLUSIONS: We developed the G-CDM for effective integration of genomic data with standardized clinical data, allowing for data sharing across institutes. The feasibility of the G-CDM was validated by assessing the differences in data characteristics between two different genomic databases through the proposed data-exploring tool GeneProfiler. The G-CDM may facilitate analyses of interoperating clinical and genomic datasets across multiple institutions, minimizing privacy issues and enabling researchers to better understand the characteristics of patients and promote personalized medicine in clinical practice.

Abstract 236: Identification of novel cancer target genes by combining data from the cancer genome-wide association studies (GWAS), regulatory DNA elements and The Cancer Genome Atlas (TCGA)

10.1158/1538-7445.am2018-236 ◽

2018 ◽

Author(s):

Diptee A. Kulkarni ◽

Karl Guo ◽

Junping Jing ◽

Mugdha Khaladkar ◽

Kijoung Song ◽

...

Keyword(s):

Target Genes ◽

Association Studies ◽

Cancer Genome ◽

The Cancer Genome Atlas ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Combining Data ◽

Dna Elements ◽

Cancer Genome Atlas ◽

Regulatory Dna

Exploring cancer genomic data from the cancer genome atlas project

BMB Reports ◽

10.5483/bmbrep.2016.49.11.145 ◽

2016 ◽

Vol 49 (11) ◽

pp. 607-611 ◽

Cited By ~ 25

Author(s):

Ju-Seog Lee

Keyword(s):

Genomic Data ◽

Cancer Genome ◽

The Cancer Genome Atlas ◽

Cancer Genome Atlas ◽

Genome Atlas

Expression of the embryonic morphogen Nodal in differentiated thyroid carcinomas: Immunohistochemistry assay in tissue microarray and The Cancer Genome Atlas data analysis

Surgery ◽

10.1016/j.surg.2014.08.050 ◽

2014 ◽

Vol 156 (6) ◽

pp. 1559-1568 ◽

Cited By ~ 3

Author(s):

Young Jun Chai ◽

Young A. Kim ◽

Hyeon-Gun Jee ◽

Jin Wook Yi ◽

Bo Gun Jang ◽

...

Keyword(s):

Data Analysis ◽

Tissue Microarray ◽

Cancer Genome ◽

The Cancer Genome Atlas ◽

Thyroid Carcinomas ◽

Atlas Data ◽

Cancer Genome Atlas ◽

Genome Atlas

A new unsupervised clustering algorithm applied to genome-wide profiles of breast cancers in The Cancer Genome Atlas proper subsets triple-negative samples.

Journal of Clinical Oncology ◽

10.1200/jco.2017.35.15_suppl.e23195 ◽

2017 ◽

Vol 35 (15_suppl) ◽

pp. e23195-e23195

Author(s):

Jason Mezey ◽

Steven Schwager ◽

Sushila Shenoy ◽

Jef Benbanaste ◽

Michael Elashoff ◽

...

Keyword(s):

Clustering Algorithm ◽

Genomic Data ◽

Cancer Genome ◽

Proper Subset ◽

The Cancer Genome Atlas ◽

Driver Mutations ◽

Genome Wide ◽

A Genome ◽

Cancer Genome Atlas ◽

Genome Atlas

Background: Clustering algorithms have identified subtypes of major cancers from analysis of genome-wide gene expression (GE) and somatic mutation (SM) profiles. These algorithms almost never discover a proper subset cluster, a recovered cluster that includes all the samples of a specific subtype. For breast cancer (BC), clustering of genome-wide profiles has been unable to proper subset triple negatives (TNs), TN subtypes, or other major subtypes. Methods: To search for a proper subset cluster for TNs, we applied a new clustering algorithm to the public domain GE and SM data of BC samples in The Cancer Genome Atlas (TCGA). A module of Medidata’s Clinical Trial Genomics (CTG) platform for automated clinical and genomic data integration and analysis, it uses a hierarchical component with tree learned cut points applied to a principal component dimension reduced similarity matrix calculated from a genome-wide data profile. Results: Our analysis of 540 TCGA BC samples run without human supervision produced a proper subset cluster that included all 55 TN samples and only 74 non-TN samples. GE data have previously indicated TN status, but this is the first demonstration that these TCGA BC data contain enough information to proper subset TNs, implying that this broad BC subtype has a strong, quantifiable impact on GE. We show that the genome-wide SMs of TCGA BC samples can be used to proper subset 4 novel subtypes distinguished as classes “TP53 mutated”, “PIK3CA mutated”, “both TP53 and PIK3CA mutated”, and “neither mutated”, signifying an important role for these known driver mutations in producing the subtypes’ genome-wide mutation profiles. We find that most (>80%) TN BCs are in “TP53 mutated” but only 1 TN sample (<2%) is in “PIK3CA mutated”, indicating distinct biology for these TNs with potential implications for TN therapy. Conclusions: CTG clustering achieves proper subset cancer subtype clustering of TCGA BC samples. These results illustrate the therapeutic discovery potential possible from genomic data of the high quality present in TCGA if combined with detailed clinical data with the Medidata CTG integration and annotation platform.
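
The "proper subset cluster" criterion described in this abstract can be sketched as a simple set check. This is an illustration only, not the CTG clustering algorithm: the function name and sample identifiers are hypothetical, the counts (540 samples, 55 TN, 74 non-TN in the cluster) are taken from the abstract, and the requirement that the cluster be strictly smaller than the full sample set is an assumed reading of "proper".

```python
def is_proper_subset_cluster(cluster, subtype_samples, all_samples):
    """Check that a cluster contains every sample of a subtype.

    Assumption: "proper" is read as the cluster being a strict subset
    of all samples, so a cluster holding everything does not count.
    """
    cluster = set(cluster)
    subtype = set(subtype_samples)
    everything = set(all_samples)
    return subtype <= cluster and cluster < everything


# Synthetic identifiers that only reproduce the counts reported in the abstract.
tn = {f"TN{i}" for i in range(55)}
non_tn = {f"S{i}" for i in range(540 - 55)}
all_samples = tn | non_tn
cluster = tn | {f"S{i}" for i in range(74)}

print(is_proper_subset_cluster(cluster, tn, all_samples))
# Fraction of the cluster that is TN: 55 of 129 samples.
print(round(len(tn) / len(cluster), 3))
```

The second printed value shows that capturing all TN samples does not make the cluster pure: fewer than half of its members are TN.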

Glycolysis-Based Genes Associated with the Clinical Outcome of Pancreatic Ductal Adenocarcinoma Identified by The Cancer Genome Atlas Data Analysis

DNA and Cell Biology ◽

10.1089/dna.2019.5089 ◽

2020 ◽

Vol 39 (3) ◽

pp. 417-427 ◽

Cited By ~ 5

Author(s):

Guangwei Tian ◽

Guang Li ◽

Peipei Liu ◽

Zihui Wang ◽

Nan Li

Keyword(s):

Data Analysis ◽

Clinical Outcome ◽

Pancreatic Ductal Adenocarcinoma ◽

Cancer Genome ◽

The Cancer Genome Atlas ◽

Ductal Adenocarcinoma ◽

Atlas Data ◽

Cancer Genome Atlas ◽

Genome Atlas

A Mixture Copula Bayesian Network Model for Multimodal Genomic Data

10.1101/110288 ◽

2017 ◽

Cited By ~ 1

Author(s):

Qingyang Zhang ◽

Xuan Shi

Keyword(s):

Bayesian Networks ◽

Bayesian Network ◽

Network Model ◽

Network Structure ◽

Genomic Data ◽

The Cancer Genome Atlas ◽

Bayesian Network Model ◽

Atlas Data ◽

Cancer Genome Atlas ◽

Genome Atlas

Abstract. Gaussian Bayesian networks have become a widely used framework to estimate directed associations between joint Gaussian variables, where the network structure encodes the decomposition of the multivariate normal density into local terms. However, the resulting estimates can be inaccurate when the normality assumption is moderately or severely violated, making the framework unsuitable for recent genomic data such as the Cancer Genome Atlas data. In the present paper, we propose a mixture copula Bayesian network model which provides great flexibility in modeling non-Gaussian and multimodal data for causal inference. The parameters in mixture copula functions can be efficiently estimated by a routine Expectation-Maximization algorithm. A heuristic search algorithm based on the Bayesian information criterion is developed to estimate the network structure, and prediction can be further improved by the best-scoring network out of multiple predictions from random initial values. Our method outperforms Gaussian Bayesian networks and regular copula Bayesian networks in terms of modeling flexibility and prediction accuracy, as demonstrated using a cell signaling dataset. We apply the proposed methods to the Cancer Genome Atlas data to study the genetic and epigenetic pathways that underlie serous ovarian cancer.

Omics Pipe: A Computational Framework for Reproducible Multi-Omics Data Analysis

10.1101/008383 ◽

2014 ◽

Cited By ~ 1

Author(s):

Kathleen M Fisch ◽

Tobias Meißner ◽

Louis Gioia ◽

Jean-Christophe Ducom ◽

Tristan Carland ◽

...

Keyword(s):

Data Analysis ◽

High Performance ◽

Best Practice ◽

The Cancer Genome Atlas ◽

Sequencing Analysis ◽

Omics Data ◽

Next Generation Sequencing Analysis ◽

Cancer Genome Atlas ◽

Computational Platform ◽

Omics Data Analysis

Omics Pipe (https://bitbucket.org/sulab/omics_pipe) is a computational platform that automates multi-omics data analysis pipelines on high performance compute clusters and in the cloud. It supports best practice published pipelines for RNA-seq, miRNA-seq, Exome-seq, Whole Genome sequencing, ChIP-seq analyses and automatic processing of data from The Cancer Genome Atlas. Omics Pipe provides researchers with a tool for reproducible, open source and extensible next generation sequencing analysis.
