Epiviz File Server: Query, Transform and Interactively Explore Data from Indexed Genomic Files

2019 ◽  
Author(s):  
Jayaram Kancherla ◽  
Yifan Yang ◽  
Hyeyun Chae ◽  
Hector Corrada Bravo

Abstract. Genomic data repositories such as The Cancer Genome Atlas (TCGA), the Encyclopedia of DNA Elements (ENCODE), and Bioconductor’s AnnotationHub and ExperimentHub provide public access to large amounts of genomic data as flat files. Researchers often download a subset of data files from these repositories to perform their data analysis. As these repositories grow, researchers often face bottlenecks in their exploratory data analysis. Based on the concepts of the NoDB paradigm, we developed epivizFileServer, a Python library that implements an in situ data query system for local or remotely hosted indexed genomic files, supporting not only visualization but also data manipulation. The File Server library decouples data from analysis workflows and provides an abstract interface to define computations independent of the location, format or structure of the file.

2020 ◽  
Vol 36 (18) ◽  
pp. 4682-4690 ◽  
Author(s):  
Jayaram Kancherla ◽  
Yifan Yang ◽  
Hyeyun Chae ◽  
Hector Corrada Bravo

Abstract. Motivation: Genomic data repositories such as The Cancer Genome Atlas, the Encyclopedia of DNA Elements, and Bioconductor’s AnnotationHub and ExperimentHub provide public access to large amounts of genomic data as flat files. Researchers often download a subset of data files from these repositories to perform exploratory data analysis. We developed Epiviz File Server, a Python library that implements an in situ data query system for local or remotely hosted indexed genomic files, supporting not only visualization but also data transformation. The File Server library decouples data retrieval and transformation from specific visualization and analysis tools and provides an abstract interface to define computations independent of the location, format or structure of the file. We demonstrate the File Server in two use cases: (i) integration with Galaxy workflows and (ii) using Epiviz to create a custom genome browser from the Epigenome Roadmap dataset. Availability and implementation: Epiviz File Server is open source and is available on GitHub at http://github.com/epiviz/epivizFileServer. The documentation for the File Server library is available at http://epivizfileserver.rtfd.io.
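The idea of querying an indexed file in place and defining a computation independently of the file behind it can be illustrated with a minimal sketch. The `IndexedTrack` class and `summarize` function below are hypothetical stand-ins written for this illustration; they are not the epivizFileServer API, and the in-memory records merely play the role of an indexed genomic file. Records are assumed to be sorted and non-overlapping within each chromosome.

```python
from bisect import bisect_right
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Record = Tuple[int, int, float]  # (start, end, value), half-open interval [start, end)


@dataclass
class IndexedTrack:
    """Hypothetical stand-in for one indexed genomic file (not the library's API)."""

    # Per-chromosome records, sorted by start and non-overlapping (an assumption).
    records: Dict[str, List[Record]]

    def __post_init__(self) -> None:
        # Build the "index" once: the sorted end coordinates of each chromosome.
        self._ends = {c: [r[1] for r in rows] for c, rows in self.records.items()}

    def query(self, chrom: str, start: int, end: int) -> List[Record]:
        """Return the records overlapping [start, end) using a binary search."""
        rows = self.records.get(chrom, [])
        # First record whose end lies strictly after the query start.
        i = bisect_right(self._ends.get(chrom, []), start)
        hits = []
        while i < len(rows) and rows[i][0] < end:
            hits.append(rows[i])
            i += 1
        return hits


def summarize(
    tracks: Sequence[IndexedTrack],
    chrom: str,
    start: int,
    end: int,
    fn: Callable[[List[float]], float] = mean,
) -> List[Optional[float]]:
    """Apply fn to the values each track returns for a region (None if no hits)."""
    out: List[Optional[float]] = []
    for track in tracks:
        values = [v for _, _, v in track.query(chrom, start, end)]
        out.append(fn(values) if values else None)
    return out


track_a = IndexedTrack({"chr1": [(0, 100, 1.0), (100, 200, 3.0), (200, 300, 5.0)]})
track_b = IndexedTrack({"chr1": [(0, 150, 2.0), (150, 400, 6.0)]})
region_means = summarize([track_a, track_b], "chr1", 150, 250)
print(region_means)
```

The computation in `summarize` only relies on the `query` method, so the same function would work unchanged if a track were backed by a local or a remote file, which is the kind of decoupling the abstract describes.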


F1000Research ◽  
2017 ◽  
Vol 6 ◽  
pp. 319
Author(s):  
Erin K. Wagner ◽  
Satyajeet Raje ◽  
Liz Amos ◽  
Jessica Kurata ◽  
Abhijit S. Badve ◽  
...  

Data sharing is critical to advancing genomic research: it reduces the need to collect new data, through the reuse and combination of existing data, and it promotes reproducible research. The Cancer Genome Atlas (TCGA) is a popular resource for individual-level genotype-phenotype cancer-related data. The Database of Genotypes and Phenotypes (dbGaP) contains many datasets similar to those in TCGA. We have created a software pipeline that will allow researchers to discover relevant genomic data from dbGaP, based on matching TCGA metadata. The resulting research provides an easy-to-use tool to connect these two data sources.


2018 ◽  
Author(s):  
Seo Jeong Shin ◽  
Seng Chan You ◽  
Yu Rang Park ◽  
Jin Roh ◽  
Jang-Hee Kim ◽  
...  

BACKGROUND: Clinical sequencing data should be shared in order to achieve the sufficient scale and diversity required to provide strong evidence for improving patient care. A distributed research network allows researchers to share this evidence rather than the patient-level data across centers, thereby avoiding privacy issues. The Observational Medical Outcomes Partnership (OMOP) common data model (CDM) used in distributed research networks has low coverage of sequencing data and does not reflect the latest trends of precision medicine. OBJECTIVE: The aim of this study was to develop and evaluate the feasibility of a genomic CDM (G-CDM), as an extension of the OMOP-CDM, for application of genomic data in clinical practice. METHODS: Existing genomic data models and sequencing reports were reviewed to extend the OMOP-CDM to cover genomic data. The Human Genome Organisation Gene Nomenclature Committee and Human Genome Variation Society nomenclature were adopted to standardize the terminology in the model. Sequencing data of 114 and 1060 patients with lung cancer were obtained from the Ajou University School of Medicine database of Ajou University Hospital and The Cancer Genome Atlas, respectively, which were transformed to a format appropriate for the G-CDM. The data were compared with respect to gene name, variant type, and actionable mutations. RESULTS: The G-CDM was extended into four tables linked to tables of the OMOP-CDM. Upon comparison with The Cancer Genome Atlas data, a clinically actionable mutation, p.Leu858Arg, in the EGFR gene was 6.64 times more frequent in the Ajou University School of Medicine database, while the p.Gly12Xaa mutation in the KRAS gene was 2.02 times more frequent in The Cancer Genome Atlas dataset. The data-exploring tool GeneProfiler was further developed to conduct descriptive analyses automatically using the G-CDM, which provides the proportions of genes, variant types, and actionable mutations. GeneProfiler also allows for querying the specific gene name and Human Genome Variation Society nomenclature to calculate the proportion of patients with a given mutation. CONCLUSIONS: We developed the G-CDM for effective integration of genomic data with standardized clinical data, allowing for data sharing across institutes. The feasibility of the G-CDM was validated by assessing the differences in data characteristics between two different genomic databases through the proposed data-exploring tool GeneProfiler. The G-CDM may facilitate analyses of interoperating clinical and genomic datasets across multiple institutions, minimizing privacy issues and enabling researchers to better understand the characteristics of patients and promote personalized medicine in clinical practice.


2017 ◽  
Vol 35 (15_suppl) ◽  
pp. e23195-e23195
Author(s):  
Jason Mezey ◽  
Steven Schwager ◽  
Sushila Shenoy ◽  
Jef Benbanaste ◽  
Michael Elashoff ◽  
...  

Background: Clustering algorithms have identified subtypes of major cancers from analysis of genome-wide gene expression (GE) and somatic mutation (SM) profiles. These algorithms almost never discover a proper subset cluster, a recovered cluster that includes all the samples of a specific subtype. For breast cancer (BC), clustering of genome-wide profiles has been unable to proper subset triple negatives (TNs), TN subtypes, or other major subtypes. Methods: To search for a proper subset cluster for TNs, we applied a new clustering algorithm to the public domain GE and SM data of BC samples in The Cancer Genome Atlas (TCGA). A module of Medidata’s Clinical Trial Genomics (CTG) platform for automated clinical and genomic data integration and analysis, it uses a hierarchical component with tree-learned cut points applied to a principal-component dimension-reduced similarity matrix calculated from a genome-wide data profile. Results: Our analysis of 540 TCGA BC samples, run without human supervision, produced a proper subset cluster that included all 55 TN samples and only 74 non-TN samples. GE data have previously indicated TN status, but this is the first demonstration that these TCGA BC data contain enough information to proper subset TNs, implying that this broad BC subtype has a strong, quantifiable impact on GE. We show that the genome-wide SMs of TCGA BC samples can be used to proper subset 4 novel subtypes distinguished as classes “TP53 mutated”, “PIK3CA mutated”, “both TP53 and PIK3CA mutated”, and “neither mutated”, signifying an important role for these known driver mutations in producing the subtypes’ genome-wide mutation profiles. We find that most (>80%) TN BCs are in “TP53 mutated” but only 1 TN sample (<2%) is in “PIK3CA mutated”, indicating distinct biology for these TNs with potential implications for TN therapy. Conclusions: CTG clustering achieves proper subset cancer subtype clustering of TCGA BC samples. These results illustrate the therapeutic discovery potential of genomic data of the high quality present in TCGA when combined with detailed clinical data through the Medidata CTG integration and annotation platform.
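To put the reported cluster in perspective, the counts in the Results (540 samples, a cluster holding all 55 TN samples plus 74 non-TN samples) can be turned into standard recall, precision and specificity figures. The short sketch below does that arithmetic; the function name is illustrative, and it assumes that the 55 TN samples are all of the TN samples among the 540, as the abstract's wording suggests.

```python
def cluster_metrics(total: int, positives: int, tp_in_cluster: int, fp_in_cluster: int) -> dict:
    """Recall, precision and specificity of one cluster treated as a classifier."""
    negatives = total - positives
    cluster_size = tp_in_cluster + fp_in_cluster
    return {
        "cluster_size": cluster_size,
        "recall": tp_in_cluster / positives,          # share of TN samples captured
        "precision": tp_in_cluster / cluster_size,    # share of the cluster that is TN
        "specificity": (negatives - fp_in_cluster) / negatives,  # non-TN kept out
    }


# Counts taken from the abstract; the "all TN samples = 55" reading is an assumption.
metrics = cluster_metrics(total=540, positives=55, tp_in_cluster=55, fp_in_cluster=74)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

Under this reading the cluster has 129 members, recall is 1.0 by construction of a proper subset cluster, and fewer than half of the cluster members are TN, so "proper subset" describes complete capture of the subtype and not a pure cluster.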


2017 ◽  
Author(s):  
Qingyang Zhang ◽  
Xuan Shi

Abstract. Gaussian Bayesian networks have become a widely used framework to estimate directed associations between jointly Gaussian variables, where the network structure encodes the decomposition of a multivariate normal density into local terms. However, the resulting estimates can be inaccurate when the normality assumption is moderately or severely violated, making the framework unsuitable for recent genomic data such as the Cancer Genome Atlas data. In the present paper, we propose a mixture copula Bayesian network model which provides great flexibility in modeling non-Gaussian and multimodal data for causal inference. The parameters in mixture copula functions can be efficiently estimated by a routine Expectation-Maximization algorithm. A heuristic search algorithm based on the Bayesian information criterion is developed to estimate the network structure, and prediction can be further improved by selecting the best-scoring network out of multiple predictions from random initial values. Our method outperforms Gaussian Bayesian networks and regular copula Bayesian networks in terms of modeling flexibility and prediction accuracy, as demonstrated using a cell signaling dataset. We apply the proposed methods to the Cancer Genome Atlas data to study the genetic and epigenetic pathways that underlie serous ovarian cancer.
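For readers unfamiliar with the terms, the "decomposition into local terms" and the Bayesian information criterion mentioned in the abstract have the following textbook forms. The notation is an assumption made for this note (pa(i) for the parents of node i, k_G for the number of free parameters of graph G, n for the sample size); the paper's exact parameterization and scoring function are not given in the abstract.

```latex
% Standard Gaussian Bayesian network factorization over a DAG G
f(x_1,\dots,x_p) \;=\; \prod_{i=1}^{p} f\bigl(x_i \mid x_{\mathrm{pa}(i)}\bigr),
\qquad
x_i \mid x_{\mathrm{pa}(i)} \;\sim\; \mathcal{N}\Bigl(\beta_{i0} + \sum_{j \in \mathrm{pa}(i)} \beta_{ij}\, x_j,\; \sigma_i^2\Bigr)

% Bayesian information criterion for a candidate graph G (lower is better)
\mathrm{BIC}(G) \;=\; -2 \ln \hat{L}_G \;+\; k_G \ln n
```

A copula-based variant keeps the same graph factorization but replaces the Gaussian local terms with more flexible dependence models, which is the direction the abstract describes.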


2014 ◽  
Author(s):  
Kathleen M Fisch ◽  
Tobias Meißner ◽  
Louis Gioia ◽  
Jean-Christophe Ducom ◽  
Tristan Carland ◽  
...  

Omics Pipe (https://bitbucket.org/sulab/omics_pipe) is a computational platform that automates multi-omics data analysis pipelines on high-performance compute clusters and in the cloud. It supports published best-practice pipelines for RNA-seq, miRNA-seq, Exome-seq, whole-genome sequencing and ChIP-seq analyses, as well as automatic processing of data from The Cancer Genome Atlas. Omics Pipe provides researchers with a tool for reproducible, open-source and extensible next-generation sequencing analysis.

