OpenGDC: Unifying, Modeling, Integrating Cancer Genomic Data and Clinical Metadata

Eleonora Cappelli; Fabio Cumbo; Anna Bernasconi; Arif Canakoglu; Stefano Ceri; Marco Masseroli; Emanuel Weitschek

doi:10.3390/app10186367

Applied Sciences ◽

10.3390/app10186367 ◽

2020 ◽

Vol 10 (18) ◽

pp. 6367

Author(s):

Eleonora Cappelli ◽

Fabio Cumbo ◽

Anna Bernasconi ◽

Arif Canakoglu ◽

Stefano Ceri ◽

...

Keyword(s):

Clinical Data ◽

Data Model ◽

Query Language ◽

Genomic Data ◽

Application Programming Interface ◽

The Cancer Genome Atlas ◽

Genomic Databases ◽

Sequencing Technologies ◽

Efficient Management ◽

Data Portal

Next Generation Sequencing technologies have produced a substantial increase in publicly available genomic data and related clinical/biospecimen information. New models and methods are needed to access, integrate, and search these data effectively. The Genomic Data Commons (GDC) made one such effort: it defined strict procedures for harmonizing genomic and clinical cancer data and created the GDC data portal with its application programming interface (API). In this work, we enhance GDC harmonization by applying a state-of-the-art data model, the Genomic Data Model, which has two components: the genomic data, in Browser Extensible Data (BED) format, and the related metadata, in a tab-delimited key-value format. Furthermore, we extend the GDC genomic data with information extracted from other public genomic databases (e.g., GENCODE, HGNC and miRBase). For metadata, we implemented automatic procedures that extract and normalize them, recognizing and eliminating redundant entries, from both the Clinical/Biospecimen Supplements and the GDC Data Model, which are available from the two GDC sources (i.e., data portal and API). We developed and released the OpenGDC software, which extracts, integrates, extends, and standardizes genomic and clinical data of The Cancer Genome Atlas (TCGA) from the GDC. Additionally, we created a publicly accessible repository containing these homogenized and enhanced TCGA data (about 1.3 TB in total). Our approach, implemented in the OpenGDC software, is a step forward in the effective and efficient management of big genomic and clinical cancer data. We demonstrate the usability of our data model and the utility of our work by applying the GenoMetric Query Language (GMQL) to the transformed TCGA data from the GDC, achieving promising results and facilitating information retrieval and knowledge discovery analyses.
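The two-part layout described above (a region file plus a tab-delimited key-value metadata file) can be illustrated with a minimal Python sketch. The column order (chromosome, start, end, strand, then extra attributes), the attribute names, and all sample values below are assumptions made for illustration; they are not taken from the OpenGDC specification.

```python
import io
from typing import Dict, Iterator, List, NamedTuple, TextIO


class Region(NamedTuple):
    """One genomic region; the column order is an assumption for this sketch."""
    chrom: str
    start: int
    end: int
    strand: str
    values: List[str]  # any extra, experiment-specific columns


def read_regions(handle: TextIO) -> Iterator[Region]:
    """Parse tab-delimited region lines, skipping blanks and '#' comments."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield Region(fields[0], int(fields[1]), int(fields[2]), fields[3], fields[4:])


def read_metadata(handle: TextIO) -> Dict[str, List[str]]:
    """Parse 'key<TAB>value' lines; a key may occur more than once."""
    meta: Dict[str, List[str]] = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        key, _, value = line.partition("\t")
        meta.setdefault(key, []).append(value)
    return meta


# Hypothetical sample content (not real TCGA data).
region_text = "chr1\t100\t250\t+\tGENE_A\t0.0\nchr1\t300\t900\t-\tGENE_B\t12.5\n"
meta_text = "disease_type\tBreast Invasive Carcinoma\ngender\tfemale\n"

regions = list(read_regions(io.StringIO(region_text)))
metadata = read_metadata(io.StringIO(meta_text))
```

Keeping region data and metadata in separate structures mirrors the split the abstract describes: queries can filter samples by metadata first and only then touch the larger region data.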

Abstraction on clinical data sequences: an object-oriented data model and a query language based on the event calculus

Artificial Intelligence in Medicine ◽

10.1016/s0933-3657(99)00022-6 ◽

1999 ◽

Vol 17 (3) ◽

pp. 271-301 ◽

Cited By ~ 28

Author(s):

Carlo Combi ◽

Luca Chittaro

Keyword(s):

Clinical Data ◽

Data Model ◽

Query Language ◽

Object Oriented ◽

Event Calculus

CoolBox: an interactive genomic data explorer for Jupyter Notebook

10.1101/614222 ◽

2019 ◽

Author(s):

Weize Xu ◽

Da Lin ◽

Ping Hong ◽

Liang Yi ◽

Rohit Tyagi ◽

...

Keyword(s):

Genomic Data ◽

Application Programming Interface ◽

Data Exploration ◽

Rna Seq ◽

Sequencing Technologies ◽

Data Formats ◽

Genomic Data Visualization ◽

Application Programming ◽

Interactive Data ◽

Python Package

Summary: CoolBox is a Python package for interactive genomic data exploration based on Jupyter Notebook. It provides a ggplot2-like Application Programming Interface (API) for genomic data visualization and a Jupyter/ipywidgets-based Graphical User Interface (GUI) for interactive data exploration. CoolBox is a versatile multi-omics explorer that supports most data formats generated by sequencing technologies such as RNA-Seq, ChIP-Seq, ChIA-PET and Hi-C. Availability and implementation: CoolBox is implemented purely in Python, and the GUI widget in Jupyter Notebook is based on the ipywidgets package. It is open source and available under the GPLv3 license at https://github.com/GangCaoLab/CoolBox.

Extending the Genomic Data Model and the Genometric Query Language with Domain Taxonomies

Lecture Notes in Computer Science - Web Engineering ◽

10.1007/978-3-319-60131-1_44 ◽

2017 ◽

pp. 567-574 ◽

Cited By ~ 1

Author(s):

Eleonora Cappelli ◽

Emanuel Weitschek

Keyword(s):

Data Model ◽

Query Language ◽

Genomic Data

Analyses of cancer data in the Genomic Data Commons Data Portal with new functionalities in the TCGAbiolinks R/Bioconductor package

10.1101/350439 ◽

2018 ◽

Author(s):

Mohamed Mounir ◽

Tiago C. Silva ◽

Marta Lucchetta ◽

Catharina Olsen ◽

Gianluca Bontempi ◽

...

Keyword(s):

Differential Expression ◽

Differential Expression Analysis ◽

Genomic Data ◽

Tissue Expression ◽

The Cancer Genome Atlas ◽

Bioconductor Package ◽

Cancer Data ◽

Tumor Purity ◽

Data Portal ◽

Data Commons

The advent of Next Generation Sequencing (NGS) technologies has opened new perspectives in deciphering the genetic mechanisms underlying complex diseases. Nowadays, the amount of genomic data is massive, and substantial efforts and new tools are required to unveil the information hidden in the data. The Genomic Data Commons (GDC) Data Portal is a large data collection platform that includes different genomic studies, including those from The Cancer Genome Atlas (TCGA) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiatives, accounting for more than 40 tumor types originating from nearly 30,000 patients. Such platforms, although very attractive, must make sure the stored data are easily accessible and adequately harmonized. Moreover, their primary focus is storing the data in a single place, and they do not provide a comprehensive toolkit for analysis and interpretation of the data. To fulfill this urgent need, comprehensive yet easily accessible computational methods are needed for integrative analyses of genomic data, without renouncing a robust statistical and theoretical framework. In this context, the R/Bioconductor package TCGAbiolinks was developed, offering a variety of bioinformatics functionalities. Here we introduce new features and enhancements of TCGAbiolinks in terms of (i) more accurate and flexible pipelines for differential expression analyses, (ii) different methods for tumor purity estimation and filtering, (iii) integration of normal samples from the Genotype-Tissue Expression (GTEx) platform, and (iv) support for other genomics datasets, here exemplified by the TARGET data. Evidence has shown that accounting for tumor purity is essential in the study of tumorigenesis, as it can confound differential expression analysis. Hence, we implemented these filtering procedures in TCGAbiolinks. Moreover, a limitation of some of the TCGA datasets is the unavailability or paucity of corresponding normal samples. We thus integrated into TCGAbiolinks the possibility to use normal samples from the GTEx project, another large-scale repository cataloging gene expression from healthy individuals. The new functionalities are available in TCGAbiolinks v2.8 and higher, released in Bioconductor version 3.7.

The BIRC Family Genes Expression in Patients with Triple Negative Breast Cancer

International Journal of Molecular Sciences ◽

10.3390/ijms22041820 ◽

2021 ◽

Vol 22 (4) ◽

pp. 1820

Author(s):

Anna Makuch-Kocka ◽

Janusz Kocki ◽

Anna Brzozowska ◽

Jacek Bogucki ◽

Przemysław Kołodziej ◽

...

Keyword(s):

Breast Cancer ◽

Gene Expression ◽

Clinical Data ◽

Triple Negative Breast Cancer ◽

Triple Negative ◽

Lymphatic Vessels ◽

The Cancer Genome Atlas ◽

Expression Level ◽

Cancer Tissue ◽

Expression Levels

The BIRC (baculoviral IAP repeat-containing) family genes encode Inhibitor of Apoptosis (IAP) proteins. The dysregulation of the expression levels of these genes in cancer tissue as compared to normal tissue suggests that the apoptosis process in cancer cells is disturbed, which may be associated with the development and chemoresistance of triple negative breast cancer (TNBC). In our study, we determined the expression level of eight genes from the BIRC family using the Real-Time PCR method in patients with TNBC and compared the obtained results with clinical data. Additionally, using bioinformatics tools (Ualcan and The Breast Cancer Gene-Expression Miner v4.5 (bc-GenExMiner v4.5)), we compared our data with the data in The Cancer Genome Atlas (TCGA) database. We observed diverse expression patterns among the studied genes in breast cancer tissue. Comparing the expression level of the studied genes with the clinical data, we found that in patients diagnosed with breast cancer under the age of 50, the expression levels of all studied genes were higher than in patients diagnosed after the age of 50. We observed that in patients with invasion of neoplastic cells into lymphatic vessels and fat tissue, the expression levels of BIRC family genes were lower than in patients in whom these features were not noted. Statistically significant differences in gene expression were also noted among patients classified into three groups according to the Scarff-Bloom and Richardson (SBR) Grading System.

Immunogenomic Identification for Predicting the Prognosis of Cervical Cancer Patients

International Journal of Molecular Sciences ◽

10.3390/ijms22052442 ◽

2021 ◽

Vol 22 (5) ◽

pp. 2442

Author(s):

Qun Wang ◽

Aurelia Vattai ◽

Theresa Vilsmaier ◽

Till Kaltofen ◽

Alexander Steger ◽

...

Keyword(s):

Cervical Cancer ◽

Clinical Data ◽

Regulatory Network ◽

Univariate Analysis ◽

Predictive Biomarkers ◽

Enrichment Analysis ◽

The Cancer Genome Atlas ◽

Cancer Prognosis ◽

Functional Enrichment ◽

Wilcoxon Test

Cervical cancer is primarily caused by infection with high-risk human papillomavirus (hrHPV). Moreover, the tumor immune microenvironment plays a significant role in the tumorigenesis of cervical cancer. Therefore, it is necessary to comprehensively identify predictive biomarkers from immunogenomics associated with cervical cancer prognosis. The Cancer Genome Atlas (TCGA) public database stores abundant sequencing, microarray, and clinical data, offering a feasible and reliable basis for this study. In the present study, gene profiles and clinical data were downloaded from TCGA and the Immunology Database and Analysis Portal (ImmPort) database. The Wilcoxon test was used to compare gene expression between groups. Univariate analysis was adopted to identify immune-related genes (IRGs) and transcription factors (TFs) correlated with survival. A prognostic prediction model was established by multivariate Cox analysis. The regulatory network was constructed by correlation analysis and visualized with Cytoscape. Gene functional enrichment analysis was performed using Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG). A total of 204 differentially expressed IRGs were identified, and 22 of them were significantly associated with the survival of cervical cancer patients. These 22 IRGs were actively involved in the JAK-STAT pathway. A prognostic model based on 10 IRGs (APOD, TFRC, GRN, CSK, HDAC1, NFATC4, BMP6, IL17RD, IL3RA, and LEPR) performed moderately and steadily in squamous cell carcinoma (SCC) patients with FIGO stage I, regardless of age and grade. Taken together, a risk score model consisting of 10 novel genes capable of predicting survival in SCC patients was identified. Moreover, the regulatory network of IRGs associated with survival (SIRGs) and their TFs provided potential molecular targets.
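As an illustration of the expression comparison mentioned in this abstract, the following Python sketch computes the Wilcoxon rank-sum statistic and the equivalent Mann-Whitney U for two unpaired groups. The abstract does not say which Wilcoxon variant was used, so the unpaired rank-sum form is an assumption, and the expression values are hypothetical, not data from the study. A real analysis would normally use a statistics library, which also provides the p-value.

```python
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def rank_sum_u(first: Sequence[float], second: Sequence[float]) -> Tuple[float, float]:
    """Return (W, U) for the first sample: its rank sum and Mann-Whitney U."""
    pooled = list(first) + list(second)
    ranks = average_ranks(pooled)
    n1 = len(first)
    w = sum(ranks[:n1])
    u = w - n1 * (n1 + 1) / 2
    return w, u


# Hypothetical expression values for one gene (illustrative only).
tumor_expr = [8.1, 7.4, 9.3, 8.8]
normal_expr = [5.2, 6.0, 7.4]

w_stat, u_stat = rank_sum_u(tumor_expr, normal_expr)
print(w_stat, u_stat)  # 21.5 11.5
```

Here U counts the tumor/normal pairs in which the tumor value is larger, with ties counted as one half, so a value close to the maximum of 4 × 3 = 12 indicates that tumor expression is almost always higher in this toy example.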

Bridging the Gap between Vertebrate Cytogenetics and Genomics with Single-Chromosome Sequencing (ChromSeq)

Genes ◽

10.3390/genes12010124 ◽

2021 ◽

Vol 12 (1) ◽

pp. 124

Author(s):

Alessio Iannucci ◽

Alexey I. Makunin ◽

Artem P. Lisachov ◽

Claudio Ciofi ◽

Roscoe Stanyon ◽

...

Keyword(s):

Genome Evolution ◽

Karyotype Evolution ◽

Genomic Data ◽

Anolis Carolinensis ◽

Vertebrate Genome ◽

Single Chromosome ◽

Sequencing Technologies ◽

Novel Approaches ◽

Genome Assemblies ◽

Generation Sequencing

The study of vertebrate genome evolution is currently facing a revolution, brought about by next generation sequencing technologies that allow researchers to produce nearly complete and error-free genome assemblies. Novel approaches, however, do not always provide a direct link with the information on vertebrate genome evolution gained from cytogenetic approaches. It is useful to preserve cytogenetic data and link them with novel genomic discoveries. Sequencing of DNA from single isolated chromosomes (ChromSeq) is an elegant approach to determine the chromosome content and assign genome assemblies to chromosomes, thus bridging the gap between cytogenetics and genomics. The aim of this paper is to describe how ChromSeq can support the study of vertebrate genome evolution and how it can help link cytogenetic and genomic data. We show key examples of ChromSeq application in the refinement of vertebrate genome assemblies and in the study of vertebrate chromosome and karyotype evolution. We also provide a general overview of the approach and a concrete example of genome refinement using this method in the species Anolis carolinensis.

Extending TCGA queries to automatically identify analogous genomic data from dbGaP

F1000Research ◽

10.12688/f1000research.9837.1 ◽

2017 ◽

Vol 6 ◽

pp. 319

Author(s):

Erin K. Wagner ◽

Satyajeet Raje ◽

Liz Amos ◽

Jessica Kurata ◽

Abhijit S. Badve ◽

...

Keyword(s):

Genomic Data ◽

The Cancer Genome Atlas ◽

Genomic Research ◽

Reproducible Research ◽

Software Pipeline ◽

Individual Level ◽

Related Data ◽

Cancer Genome Atlas ◽

Existing Data ◽

Genome Atlas

Data sharing is critical to advancing genomic research: reusing and combining existing data reduces the need to collect new data and promotes reproducible research. The Cancer Genome Atlas (TCGA) is a popular resource for individual-level genotype-phenotype cancer-related data. The Database of Genotypes and Phenotypes (dbGaP) contains many datasets similar to those in TCGA. We have created a software pipeline that allows researchers to discover relevant genomic data in dbGaP based on matching TCGA metadata. The resulting research provides an easy-to-use tool to connect these two data sources.

Utilizing the VirIdAl Pipeline to Search for Viruses in the Metagenomic Data of Bat Samples

Viruses ◽

10.3390/v13102006 ◽

2021 ◽

Vol 13 (10) ◽

pp. 2006

Author(s):

Anna Y Budkina ◽

Elena V Korneenko ◽

Ivan A Kotov ◽

Daniil A Kiselev ◽

Ilya V Artyushin ◽

...

Keyword(s):

Large Scale ◽

High Throughput Sequencing ◽

Metagenomic Data ◽

Sequencing Data ◽

Viral Pathogens ◽

Genomic Databases ◽

Bioinformatic Pipeline ◽

Viral Genomes ◽

Sequencing Technologies ◽

Viral Screening

According to various estimates, only a small percentage of existing viruses have been discovered, and naturally even fewer are represented in genomic databases. High-throughput sequencing technologies are developing rapidly, empowering large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms are yet to be attributed specific loci for identification. This problem particularly impedes viral screening, due to the vast heterogeneity of viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, alignment-based methods were unable to identify the taxon for a large proportion of reads, so we applied other approaches and showed that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in studies of viral diversity and therefore necessitates the use of combined approaches, including those based on machine learning methods.

Rapid advancement in cancer genomic big data in the pursuit of precision oncology

Medical Journal of Indonesia ◽

10.13181/mji.rev.204250 ◽

2021 ◽

Author(s):

Tiara Bunga Mayang Permata ◽

Sri Mutya Sekarutami ◽

Endang Nuryadi ◽

Angela Giselvania ◽

Soehartati Gondhowiardjo

Keyword(s):

Big Data ◽

Open Access ◽

Cancer Cell ◽

Cancer Cell Line ◽

Genomic Data ◽

The Cancer Genome Atlas ◽

Clinical Samples ◽

Precision Oncology ◽

Cancer Data ◽

User Friendly

In the current big data era, massive genomic cancer data are available for open access from anywhere in the world. They are obtained from popular platforms, such as The Cancer Genome Atlas, which provides genetic information from clinical samples, and Cancer Cell Line Encyclopedia, which offers genomic data of cancer cell lines. For convenient analysis, user-friendly tools, such as the Tumor Immune Estimation Resource (TIMER), which can be used to analyze tumor-infiltrating immune cells comprehensively, are also emerging. In clinical practice, clinical sequencing has been recommended for patients with cancer in many countries. Despite its many challenges, it enables the application of precision medicine, especially in medical oncology. In this review, several efforts devoted to accomplishing precision oncology and applying big data for use in Indonesia are discussed. Utilizing open access genomic data in writing research articles is also described.
