Integration and analysis of CPTAC proteomics data in the context of cancer genomics in the cBioPortal

Mapping Intimacies ◽

10.1101/247718 ◽

2018 ◽

Author(s):

Pamela Wu ◽

Zachary J Heins ◽

James T Muller ◽

Adam A Abeshouse ◽

Yichao Sun ◽

...

Keyword(s):

Mass Spectrometry ◽

Clinical Data ◽

Cancer Genomics ◽

Ovarian Tumors ◽

The Cancer Genome Atlas ◽

Mass Spectrometry Data ◽

Proteomics Data ◽

Data Types ◽

Level Data ◽

Graphical Summary

Summary The Clinical Proteomic Tumor Analysis Consortium (CPTAC) has produced extensive mass spectrometry based proteomics data for selected breast, colon and ovarian tumors from The Cancer Genome Atlas (TCGA). We have incorporated the CPTAC proteomics data into the cBioPortal to support easy exploration and integrative analysis of these proteomic datasets in the context of the clinical and genomics data from the same tumors. cBioPortal is an open source platform for exploring, visualizing, and analyzing multi-dimensional cancer genomics and clinical data. The public instance of the cBioPortal (http://cbioportal.org/) hosts more than 100 cancer genomics studies including all of the data from TCGA. Its biologist-friendly interface provides many rich analysis features, including a graphical summary of gene-level data across multiple platforms, correlation analysis between genes or other data types, survival analysis, and network visualization. Here, we present the integration of the CPTAC mass spectrometry based proteomics data into the cBioPortal, consisting of 77 breast, 95 colorectal, and 174 ovarian tumors that already have been profiled by TCGA for mutations, copy number alterations, gene expression, and DNA methylation. As a result, the CPTAC data can now be easily explored and analyzed in the cBioPortal in the context of clinical and genomics data. By integrating CPTAC data into cBioPortal, limitations of TCGA proteomics array data can be overcome while also providing a user-friendly web interface, a web API and an R client to query the mass spectrometry data together with genomic, epigenomic, and clinical data.

Multiomic Integration of Public Oncology Databases in Bioconductor

JCO Clinical Cancer Informatics ◽

10.1200/cci.19.00119 ◽

2020 ◽

pp. 958-971

Author(s):

Marcel Ramos ◽

Ludwig Geistlinger ◽

Sehyun Oh ◽

Lucas Schiffer ◽

Rimsha Azhar ◽

...

Keyword(s):

Web Application ◽

Cancer Genomics ◽

Application Programming Interface ◽

Data Representation ◽

The Cancer Genome Atlas ◽

Data Sets ◽

Data Types ◽

Data Infrastructure ◽

Integrative Framework ◽

Pan Cancer

PURPOSE Investigations of the molecular basis for the development, progression, and treatment of cancer increasingly use complementary genomic assays to gather multiomic data, but management and analysis of such data remain complex. The cBioPortal for cancer genomics currently provides multiomic data from > 260 public studies, including The Cancer Genome Atlas (TCGA) data sets, but integration of different data types remains challenging and error prone for computational methods and tools using these resources. Recent advances in data infrastructure within the Bioconductor project enable a novel and powerful approach to creating fully integrated representations of these multiomic, pan-cancer databases. METHODS We provide a set of R/Bioconductor packages for working with TCGA legacy data and cBioPortal data, with special considerations for loading time; efficient representations in and out of memory; analysis platform; and an integrative framework, such as MultiAssayExperiment. Large methylation data sets are provided through out-of-memory data representation to provide responsive loading times and analysis capabilities on machines with limited memory. RESULTS We developed the curatedTCGAData and cBioPortalData R/Bioconductor packages to provide integrated multiomic data sets from the TCGA legacy database and the cBioPortal web application programming interface using the MultiAssayExperiment data structure. This suite of tools provides coordination of diverse experimental assays with clinicopathological data with minimal data management burden, as demonstrated through several greatly simplified multiomic and pan-cancer analyses. CONCLUSION These integrated representations enable analysts and tool developers to apply general statistical and plotting methods to extensive multiomic data through user-friendly commands and documented examples.

MSpectraAI: a powerful platform for deciphering proteome profiling of multi-tumor mass spectrometry data by using deep neural networks

BMC Bioinformatics ◽

10.1186/s12859-020-03783-0 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Shisheng Wang ◽

Hongwen Zhu ◽

Hu Zhou ◽

Jingqiu Cheng ◽

Hao Yang

Keyword(s):

Mass Spectrometry ◽

Neural Networks ◽

Large Scale ◽

Deep Neural Networks ◽

Spectral Feature ◽

Mass Spectrometry Data ◽

Learning Approaches ◽

Proteomics Data ◽

Proteome Profiling ◽

Analytical Technique

Abstract Background Mass spectrometry (MS) has become a promising analytical technique to acquire proteomics information for the characterization of biological samples. Nevertheless, most studies focus on the final proteins identified through a suite of algorithms by using partial MS spectra to compare with the sequence database, while the pattern recognition and classification of raw mass-spectrometric data remain unresolved. Results We developed an open-source and comprehensive platform, named MSpectraAI, for analyzing large-scale MS data through deep neural networks (DNNs); this system involves spectral-feature swath extraction, classification, and visualization. Moreover, this platform allows users to create their own DNN model by using Keras. To evaluate this tool, we collected the publicly available proteomics datasets of six tumor types (a total of 7,997,805 mass spectra) from the ProteomeXchange consortium and classified the samples based on the spectra profiling. The results suggest that MSpectraAI can distinguish different types of samples based on the fingerprint spectrum and achieve better prediction accuracy in MS1 level (average 0.967). Conclusion This study deciphers proteome profiling of raw mass spectrometry data and broadens the promising application of the classification and prediction of proteomics data from multi-tumor samples using deep learning methods. MSpectraAI also shows a better performance compared to the other classical machine learning approaches.

Do we want our data raw? Including binary mass spectrometry data in public proteomics data repositories

PROTEOMICS ◽

10.1002/pmic.200401302 ◽

2005 ◽

Vol 5 (13) ◽

pp. 3501-3505 ◽

Cited By ~ 42

Author(s):

Lennart Martens ◽

Alexey I. Nesvizhskii ◽

Henning Hermjakob ◽

Marcin Adamski ◽

Gilbert S. Omenn ◽

...

Keyword(s):

Mass Spectrometry ◽

Mass Spectrometry Data ◽

Proteomics Data ◽

Data Repositories

TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages

F1000Research ◽

10.12688/f1000research.8923.2 ◽

2016 ◽

Vol 5 ◽

pp. 1542 ◽

Cited By ~ 12

Author(s):

Tiago C. Silva ◽

Antonio Colaprico ◽

Catharina Olsen ◽

Fulvio D'Angelo ◽

Gianluca Bontempi ◽

...

Keyword(s):

Cancer Genomics ◽

High Grade Glioma ◽

Molecular Data ◽

The Cancer Genome Atlas ◽

Low Grade ◽

Data Types ◽

Biologically Relevant ◽

Public Projects ◽

Link Type ◽

Dna Elements

Biotechnological advances in sequencing have led to an explosion of publicly available data via large international consortia such as The Cancer Genome Atlas (TCGA), The Encyclopedia of DNA Elements (ENCODE), and The NIH Roadmap Epigenomics Mapping Consortium (Roadmap). These projects have provided unprecedented opportunities to interrogate the epigenome of cultured cancer cell lines as well as normal and tumor tissues with high genomic resolution. The Bioconductor project offers more than 1,000 open-source software and statistical packages to analyze high-throughput genomic data. However, most packages are designed for specific data types (e.g. expression, epigenetics, genomics) and there is no one comprehensive tool that provides a complete integrative analysis of the resources and data provided by all three public projects. A need to create an integration of these different analyses was recently proposed. In this workflow, we provide a series of biologically focused integrative analyses of different molecular data. We describe how to download, process and prepare TCGA data and by harnessing several key Bioconductor packages, we describe how to extract biologically meaningful genomic and epigenomic data. Using Roadmap and ENCODE data, we provide a work plan to identify biologically relevant functional epigenomic elements associated with cancer. To illustrate our workflow, we analyzed two types of brain tumors: low-grade glioma (LGG) versus high-grade glioma (glioblastoma multiform or GBM). This workflow introduces the following Bioconductor packages: AnnotationHub, ChIPSeeker, ComplexHeatmap, pathview, ELMER, GAIA, MINET, RTCGAToolbox, TCGAbiolinks.

CHICKN: extraction of peptide chromatographic elution profiles from large scale mass spectrometry data by means of Wasserstein compressive hierarchical cluster analysis

BMC Bioinformatics ◽

10.1186/s12859-021-03969-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Olga Permiakova ◽

Romain Guibert ◽

Alexandra Kraut ◽

Thomas Fortin ◽

Anne-Marie Hesse ◽

...

Keyword(s):

Mass Spectrometry ◽

Large Scale ◽

Clustering Algorithm ◽

Optimal Transport ◽

State Of The Art ◽

Data Representation ◽

Machine Learning Algorithms ◽

Mass Spectrometry Data ◽

Proteomics Data ◽

Chromatographic Elution

Abstract Background The clustering of data produced by liquid chromatography coupled to mass spectrometry analyses (LC-MS data) has recently gained interest to extract meaningful chemical or biological patterns. However, recent instrumental pipelines deliver data which size, dimensionality and expected number of clusters are too large to be processed by classical machine learning algorithms, so that most of the state-of-the-art relies on single pass linkage-based algorithms. Results We propose a clustering algorithm that solves the powerful but computationally demanding kernel k-means objective function in a scalable way. As a result, it can process LC-MS data in an acceptable time on a multicore machine. To do so, we combine three essential features: a compressive data representation, Nyström approximation and a hierarchical strategy. In addition, we propose new kernels based on optimal transport, which interprets as intuitive similarity measures between chromatographic elution profiles. Conclusions Our method, referred to as CHICKN, is evaluated on proteomics data produced in our lab, as well as on benchmark data coming from the literature. From a computational viewpoint, it is particularly efficient on raw LC-MS data. From a data analysis viewpoint, it provides clusters which differ from those resulting from state-of-the-art methods, while achieving similar performances. This highlights the complementarity of differently principle algorithms to extract the best from complex LC-MS data.

Do we want our data raw? Including binary mass spectrometry data in public proteomics data repositories

Exploring the Human Plasma Proteome ◽

10.1002/9783527609482.ch15 ◽

2006 ◽

pp. 323-328

Author(s):

Lennart Martens ◽

Alexey I. Nesvizhskii ◽

Henning Hermjakob ◽

Marcin Adamski ◽

Gilbert S. Omenn ◽

...

Keyword(s):

Mass Spectrometry ◽

Mass Spectrometry Data ◽

Proteomics Data ◽

Data Repositories

Systematically Characterizing A-to-I RNA Editing Neoantigens in Cancer

Frontiers in Oncology ◽

10.3389/fonc.2020.593989 ◽

2020 ◽

Vol 10 ◽

Author(s):

Chi Zhou ◽

Zhiting Wei ◽

Liye Zhang ◽

Zhaoyi Yang ◽

Qi Liu

Keyword(s):

Mass Spectrometry ◽

Rna Editing ◽

T Cell Activation ◽

Ovarian Tumor ◽

Cell Activation ◽

Immune Cell ◽

The Cancer Genome Atlas ◽

Mass Spectrometry Data ◽

Immune Cell Population ◽

Two Samples

A-to-I RNA editing can contribute to the transcriptomic and proteomic diversity of many diseases including cancer. It has been reported that peptides generated from RNA editing could be naturally presented by human leukocyte antigen (HLA) molecules and elicit CD8+ T cell activation. However, a systematical characterization of A-to-I RNA editing neoantigens in cancer is still lacking. Here, an integrated RNA-editing based neoantigen identification pipeline PREP (Prioritizing of RNA Editing-based Peptides) was presented. A comprehensive RNA editing neoantigen profile analysis on 12 cancer types from The Cancer Genome Atlas (TCGA) cohorts was performed. PREP was also applied to 14 ovarian tumor samples and two clinical melanoma cohorts treated with immunotherapy. We finally proposed an RNA editing neoantigen immunogenicity score scheme, i.e. REscore, which takes RNA editing level and infiltrating immune cell population into consideration. We reported variant peptide from protein IFI30 in breast cancer which was confirmed expressed and presented in two samples with mass spectrometry data support. We showed that RNA editing neoantigen could be identified from RNA-seq data and could be validated with mass spectrometry data in ovarian tumor samples. Furthermore, we characterized the RNA editing neoantigen profile of clinical melanoma cohorts treated with immunotherapy. Finally, REscore showed significant associations with improved overall survival in melanoma cohorts treated with immunotherapy. These findings provided novel insights of cancer biomarker and enhance our understanding of neoantigen derived from A-to-I RNA editing as well as more types of candidates for personalized cancer vaccines design in the context of cancer immunotherapy.

Comparison of classification methods that combine clinical data and high-dimensional mass spectrometry data

BMC Bioinformatics ◽

10.1186/s12859-014-0385-z ◽

2014 ◽

Vol 15 (1) ◽

Cited By ~ 4

Author(s):

Caroline Truntzer ◽

Elise Mostacci ◽

Aline Jeannin ◽

Jean-Michel Petit ◽

Patrick Ducoroy ◽

...

Keyword(s):

Mass Spectrometry ◽

Clinical Data ◽

Mass Spectrometry Data ◽

High Dimensional ◽

Classification Methods

ppx: Programmatic access to proteomics data repositories

10.1101/2021.05.29.446304 ◽

2021 ◽

Author(s):

William E Fondrie ◽

Wout Bittremieux ◽

William S Noble

Keyword(s):

Mass Spectrometry ◽

Open Science ◽

Mass Spectrometry Data ◽

Reproducible Research ◽

Easy Access ◽

Proteomics Data ◽

Data Repositories ◽

Access To Data ◽

Python Package ◽

Programmatic Access

The volume of proteomics and mass spectrometry data available in public repositories continues to grow at a rapid pace as more researchers embrace open science practices. Open access to the data behind scientific discoveries has become critical to validate published findings and develop new computational tools. Here, we present ppx, a Python package that provides easy, programmatic access to the data stored in ProteomeXchange repositories, such as PRIDE and MassIVE. The ppx package can either be used as a command line tool or a Python package to retrieve the files and metadata associated with a project when provided its identifier. To demonstrate how ppx enhances reproducible research, we used ppx within a Snakemake workflow to reanalyze a published dataset with the open modification search tool ANN-SoLo and compared our reanalysis to the original results. We show that ppx readily integrates into workflows and our reanalysis produced results consistent with the original analysis. We envision that ppx will be a valuable tool for creating reproducible analyses, providing tool developers easy access to data for development, testing, and benchmarking, and enabling the use of mass spectrometry data in data-intensive analyses. The ppx package is freely available and open source under the MIT license at: https://github.com/wfondrie/ppx

TCGA-Assembler 2: Software Pipeline for Retrieval and Processing of TCGA/CPTAC Data

10.1101/214320 ◽

2017 ◽

Cited By ~ 2

Author(s):

Lin Wei ◽

Zhilin Jin ◽

Shengjie Yang ◽

Yanxun Xu ◽

Yitan Zhu ◽

...

Keyword(s):

Data Storage ◽

Cancer Genomics ◽

Substantial Improvement ◽

The Cancer Genome Atlas ◽

Reproducible Research ◽

Proteomics Data ◽

Software Pipeline ◽

Storage And Retrieval ◽

Cancer Genome Atlas ◽

Integrate Data

Abstract Motivation The Cancer Genome Atlas (TCGA) program has produced huge amounts of cancer genomics data providing unprecedented opportunities for research. In 2014, we developed TCGA-Assembler (Zhu et al, 2014), a software pipeline for retrieval and processing of public TCGA data. In 2016, TCGA data were transferred from the TCGA data portal to the Genomic Data Commons (GDC), which is supported by a different set of data storage and retrieval mechanisms. In addition, new proteomics data of TCGA samples have been generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) program, which were not available for downloading through TCGA-Assembler. It is desirable to acquire and integrate data from both GDC and CPTAC. Results We develop TCGA-Assembler 2 (TA2) to automatically download and integrate data from GDC and CPTAC. We make substantial improvement on the functionality of TA2 to enhance user experience and software performance. TA2 together with its previous version have helped more than 2,000 researchers from 64 countries to access and utilize TCGA and CPTAC data in their research. Availability of TA2 will continue to allow existing and new users to conduct reproducible research based on TCGA and CPTAC data. Availability http://www.compgenome.org/TCGA-Assembler/ Contact [email protected] or [email protected]
