Population Substructure Has Implications in Validating Next-Generation Cancer Genomics Studies with TCGA

2019 ◽  
Vol 20 (5) ◽  
pp. 1192 ◽  
Author(s):  
Marina Miller ◽  
Eric Devor ◽  
Erin Salinas ◽  
Andreea Newtson ◽  
Michael Goodheart ◽  
...  

In the era of large genetic and genomic datasets, it has become crucially important to validate the results of individual studies against data from publicly available sources, such as The Cancer Genome Atlas (TCGA). But how generalizable are results from either an independent or a large public dataset to the remainder of the population? The study presented here aims to answer that question. Using next-generation sequencing data from endometrial and ovarian cancer patients from both the University of Iowa and TCGA, the genomic admixture of each population was analyzed with the STRUCTURE and ADMIXTURE software packages. In our independent dataset, one subpopulation was identified, whereas in TCGA 4–6 subpopulations were identified. The data presented here demonstrate how different the genetic substructures of the TCGA and University of Iowa populations are. Validation of genomic studies across two different population samples must therefore recognize, account for, and correct for background genetic substructure.
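As a rough illustration of the admixture summaries involved, the sketch below counts, for a single sample, how many ancestry components of an ADMIXTURE-style Q-matrix row exceed a threshold, and picks K by minimum cross-validation error. All proportions, thresholds, and CV errors here are made up for illustration:

```python
# Illustrative sketch (hypothetical data): summarising an ADMIXTURE-style
# Q matrix, where each row gives one sample's ancestry proportions across
# K inferred subpopulations. A homogeneous cohort shows one dominant
# component per sample; an admixed cohort such as TCGA shows several.

def effective_components(q_row, threshold=0.05):
    """Count ancestry components contributing more than `threshold`."""
    return sum(1 for p in q_row if p > threshold)

def choose_k(cv_errors):
    """Pick the K with the lowest ADMIXTURE cross-validation error."""
    return min(cv_errors, key=cv_errors.get)

# Hypothetical Q-matrix rows for K=3 (values invented for illustration).
q_homogeneous = [0.97, 0.02, 0.01]   # e.g. a single-subpopulation sample
q_admixed     = [0.50, 0.30, 0.20]   # e.g. a sample from an admixed cohort

print(effective_components(q_homogeneous))  # → 1
print(effective_components(q_admixed))      # → 3

# Hypothetical cross-validation errors over candidate values of K.
print(choose_k({1: 0.61, 2: 0.58, 4: 0.52, 6: 0.55}))  # → 4
```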

F1000Research ◽  
2020 ◽  
Vol 9 ◽  
pp. 1144
Author(s):  
Nan Xiao ◽  
Soner Koc ◽  
David Roberson ◽  
Phillip Brooks ◽  
Manisha Ray ◽  
...  

The BioCompute Object (BCO) standard is an IEEE standard (IEEE 2791-2020) designed to facilitate the communication of next-generation sequencing data analyses, with applications across academia, government agencies, and industry. For example, the Food and Drug Administration (FDA) supports the standard for regulatory submissions and includes it in their Data Standards Catalog for the submission of HTS data. We created the BCO App to facilitate BCO generation in a range of computational environments and, in part, to participate in the Advanced Track of the precisionFDA BioCompute Object App-a-thon. The application generates BCOs both from workflow metadata provided as plain text and from workflow contents written in the Common Workflow Language. It can also access and ingest task execution results from the Cancer Genomics Cloud (CGC), an NCI-funded computational platform. Creating a BCO from a CGC task significantly reduces the time required to generate a BCO on the CGC by auto-populating workflow information fields from CGC workflow and task execution results. The BCO App supports exporting BCOs as JSON or PDF files and publishing BCOs to both the CGC platform and GitHub repositories.
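A minimal sketch of the JSON skeleton a BCO carries, assuming the top-level domain names of the IEEE 2791 schema. All field values below are placeholders, and a real BCO (as produced by the BCO App) must validate against the full schema:

```python
import json

# Skeleton of an IEEE 2791-2020 BioCompute Object, assembled as plain
# JSON. Every value here is a placeholder for illustration; the BCO App
# automates populating these domains from workflow and task metadata.
bco = {
    "object_id": "https://example.org/bco/BCO_000001",  # placeholder ID
    "spec_version": "https://w3id.org/ieee/ieee-2791-schema/2791object.json",
    "etag": "",  # checksum of the object, computed after assembly
    "provenance_domain": {
        "name": "Example NGS analysis",
        "version": "1.0",
        "created": "2020-01-01T00:00:00Z",
        "modified": "2020-01-01T00:00:00Z",
        "contributors": [{"name": "Jane Doe", "contribution": ["authoredBy"]}],
        "license": "https://spdx.org/licenses/CC-BY-4.0.html",
    },
    "usability_domain": ["Illustrative pipeline description."],
    "description_domain": {"keywords": ["HTS"], "pipeline_steps": []},
    "execution_domain": {
        "script": [],
        "script_driver": "shell",
        "software_prerequisites": [],
        "external_data_endpoints": [],
        "environment_variables": {},
    },
    "io_domain": {"input_subdomain": [], "output_subdomain": []},
}

# Serialise to JSON, the export format the BCO App supports.
print(sorted(bco))  # top-level BCO fields
```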


2019 ◽  
Vol 11 (1) ◽  
Author(s):  
Chi Zhou ◽  
Zhiting Wei ◽  
Zhanbing Zhang ◽  
Biyu Zhang ◽  
Chenyu Zhu ◽  
...  

Abstract

Background: Cancer neoantigens are expressed only in cancer cells and are presented on the tumor cell surface in complex with major histocompatibility complex (MHC) class I proteins for recognition by cytotoxic T cells. Accurate and rapid identification of neoantigens plays a pivotal role in cancer immunotherapy. Although several in silico tools for neoantigen prediction have been presented, these tools have limitations.

Results: We developed pTuneos, a computational pipeline for prioritizing tumor neoantigens from next-generation sequencing data. We tested the performance of pTuneos on melanoma cancer-vaccine cohort data and tumor-infiltrating lymphocyte (TIL)-recognized neopeptide data. pTuneos predicts the MHC presentation and T cell recognition ability of candidate neoantigens, as well as the actual immunogenicity of single-nucleotide variant (SNV)-based neopeptides accounting for their natural processing and presentation, surpassing existing tools in a comprehensive, quantitative benchmark of neoantigen prioritization performance and running time. pTuneos was further tested on The Cancer Genome Atlas (TCGA) cohort data, as well as on melanoma and non-small cell lung cancer (NSCLC) cohorts undergoing checkpoint-blockade immunotherapy. The overall neoantigen immunogenicity score proposed by pTuneos proves to be a powerful pan-cancer marker for survival prediction compared with traditional, well-established biomarkers.

Conclusions: In summary, pTuneos provides a state-of-the-art, one-stop, user-friendly solution for prioritizing SNV-based candidate neoepitopes, which could help advance research on next-generation cancer immunotherapies and personalized cancer vaccines. pTuneos is available at https://github.com/bm2-lab/pTuneos, with a Docker version for quick deployment at https://cloud.docker.com/u/bm2lab/repository/docker/bm2lab/ptuneos.
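As a caricature of the inputs such prioritization pipelines combine, the toy score below mixes predicted MHC binding affinity, expression, and a mutant-to-wild-type binding ratio. This is an invented formula for illustration only, not pTuneos's actual model:

```python
import math

# Toy prioritisation score for SNV-derived neopeptides. NOT pTuneos's
# model: it only illustrates the kinds of inputs such pipelines combine,
# i.e. predicted MHC binding affinity, mutant-peptide expression, and an
# agretopicity-style ratio of wild-type to mutant binding affinity.

def binding_score(ic50_nm, midpoint=500.0):
    """Map an IC50 (nM) to (0, 1); tighter binding (lower IC50) scores higher."""
    return 1.0 / (1.0 + ic50_nm / midpoint)

def toy_priority(mut_ic50, wt_ic50, tpm):
    agretopicity = wt_ic50 / mut_ic50    # >1 means the mutant binds better
    expression = math.log2(tpm + 1)      # dampen the expression range
    return binding_score(mut_ic50) * min(agretopicity, 10.0) * expression

# A strong binder (50 nM) whose wild-type counterpart binds poorly, from
# a well-expressed gene, outranks a weak, barely expressed candidate.
strong = toy_priority(mut_ic50=50, wt_ic50=5000, tpm=30)
weak = toy_priority(mut_ic50=2000, wt_ic50=2000, tpm=1)
print(strong > weak)  # → True
```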


2018 ◽  
Vol 17 ◽  
pp. 117693511877478 ◽  
Author(s):  
Jovan Cejovic ◽  
Jelena Radenkovic ◽  
Vladimir Mladenovic ◽  
Adam Stanojevic ◽  
Milica Miletic ◽  
...  

Increased efforts in cancer genomics research and bioinformatics are producing tremendous amounts of data. These data are diverse in origin, format, and content. As the amount of available sequencing data increases, technologies that make those data discoverable and usable are critically needed. In response, we have developed a Semantic Web–based Data Browser, a tool that allows users to visually build and execute ontology-driven queries. This approach simplifies access to available data and improves the process of using them in analyses on the Seven Bridges Cancer Genomics Cloud (CGC; www.cancergenomicscloud.org). The Data Browser makes large data sets easily explorable and simplifies the retrieval of specific data of interest. Although initially implemented on top of The Cancer Genome Atlas (TCGA) data set, the Data Browser's architecture allows for seamless integration of other data sets. By deploying it on the CGC, we have enabled remote researchers to access data and perform collaborative investigations.
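The kind of ontology-driven query the Data Browser builds visually (for example, cases of a given disease that have files from a given experimental strategy) can be sketched as a filter over an in-memory catalogue. The schema and records below are hypothetical:

```python
# Illustrative sketch (hypothetical schema): a query over linked
# case/file metadata, of the sort the Data Browser composes visually
# and executes against TCGA metadata on the CGC.

catalog = [
    {"case": "C1", "disease": "Ovarian Serous Cystadenocarcinoma",
     "files": [{"strategy": "RNA-Seq"}, {"strategy": "WXS"}]},
    {"case": "C2", "disease": "Uterine Corpus Endometrial Carcinoma",
     "files": [{"strategy": "WXS"}]},
]

def cases_with(disease_substr, strategy):
    """Cases whose disease matches and that have a file of `strategy`."""
    return [c["case"] for c in catalog
            if disease_substr in c["disease"]
            and any(f["strategy"] == strategy for f in c["files"])]

print(cases_with("Ovarian", "RNA-Seq"))       # → ['C1']
print(cases_with("Endometrial", "RNA-Seq"))   # → []
```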


2019 ◽  
Author(s):  
Tingting Gong ◽  
Vanessa M Hayes ◽  
Eva KF Chan

Abstract: Somatic structural variants (SVs) play a significant role in cancer development and evolution, but are notoriously more difficult to detect than small variants from short-read next-generation sequencing (NGS) data. This is due to a combination of challenges attributed to the purity of tumour samples, tumour heterogeneity, the limitations of short-read information from NGS, and sequence alignment ambiguities. In spite of active development of SV detection tools (callers) over the past few years, each method has inherent advantages and limitations. In this review, we highlight some of the important factors affecting somatic SV detection and compare the performance of eight commonly used SV callers. In particular, we focus on the extent of change in sensitivity and precision for detecting different SV types and size ranges from samples with differing variant allele frequencies and sequencing depths of coverage. We highlight the reasons why some SV callers perform well in some settings but not others, allowing our evaluation findings to be extended beyond the eight SV callers examined in this paper. As the importance of large structural variants becomes increasingly recognised in cancer genomics, this paper provides a timely review of some of the most impactful factors influencing somatic SV detection and guidance on selecting an appropriate SV caller.
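The sensitivity and precision comparison such benchmarks rest on can be sketched generically. The matching rule below (same SV type, both breakpoints within a tolerance window) and all coordinates are illustrative assumptions, not the review's exact evaluation protocol:

```python
# Sketch of the sensitivity/precision computation used when benchmarking
# SV callers against a truth set. A called SV matches a truth SV when
# the type agrees and both breakpoints fall within a tolerance window.
# All variants and the tolerance value are made up for illustration.

def matches(call, truth, tol=200):
    return (call["type"] == truth["type"]
            and abs(call["start"] - truth["start"]) <= tol
            and abs(call["end"] - truth["end"]) <= tol)

def benchmark(calls, truths, tol=200):
    tp = sum(any(matches(c, t, tol) for c in calls) for t in truths)
    fp = sum(not any(matches(c, t, tol) for t in truths) for c in calls)
    sensitivity = tp / len(truths) if truths else 0.0
    precision = (len(calls) - fp) / len(calls) if calls else 0.0
    return sensitivity, precision

truths = [{"type": "DEL", "start": 10_000, "end": 15_000},
          {"type": "DUP", "start": 40_000, "end": 48_000}]
calls = [{"type": "DEL", "start": 10_050, "end": 14_980},   # true positive
         {"type": "INV", "start": 70_000, "end": 71_000}]   # false positive

sens, prec = benchmark(calls, truths)
print(sens, prec)  # → 0.5 0.5  (one of two truths found; one of two calls real)
```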


2019 ◽  
Vol 2 (4) ◽  
pp. e201900336 ◽  
Author(s):  
Eric Olivier Audemard ◽  
Patrick Gendron ◽  
Albert Feghaly ◽  
Vincent-Philippe Lavallée ◽  
Josée Hébert ◽  
...  

Mutations identified in acute myeloid leukemia patients are useful for prognosis and for selecting targeted therapies. Detection of such mutations using next-generation sequencing data requires a computationally intensive read mapping step followed by several variant calling methods. Targeted mutation identification drastically shifts the usual tradeoff between accuracy and performance by concentrating all computations over a small portion of sequence space. Here, we present km, an efficient approach leveraging k-mer decomposition of reads to identify targeted mutations. Our approach is versatile, as it can detect single-base mutations, several types of insertions and deletions, as well as fusions. We used two independent cohorts (The Cancer Genome Atlas and Leucegene) to show that mutation detection by km is fast, accurate, and mainly limited by sequencing depth. Therefore, km allows the establishment of fast diagnostics from next-generation sequencing data and could be suitable for clinical applications.
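The k-mer idea behind km can be illustrated with a minimal sketch. This is not km's actual algorithm; it only shows how k-mer decomposition of reads can test support for a targeted mutant sequence without a read-mapping step. The sequences below are made up:

```python
# Illustrative sketch of targeted mutation detection via k-mers:
# decompose reads into a k-mer set, then ask whether every k-mer of a
# targeted (e.g. mutant) sequence is present in that set. This replaces
# genome-wide read mapping with lookups over a small sequence space.

def kmers(seq, k):
    """All k-length substrings of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def supported(target, read_kmers, k):
    """True if every k-mer of `target` appears in the read k-mer set."""
    return all(mer in read_kmers for mer in kmers(target, k))

k = 5
reads = ["ACGTGCATT", "GTGCATTAC", "CATTACGGA"]  # overlapping toy reads
read_kmers = set().union(*(kmers(r, k) for r in reads))

print(supported("ACGTGCATTACGG", read_kmers, k))  # → True  (covered by reads)
print(supported("ACGTGGATTACGG", read_kmers, k))  # → False (SNV not in reads)
```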


Author(s):  
Tingting Gong ◽  
Vanessa M Hayes ◽  
Eva K F Chan

Abstract: Somatic structural variants (SVs), which are variants that typically impact >50 nucleotides, play a significant role in cancer development and evolution but are notoriously more difficult to detect than small variants from short-read next-generation sequencing (NGS) data. This is due to a combination of challenges attributed to the purity of tumour samples, tumour heterogeneity, the limitations of short-read information from NGS and sequence alignment ambiguities. In spite of active development of SV detection tools (callers) over the past few years, each method has inherent advantages and limitations. In this review, we highlight some of the important factors affecting somatic SV detection and compare the performance of seven commonly used SV callers. In particular, we focus on the extent of change in sensitivity and precision for detecting different SV types and size ranges from samples with differing variant allele frequencies and sequencing depths of coverage. We highlight the reasons why some SV callers perform well in some settings but not others, allowing our evaluation findings to be extended beyond the seven SV callers examined in this paper. As the importance of large SVs becomes increasingly recognized in cancer genomics, this paper provides a timely review of some of the most impactful factors influencing somatic SV detection that should be considered when choosing SV callers.


Biostatistics ◽  
2018 ◽  
Vol 21 (3) ◽  
pp. 577-593
Author(s):  
Sixing Chen ◽  
Xihong Lin

Summary: With the advent of next-generation sequencing, investigators have access to higher-quality sequencing data. However, sequencing all samples in a study with next-generation sequencing can still be prohibitively expensive. One potential remedy is to combine next-generation sequencing data from cases with publicly available sequencing data for controls, but there can be systematic differences in the quality of the sequenced data, such as sequencing depths, between the sequenced study cases and the publicly available controls. We propose a regression calibration (RC)-based method and a maximum-likelihood method for conducting an association study with such a combined sample, accounting for differential sequencing errors between cases and controls. The methods allow adjustment for covariates, such as population stratification, as confounders. Both methods control type I error and have power comparable to an analysis conducted using the true genotypes with sufficiently high but different sequencing depths. We show that, under certain circumstances, the RC method allows analysis with a naive variance estimate (which closely approximates the true variance in practice) and standard software. We evaluate the performance of the proposed methods using simulation studies and apply them to a combined data set of exome-sequenced acute lung injury cases and healthy controls from the 1000 Genomes Project.
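The genotype-uncertainty problem that motivates this work can be illustrated with a toy posterior-dosage calculation. The error model, Hardy-Weinberg prior, and all numbers below are simplifying assumptions for illustration, not the paper's RC or maximum-likelihood estimators:

```python
# At low sequencing depth the genotype is uncertain, so instead of a
# hard call one can use the posterior expected genotype (dosage) given
# the reads. Depth-dependent shrinkage of this dosage is the systematic
# case/control difference such methods must account for. The per-read
# error model and Hardy-Weinberg prior here are deliberate
# simplifications.

def dosage(n_ref, n_alt, maf, err=0.01):
    """Posterior mean genotype (0/1/2 alt alleles) from read counts."""
    # P(alt read | genotype) for genotypes 0, 1, 2 alt alleles.
    p_alt = [err, 0.5, 1.0 - err]
    prior = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]  # HWE prior
    lik = [prior[g] * (p_alt[g] ** n_alt) * ((1 - p_alt[g]) ** n_ref)
           for g in range(3)]
    total = sum(lik)
    return sum(g * lik[g] for g in range(3)) / total

# Deep coverage pins the genotype down; shallow coverage shrinks the
# dosage toward the prior, even at the same 50% alt-read fraction.
print(round(dosage(n_ref=15, n_alt=15, maf=0.2), 3))  # confident het, ≈ 1
print(round(dosage(n_ref=1, n_alt=1, maf=0.2), 3))    # uncertain, < 1
```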

