Substantial Batch Effects in TCGA Exome Sequences Undermine Pan-Cancer Analysis of Germline Variants

2018 ◽  
Author(s):  
Roni Rasnic ◽  
Nadav Brandes ◽  
Or Zuk ◽  
Michal Linial

ABSTRACT Background: In recent years, research on cancer predisposition germline variants has emerged as a prominent field. The identification of somatic mutations is based on a reliable mapping of the patient's germline variants. In addition, the statistics of germline variant frequencies in healthy individuals and cancer patients are the basis for seeking candidate cancer predisposition genes. The Cancer Genome Atlas (TCGA) is one of the main sources of such data, providing a diverse collection of molecular data including deep sequencing for more than 30 types of cancer from >10,000 patients. Methods: Our hypothesis in this study is that whole exome sequences from healthy blood samples of cancer patients are not expected to show systematic differences among cancer types. To test this hypothesis, we analyzed common and rare germline variants across six cancer types, covering 2,241 samples from TCGA. In our analysis we accounted for inherent variables in the data including the different variant calling protocols, sequencing platforms, and ethnicity. Results: We report on substantial batch effects in germline variants associated with cancer types. We attribute the effect to the specific sequencing centers that produced the data. Specifically, we measured 30% variability in the number of reported germline variants per sample across sequencing centers. The batch effect is further expressed in nucleotide composition and variant frequencies. Importantly, the batch effect causes substantial differences in germline variant distribution patterns across numerous genes, including prominent cancer predisposition genes such as BRCA1, RET, MAX, and KRAS. For most known cancer predisposition genes, we found a distinct batch-dependent difference in germline variants. Conclusion: TCGA germline data are subject to strong batch effects, with substantial variability among TCGA sequencing centers. We claim that those batch effects are consequential for numerous TCGA pan-cancer studies. In particular, these effects may compromise the reliability of such studies and their ability to detect new cancer predisposition genes. Furthermore, interpretation of pan-cancer analyses should be revisited in view of the source of the genomic data after accounting for the reported batch effects.
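As a rough illustration of the kind of check described above, the sketch below groups per-sample germline variant counts by sequencing center and reports how far apart the center means are. The abstract does not state how the 30% figure was computed, so the relative-spread definition used here, the function name `center_variability`, and the toy counts are all assumptions for illustration, not the authors' method or data.

```python
from collections import defaultdict
from statistics import mean


def center_variability(samples):
    """Summarize per-center variant counts.

    samples: iterable of (sequencing_center, n_germline_variants) pairs.
    Returns (per-center mean counts, relative spread of those means),
    where relative spread = (max mean - min mean) / min mean.
    This definition is an assumption; the abstract does not specify one.
    """
    by_center = defaultdict(list)
    for center, n_variants in samples:
        by_center[center].append(n_variants)
    center_means = {c: mean(v) for c, v in by_center.items()}
    lowest = min(center_means.values())
    highest = max(center_means.values())
    return center_means, (highest - lowest) / lowest


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
toy_samples = [
    ("center_A", 20000), ("center_A", 21000),
    ("center_B", 26000), ("center_B", 27300),
]
means, spread = center_variability(toy_samples)
print(means)
print(f"relative spread across centers: {spread:.0%}")
```

A spread that tracks the sequencing center rather than any biological grouping is the signature of a batch effect; in practice one would also stratify by variant caller, platform, and ancestry, as the abstract notes.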

2016 ◽  
Author(s):  
Alexandra R. Buckley ◽  
Kristopher A. Standish ◽  
Kunal Bhutani ◽  
Trey Ideker ◽  
Hannah Carter ◽  
...  

Abstract: The degree to which germline variation drives cancer development and shapes tumor phenotypes remains largely unexplored, possibly due to a lack of large-scale publicly available germline data for a cancer cohort. Here we called germline variants on 9,618 cases from The Cancer Genome Atlas (TCGA) database representing 31 cancer types. We identified batch effects affecting loss of function (LOF) variant calls that can be traced back to differences in the way the sequence data were generated both within and across cancer types. Overall, LOF indel calls were more sensitive to technical artifacts than LOF Single Nucleotide Variant (SNV) calls. In particular, whole genome amplification of DNA prior to sequencing led to an artificially increased burden of LOF indel calls, which confounded association analyses relating germline variants to tumor type despite stringent indel filtering strategies. Due to the inherent noise we chose to remove all 614 amplified DNA samples, including all acute myeloid leukemia and virtually all ovarian cancer samples, from the final dataset. This study demonstrates how insufficient quality control can lead to false positive germline-tumor type associations and draws attention to the need to be sensitive to problems associated with a lack of uniformity in data generation in TCGA data. Author Summary: Cancer research to date has largely focused on genetic aberrations specific to tumor tissue. In contrast, the degree to which germline, or inherited, variation contributes to tumorigenesis remains unclear, possibly due to a lack of accessible germline variant data. In this study we identify germline variants in 9,618 samples using raw germline exome data from The Cancer Genome Atlas (TCGA). There are substantial differences in the way exome sequence data were generated both across and within cancer types in TCGA. We observe that differences in sequence data generation introduced batch effects, that is, variation due to technical factors rather than true biological variation, in our variant data. Most notably, we observe that amplification of DNA prior to sequencing resulted in an excess of predicted damaging indel variants. We show how these batch effects can confound germline association analyses if not properly addressed. Our study highlights the difficulties of working with large public genomic datasets like TCGA, where samples are collected over time and across data centers, and particularly cautions against the use of amplified DNA samples for genetic association analyses.


2021 ◽  
pp. 1-10
Author(s):  
Zoe Guan ◽  
Ronglai Shen ◽  
Colin B. Begg

Background: Many cancer types show considerable heritability, and extensive research has been done to identify germline susceptibility variants. Linkage studies have discovered many rare high-risk variants, and genome-wide association studies (GWAS) have discovered many common low-risk variants. However, it is believed that a considerable proportion of the heritability of cancer remains unexplained by known susceptibility variants. The “rare variant hypothesis” proposes that much of the missing heritability lies in rare variants that cannot reliably be detected by linkage analysis or GWAS. Until recently, high sequencing costs have precluded extensive surveys of rare variants, but technological advances have now made it possible to analyze rare variants on a much greater scale. Objectives: In this study, we investigated associations between rare variants and 14 cancer types. Methods: We ran association tests using whole-exome sequencing data from The Cancer Genome Atlas (TCGA) and validated the findings using data from the Pan-Cancer Analysis of Whole Genomes Consortium (PCAWG). Results: We identified four significant associations in TCGA, only one of which was replicated in PCAWG (BRCA1 and ovarian cancer). Conclusions: Our results provide little evidence in favor of the rare variant hypothesis. Much larger sample sizes may be needed to detect undiscovered rare cancer variants.


2021 ◽  
Vol 39 (15_suppl) ◽  
pp. e14576-e14576
Author(s):  
Xinlu Liu ◽  
Jiasheng Xu ◽  
Jian Sun ◽  
Deng Wei ◽  
Xinsheng Zhang ◽  
...  

e14576 Background: Clinically, microsatellite instability (MSI) has been used as an important molecular marker for the prognosis of colorectal cancer and other solid tumors, for the formulation of adjuvant treatment plans, and to assist in screening for Lynch syndrome. However, there are currently few reports on the incidence of MSI-H in Chinese pan-cancer patients. This study describes the occurrence of MSI in a large multi-center pan-cancer cohort in China and explores the correlation between MSI and patients' TMB, age, PD-L1 expression and other indicators. Methods: The study included 8361 patients with 8 cancer types from multiple tumor centers. Immunohistochemistry was used to detect the expression of MMR proteins (MLH1, MSH2, MSH6 and PMS2) in patients with the various cancer types to determine MSI status, and to detect the expression of PD-L1. Using NGS, 831 genes were sequenced in the 8361 Chinese cancer patients and the tumor mutation burden (TMB) of each patient was calculated. MSI status was analyzed across the 8 cancer types, together with its correlation with patient age, TMB and PD-L1 expression. Results: MSI patients accounted for 1.66% of the pan-cancer cohort. MSI-H patients accounted for the highest proportion in intestinal cancer, reaching 7.2%. A correlation analysis between MSI and TMB was performed for each cancer type. In each cancer type, MSI-H patients had TMB greater than 10, and among colorectal cancer patients 26.83% of MSI-H patients had TMB greater than 100. There was no significant correlation between patient age and the risk of MSI (P > 0.05). With the exception of PAAD and LUAD, the expression of PD-L1 in MSI-H patients was higher than that in MSS patients across cancer types (P < 0.05). The correlation analysis between PD-L1 expression and TMB found that in colorectal cancer, the higher the expression of PD-L1, the higher the patient's TMB (P < 0.05). Conclusions: In this study, we explored the incidence of MSI-H in pan-cancer patients in China and found that TMB was greater than 10 in patients with MSI-H. Compared with MSS patients, MSI-H patients had higher PD-L1 expression, and in colorectal cancer, higher PD-L1 expression was associated with higher TMB.


2021 ◽  
Vol 39 (15_suppl) ◽  
pp. 10544-10544
Author(s):  
Tiancheng Han ◽  
Yuanyuan Hong ◽  
Pei Zhihua ◽  
Song Xiaofeng ◽  
Jianing Yu ◽  
...  

10544 Background: Screening for biomarkers in cell-free DNA (cfDNA) from peripheral blood is a non-invasive and promising method for cancer diagnosis. Among diverse types of biomarkers, epigenetic biomarkers have been reported to be among the most promising. Epigenetic modifications are widespread in the human genome and generally give strong signals because adjacent CpG sites share similar methylation patterns. Although some epigenetic diagnostic methods based on cfDNA have been developed, few of them can be applied pan-cancer, and their sensitivities are barely satisfactory for early cancer detection. Methods: Targeted methylation sequencing was performed using our in-house-designed panel targeting regions with abundant cancer-specific methylated CpGs. The cfDNA samples from 80 healthy individuals and 549 cancer patients of 14 cancer types were separately sequenced. The dataset was randomly split into one discovery dataset and one validation dataset. Moreover, cfDNA samples from four cancer patients were diluted with healthy cfDNA to generate 12 in vitro simulated samples with low circulating tumor DNA (ctDNA) fraction. Additionally, DNA extracted from 130 unmatched formalin-fixed paraffin-embedded (FFPE) tumor samples of 10 cancer types was sequenced to screen for diagnostic biomarkers. Adjacent CpG sites were first merged into methylation-correlated blocks (MCBs) according to the correlations of their methylation levels in tumor DNA. The MCBs with higher methylation levels in tumor DNA than in healthy cfDNA (from the discovery dataset) were defined as our hypermethylation biomarkers. For each cfDNA sample, a hypermethylation score (HM-score) was computed to measure the overall methylation level difference of the selected biomarkers. The performance of our method was evaluated with the real-world dataset, while the limit of detection was estimated using the simulated low-ctDNA samples. Results: Our model based on 37 hypermethylation MCB biomarkers achieved an area under the curve (AUC) of 0.89 and 0.86 in the real-world pan-cancer discovery and validation cfDNA datasets, respectively. Furthermore, the overall specificity and sensitivity were 100% and 76.19% in the discovery dataset, and 96.67% and 72.86% in the validation dataset. In the validation dataset, 28/40 (70%) of early-stage colorectal cancer patients and 10/20 (50%) of non-small-cell lung cancer patients were successfully diagnosed. Additionally, all the simulated samples with theoretical ctDNA fractions over 0.5% were predicted as diseased, demonstrating the ability of our method to detect tumor signals at early stages. Conclusions: Our cfDNA-based epigenetic method outperforms currently available methods in various cancer types and is promising for early-stage cancer detection and for samples with low ctDNA fractions.
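For readers who want to check the detection rates quoted above, the sketch below gives the standard definitions of sensitivity and specificity and applies the sensitivity formula to the two per-cancer counts that the abstract does report (28/40 and 10/20). The helper names are illustrative assumptions; the confusion-matrix counts behind the overall 76.19%, 72.86%, 100% and 96.67% figures are not given in the abstract and are not reconstructed here.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of diseased samples that the test calls diseased."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of healthy samples that the test calls healthy."""
    return true_neg / (true_neg + false_pos)


def as_percent(fraction, digits=2):
    """Convert a fraction to a rounded percentage."""
    return round(100 * fraction, digits)


# Counts reported in the abstract for the validation dataset.
crc_detected, crc_total = 28, 40      # early-stage colorectal cancer
nsclc_detected, nsclc_total = 10, 20  # non-small-cell lung cancer

crc_sens = as_percent(sensitivity(crc_detected, crc_total - crc_detected))
nsclc_sens = as_percent(sensitivity(nsclc_detected, nsclc_total - nsclc_detected))
print(crc_sens, nsclc_sens)
```

Both values agree with the percentages stated in the abstract (70% and 50%).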


2019 ◽  
pp. 1-11
Author(s):  
Zade Akras ◽  
Brandon Bungo ◽  
Brandie H. Leach ◽  
Jessica Marquard ◽  
Manmeet Ahluwalia ◽  
...  

PURPOSE It has been estimated that 5% to 10% of cancers are due to hereditary causes. Recent data sets indicate that the incidence of hereditary cancer may be as high as 17.5% in patients with cancer, and a notable subset is missed if screening is solely by family history and current syndrome-based testing guidelines. Identification of germline variants has implications for both patients and their families. There is currently no comprehensive overview of cancer susceptibility genes or inclusion of these genes in commercially available somatic testing. We aimed to summarize genes linked to hereditary cancer and the somatic and germline panels that include such genes. METHODS Germline predisposition genes were chosen if commercially available for testing. Penetrance was defined as low, moderate, or high according to whether the gene conferred a 0% to 20%, 20% to 50%, or 50% to 100% lifetime risk of developing the cancer or, when percentages were not available, was estimated on the basis of existing literature descriptions. RESULTS We identified a total of 89 genes linked to hereditary cancer predisposition, and we summarized these genes alphabetically and by organ system. We considered four germline and six somatic commercially available panel tests and quantified the coverage of germline genes across them. Comparison between the number of genes that had germline importance and the number of genes included in somatic testing showed that many but not all germline genes are tested by frequently used somatic panels. CONCLUSION The inclusion of cancer-predisposing genes in somatic variant testing panels makes incidental germline findings likely. Although somatic testing can be used to screen for germline variants, this strategy is inadequate for comprehensive screening. Access to genetic counseling is essential for interpretation of germline implications of somatic testing and implementation of appropriate screening and follow-up.
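The penetrance tiers in the methods can be expressed as a small lookup. This is a minimal sketch: the abstract's ranges share their endpoints (20% and 50%), so assigning a boundary value to the higher tier is an assumption made here, and the function name `penetrance_tier` is hypothetical.

```python
def penetrance_tier(lifetime_risk_pct):
    """Map a lifetime cancer risk (in percent) to a penetrance tier.

    Tiers follow the abstract: 0-20% low, 20-50% moderate, 50-100% high.
    Assumption: a value exactly on a boundary goes to the higher tier.
    """
    if not 0 <= lifetime_risk_pct <= 100:
        raise ValueError("lifetime risk must be between 0 and 100 percent")
    if lifetime_risk_pct < 20:
        return "low"
    if lifetime_risk_pct < 50:
        return "moderate"
    return "high"


# Illustrative risk values, not taken from the paper.
for risk in (10, 35, 80):
    print(risk, penetrance_tier(risk))
```

Genes without a published percentage were tiered from literature descriptions in the study, which a numeric rule like this cannot capture.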


2021 ◽  
Vol 12 ◽  
Author(s):  
Hua Zhu ◽  
Xinyao Hu ◽  
Yingze Ye ◽  
Zhihong Jian ◽  
Yi Zhong ◽  
...  

Phosphatidylinositol binding clathrin assembly protein interacting mitotic regulator (PIMREG) localizes to the nucleus and can significantly elevate the nuclear localization of clathrin assembly lymphomedullary leukocythemia gene. Although there is some evidence to support an important role for PIMREG in the occurrence and development of certain cancers, no pan-cancer analysis of PIMREG is currently available. Therefore, we aimed to estimate the prognostic predictive value of PIMREG and to explore its potential immune function in 33 cancer types. Using a series of bioinformatics approaches, we extracted and analyzed datasets from Oncomine, The Cancer Genome Atlas, the Cancer Cell Line Encyclopedia (CCLE) and the Human Protein Atlas (HPA) to explore the role of PIMREG in carcinogenesis, including the relevance of PIMREG to prognosis, microsatellite instability (MSI), tumor mutation burden (TMB), tumor microenvironment (TME) and infiltration of immune cells in various types of cancer. Our findings indicate that PIMREG is highly expressed in at least 24 types of cancer, and is negatively correlated with prognosis in major cancer types. In addition, PIMREG expression was correlated with TMB in 24 cancers and with MSI in 10 cancers. We revealed that PIMREG is co-expressed with genes encoding major histocompatibility complex, immune activation, immune suppression, chemokine and chemokine receptors. We also found that PIMREG plays different roles in the infiltration of different immune cell types in different tumors. PIMREG can potentially influence the etiology or pathogenesis of cancer by acting on immune-related pathways, including the chemokine signaling pathway, regulation of autophagy, RIG-I like receptor signaling pathway, antigen processing and presentation, FC epsilon RI pathway, complement and coagulation cascades, T cell receptor pathway, and NK cell mediated cytotoxicity. 
Our study suggests that PIMREG can be applied as a prognostic marker in a variety of malignancies because of its role in tumorigenesis and immune infiltration.


2021 ◽  
Vol 39 (15_suppl) ◽  
pp. e15074-e15074
Author(s):  
Yamin Zhang ◽  
Zilin Cui ◽  
Rui Shi ◽  
Xiaolong Liu ◽  
Yang Li ◽  
...  

e15074 Background: CDK4/6 kinases associate with cyclin D proteins during the transition from G1 to S phase of the cell cycle. Amplification of CDK4/6 may elicit the activity of cyclin D, which hyperphosphorylates RB, ultimately leading to uncontrolled cell proliferation. Currently, three CDK4/6 inhibitors are used in breast cancer, ovarian cancer and sarcoma. Herein, we investigate the prevalence of CDK4/6 amplification in Chinese and Western cancer patients, hoping to find more cancer subtypes with CDK4/6 amplification. Methods: Next-generation sequencing data and clinical data were collected from 10828 TCGA pan-cancer patients (Western cohort). A 539-gene panel targeted sequencing assay was performed on FFPE tumor samples from 4181 Chinese pan-cancer patients (Chinese cohort). CDK4 and CDK6 amplification were calculated in the two cohorts following the same criteria. Results: In total, 182 (4.4%) of the 4181 Chinese patients and 529 (4.9%) of the 10828 Western patients had CDK4 amplification, and 133 (3.2%) of the 4181 Chinese patients and 475 (4.4%) of the 10828 Western patients had CDK6 amplification. In the Western cohort, the top 5 CDK4 amplification-associated cancer types were sarcoma, glioblastoma multiforme, lung adenocarcinoma, ovarian carcinoma, and adrenocortical carcinoma, and the top 5 CDK6 amplification-associated cancer types were esophageal carcinoma, ovarian carcinoma, lung squamous cell carcinoma, stomach adenocarcinoma, and sarcoma. In the Chinese cohort, the top 5 CDK4 amplification-associated cancer types were lung adenocarcinoma, melanoma, sarcoma, stomach carcinoma, and liver cancer, and the top 5 CDK6 amplification-associated cancer types were lung adenocarcinoma, stomach carcinoma, liver cancer, melanoma, and glioma. In addition, in the Chinese cohort, 22 (11%) of the 203 bone and soft tissue sarcoma patients had CDK4 amplification, and 4 (2%) of the 203 had CDK6 amplification. Bone and soft tissue sarcoma types with CDK4/6 amplification included soft tissue sarcoma, bone cancer, fibrosarcoma, chondrosarcoma, rhabdomyosarcoma, liposarcoma, and synovial sarcoma. Conclusions: Our study characterized CDK4/6 amplification in Chinese and Western pan-cancer patients. Analysis revealed frequent CDK4/6 amplification in lung cancer, sarcoma, stomach carcinoma, ovarian carcinoma and liver cancer. This suggests that patients with these cancer types may potentially benefit from CDK4/6 inhibitors.
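The prevalence figures above are simple proportions, and recomputing them from the reported counts is a quick consistency check. The sketch below does that; the helper name `prevalence_pct` and the dictionary layout are illustrative assumptions, while the counts themselves are copied from the abstract.

```python
def prevalence_pct(cases, total, digits=1):
    """Percentage of patients carrying the amplification, rounded."""
    return round(100 * cases / total, digits)


# (cases, cohort size) as reported in the abstract.
reported = {
    ("CDK4", "Chinese"): (182, 4181),
    ("CDK4", "Western"): (529, 10828),
    ("CDK6", "Chinese"): (133, 4181),
    ("CDK6", "Western"): (475, 10828),
}

for (gene, cohort), (cases, total) in reported.items():
    print(f"{gene} amplification, {cohort} cohort: {prevalence_pct(cases, total)}%")

# Sarcoma subgroup in the Chinese cohort (203 patients), rounded to whole percent.
print(prevalence_pct(22, 203, digits=0), prevalence_pct(4, 203, digits=0))
```

The recomputed values match the percentages quoted in the abstract (4.4%, 4.9%, 3.2%, 4.4%, and 11% and 2% for the sarcoma subgroup).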


2020 ◽  
Vol 21 (17) ◽  
pp. 6087
Author(s):  
Yunzhen Wei ◽  
Limeng Zhou ◽  
Yingzhang Huang ◽  
Dianjing Guo

Long noncoding RNA (lncRNA)/microRNA (miRNA)/mRNA triplets contribute to cancer biology. However, identifying significant triplets remains a major challenge for cancer research, and the dynamic changes among the components of the triplets are poorly understood. Here, by integrating target information and expression datasets, we propose a novel computational framework to identify triplets termed “lncRNA-perturbated triplets”. We applied the framework to five cancer datasets in The Cancer Genome Atlas (TCGA) project and identified 109 triplets. We showed that the paired miRNAs and mRNAs were widely perturbated by lncRNAs in different cancer types. LncRNA perturbators and lncRNA-perturbated mRNAs showed significantly higher evolutionary conservation than other lncRNAs and mRNAs. Importantly, the lncRNA-perturbated triplets exhibited high cancer specificity. The pan-cancer perturbator OIP5-AS1 had a higher expression level than the cancer-specific perturbators. These lncRNA perturbators were significantly enriched in known cancer-related pathways. Furthermore, among the 25 lncRNAs in the 109 triplets, lncRNA SNHG7 was identified as a stable potential biomarker in lung adenocarcinoma (LUAD) by combining the TCGA dataset and two independent GEO datasets. Results from cell transfection also indicated that overexpression of lncRNAs SNHG7 and TUG1 enhanced the expression of the corresponding mRNAs PNMA2 and CDC7 in LUAD. Our study provides a systematic dissection of lncRNA-perturbated triplets and facilitates our understanding of the molecular roles of lncRNAs in cancers.


2019 ◽  
pp. 1-15
Author(s):  
Karen A. Cadoo ◽  
Diana L. Mandelker ◽  
Semanti Mukherjee ◽  
Carolyn Stewart ◽  
Deborah DeLair ◽  
...  

PURPOSE Mutations in DNA mismatch repair genes and PTEN, diagnostic of Lynch and Cowden syndromes, respectively, represent the only established inherited predisposition genes in endometrial cancer to date. The prevalence of other cancer predisposition genes remains unclear. We determined the prevalence of pathogenic germline variants in unselected patients with endometrial cancer scheduled for surgical consultation. PATIENTS AND METHODS Patients prospectively consented (April 2016 to May 2017) to an institutional review board–approved protocol of tumor-normal sequencing via a custom next-generation sequencing panel—the Memorial Sloan Kettering–Integrated Mutation Profiling of Actionable Cancer Targets—that yielded germline results for more than 75 cancer predisposition genes. Tumors were assessed for microsatellite instability. Per institutional standards, all tumors underwent Lynch syndrome screening via immunohistochemistry (IHC) for mismatch repair proteins. RESULTS Of 156 patients who consented to germline genetic testing, 118 (76%) had stage I disease. In 104 patients (67%), tumors were endometrioid, and 60 (58%) of those tumors were grade 1. Twenty-four pathogenic germline variants were identified in 22 patients (14%): seven (4.5%) had highly penetrant cancer syndromes and 15 (9.6%) had variants in low-penetrance, moderate-penetrance, or recessive genes. Of these, five (21%) were in Lynch syndrome genes (two MSH6, two PMS2, and one MLH1). All five tumors had concordant IHC staining; two (40%) were definitively microsatellite instability–high by next-generation sequencing. One patient had a known BRCA1 mutation, and one had an SMARCA4 deletion. The remaining 17 variants (71%) were incremental findings in low- and moderate-penetrance variants or genes associated with recessive disease. 
CONCLUSION In unselected patients with predominantly low-risk, early-stage endometrial cancer, germline multigene panel testing identified cancer predisposition gene variants in 14%. This finding may have implications for future cancer screening and risk-reduction recommendations. Universal IHC screening for Lynch syndrome successfully identifies the majority (71%) of high-penetrance germline mutations.

