scholarly journals Robust Cancer Mutation Detection with Deep Learning Models Derived from Tumor-Normal Sequencing Data

2019 ◽  
Author(s):  
Sayed Mohammad Ebrahim Sahraeian ◽  
Li Tai Fang ◽  
Marghoob Mohiyuddin ◽  
Huixiao Hong ◽  
Wenming Xiao

AbstractAccurate detection of somatic mutations is challenging but critical to the understanding of cancer formation, progression, and treatment. We recently proposed NeuSomatic, the first deep convolutional neural network based somatic mutation detection approach and demonstrated performance advantages on in silico data. In this study, we used the first comprehensive and well-characterized somatic reference samples from the SEQC-II consortium to investigate best practices for utilizing deep learning framework in cancer mutation detection. Using the high-confidence somatic mutations established for these reference samples by the consortium, we identified strategies for building robust models on multiple datasets derived from samples representing real scenarios. The proposed strategies achieved high robustness across multiple sequencing technologies such as WGS, WES, AmpliSeq target sequencing for fresh and FFPE DNA input, varying tumor/normal purities, and different coverages (ranging from 10× - 2000×). NeuSomatic significantly outperformed conventional detection approaches in general, as well as in challenging situations such as low coverage, low mutation frequency, DNA damage, and difficult genomic regions.

2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Sayed Mohammad Ebrahim Sahraeian ◽  
Li Tai Fang ◽  
Konstantinos Karagiannis ◽  
Malcolm Moos ◽  
Sean Smith ◽  
...  

Abstract Background Accurate detection of somatic mutations is challenging but critical in understanding cancer formation, progression, and treatment. We recently proposed NeuSomatic, the first deep convolutional neural network-based somatic mutation detection approach, and demonstrated performance advantages on in silico data. Results In this study, we use the first comprehensive and well-characterized somatic reference data sets from the SEQC2 consortium to investigate best practices for using a deep learning framework in cancer mutation detection. Using the high-confidence somatic mutations established for a cancer cell line by the consortium, we identify the best strategy for building robust models on multiple data sets derived from samples representing real scenarios, for example, a model trained on a combination of real and spike-in mutations had the highest average performance. Conclusions The strategy identified in our study achieved high robustness across multiple sequencing technologies for fresh and FFPE DNA input, varying tumor/normal purities, and different coverages, with significant superiority over conventional detection approaches in general, as well as in challenging situations such as low coverage, low variant allele frequency, DNA damage, and difficult genomic regions


2017 ◽  
Author(s):  
Jia-Xing Yue ◽  
Gianni Liti

AbstractLong-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, Saccharomyces cerevisiae, has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although tailored for S. cerevisiae, we designed LRSDAY to be highly modular and customizable, making it adaptable for virtually any eukaryotic organisms. Applying LRSDAY to a S. cerevisiae strain takes ∼43 hrs to generate a complete and well-annotated genome from ∼100X Pacific Biosciences (PacBio) reads using four threads.


2020 ◽  
Author(s):  
Prashant Gupta ◽  
Aashi Jindal ◽  
Jayadeva ◽  
Debarka Sengupta

ABSTRACTThe exclusivity of a vast majority of cancer mutations remains poorly understood, despite the availability of large amounts of whole genome and exome sequencing data. In clinical settings, this markedly hinders the identification of the previously uncharacterized deleterious mutations due to the unavailability of matched normal samples. We employed state of the art deep learning algorithms for cross-exome learning of mutational embeddings and demonstrated their utility in sequence based detection of cancer-specific Single Nucleotide Variants (SNVs).


2021 ◽  
Vol 39 (9) ◽  
pp. 1151-1160
Author(s):  
Li Tai Fang ◽  
Bin Zhu ◽  
Yongmei Zhao ◽  
Wanqiu Chen ◽  
Zhaowei Yang ◽  
...  

2021 ◽  
Vol 22 (S4) ◽  
Author(s):  
Zexian Zeng ◽  
Chengsheng Mao ◽  
Andy Vo ◽  
Xiaoyu Li ◽  
Janna Ore Nugent ◽  
...  

Abstract Background Genetic information is becoming more readily available and is increasingly being used to predict patient cancer types as well as their subtypes. Most classification methods thus far utilize somatic mutations as independent features for classification and are limited by study power. We aim to develop a novel method to effectively explore the landscape of genetic variants, including germline variants, and small insertions and deletions for cancer type prediction. Results We proposed DeepCues, a deep learning model that utilizes convolutional neural networks to unbiasedly derive features from raw cancer DNA sequencing data for disease classification and relevant gene discovery. Using raw whole-exome sequencing as features, germline variants and somatic mutations, including insertions and deletions, were interactively amalgamated for feature generation and cancer prediction. We applied DeepCues to a dataset from TCGA to classify seven different types of major cancers and obtained an overall accuracy of 77.6%. We compared DeepCues to conventional methods and demonstrated a significant overall improvement (p < 0.001). Strikingly, using DeepCues, the top 20 breast cancer relevant genes we have identified, had a 40% overlap with the top 20 known breast cancer driver genes. Conclusion Our results support DeepCues as a novel method to improve the representational resolution of DNA sequencings and its power in deriving features from raw sequences for cancer type prediction, as well as discovering new cancer relevant genes.


2019 ◽  
Author(s):  
Lesley M Chapman ◽  
Noah Spies ◽  
Patrick Pai ◽  
Chun Shen Lim ◽  
Andrew Carroll ◽  
...  

AbstractA high quality benchmark for small variants encompassing 88 to 90% of the reference genome has been developed for seven Genome in a Bottle (GIAB) reference samples. However a reliable benchmark for large indels and structural variants (SVs) is yet to be defined. In this study, we manually curated 1235 SVs which can ultimately be used to evaluate SV callers or train machine learning models. We developed a crowdsourcing app – SVCurator – to help curators manually review large indels and SVs within the human genome, and report their genotype and size accuracy.SVCurator is a Python Flask-based web platform that displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002], We asked curators to assign labels describing SV type (deletion or insertion), size accuracy, and genotype for 1235 putative insertions and deletions sampled from different size bins between 20 and 892,149 bp. The crowdsourced results were highly concordant with 37 out of the 61 curators having at least 78% concordance with a set of ‘expert’ curators, where there was 93% concordance amongst ‘expert’ curators. This produced high confidence labels for 935 events. When compared to the heuristic-based draft benchmark SV callset from GIAB, the SVCurator crowdsourced labels were 94.5% concordant with the benchmark set. We found that curators can successfully evaluate putative SVs when given evidence from multiple sequencing technologies.


2016 ◽  
Author(s):  
Remi Torracinta ◽  
Laurent Mesnard ◽  
Susan Levine ◽  
Rita Shaknovich ◽  
Maureen Hanson ◽  
...  

ABSTRACTA number of approaches have been developed to call somatic variation in high-throughput sequencing data. Here, we present an adaptive approach to calling somatic variations. Our approach trains a deep feed-forward neural network with semi-simulated data. Semi-simulated datasets are constructed by planting somatic mutations in real datasets where no mutations are expected. Using semi-simulated data makes it possible to train the models with millions of training examples, a usual requirement for successfully training deep learning models. We initially focus on calling variations in RNA-Seq data. We derive semi-simulated datasets from real RNA-Seq data, which offer a good representation of the data the models will be applied to. We test the models on independent semi-simulated data as well as pure simulations. On independent semi-simulated data, models achieve an AUC of 0.973. When tested on semi-simulated exome DNA datasets, we find that the models trained on RNA-Seq data remain predictive (sens 0.4 & spec 0.9 at cutoff of P > = 0.9), albeit with lower overall performance (AUC=0.737). Interestingly, while the models generalize across assay, training on RNA-Seq data lowers the confidence for a group of mutations. Haloplex exome specific training was also performed, demonstrating that the approach can produce probabilistic models tuned for specific assays and protocols. We found that the method adapts to the characteristics of experimental protocol. We further illustrate these points by training a model for a trio somatic experimental design when germline DNA of both parents is available in addition to data about the individual. These models are distributed with Goby (http://goby.campagnelab.org).


2019 ◽  
Author(s):  
Zexian Zeng ◽  
Chengsheng Mao ◽  
Andy Vo ◽  
Janna Ore Nugent ◽  
Seema A Khan ◽  
...  

ABSTRACTGenetic information is becoming more readily available and is increasingly being used to predict patient cancer types as well as their subtypes. Most classification methods thus far utilize somatic mutations as independent features for classification and are limited by study power. To address these limitations, we propose DeepCues, a deep learning model that utilizes convolutional neural networks to derive features from DNA sequencing data for disease classification and relevant gene discovery. Using whole-exome sequencing, germline variants and somatic mutations, including insertions and deletions, are interactively amalgamated as features. In this study, we applied DeepCues to a dataset from TCGA to classify seven different types of major cancers and obtained an overall accuracy of 77.6%. We compared DeepCues to conventional methods and demonstrated a significant overall improvement (p=8.8E-25). Using DeepCues, we found that the top 20 genes associated with breast cancer have a 40% overlap with the top 20 breast cancer genes in the COSMIC database. These data support DeepCues as a novel method to improve the representational resolution of both germline variants and somatic mutations interactively and their power in predicting cancer types, as well the genes involved in each cancer.


2021 ◽  
Vol 12 ◽  
Author(s):  
Xin Zeng ◽  
Sung-Joon Park ◽  
Kenta Nakai

Promoters and enhancers are well-known regulatory elements modulating gene expression. As confirmed by high-throughput sequencing technologies, these regulatory elements are bidirectionally transcribed. That is, promoters produce stable mRNA in the sense direction and unstable RNA in the antisense direction, while enhancers transcribe unstable RNA in both directions. Although it is thought that enhancers and promoters share a similar architecture of transcription start sites (TSSs), how the transcriptional machinery distinctly uses these genomic regions as promoters or enhancers remains unclear. To address this issue, we developed a deep learning (DL) method by utilizing a convolutional neural network (CNN) and the saliency algorithm. In comparison with other classifiers, our CNN presented higher predictive performance, suggesting the overarching importance of the high-order sequence features, captured by the CNN. Moreover, our method revealed that there are substantial sequence differences between the enhancers and promoters. Remarkably, the 20–120 bp downstream regions from the center of bidirectional TSSs seemed to contribute to the RNA stability. These regions in promoters tend to have a larger number of guanines and cytosines compared to those in enhancers, and this feature contributed to the classification of the regulatory elements. Our CNN-based method can capture the complex TSS architectures. We found that the genomic regions around TSSs for promoters and enhancers contribute to RNA stability and show GC-biased characteristics as a critical determinant for promoter TSSs.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Chong Chu ◽  
Rebeca Borges-Monroy ◽  
Vinayak V. Viswanadham ◽  
Soohyun Lee ◽  
Heng Li ◽  
...  

AbstractTransposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea.


Sign in / Sign up

Export Citation Format

Share Document