Robust Cancer Mutation Detection with Deep Learning Models Derived from Tumor-Normal Sequencing Data

Mapping Intimacies ◽

10.1101/667261 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sayed Mohammad Ebrahim Sahraeian ◽

Li Tai Fang ◽

Marghoob Mohiyuddin ◽

Huixiao Hong ◽

Wenming Xiao

Keyword(s):

Deep Learning ◽

Somatic Mutations ◽

Mutation Detection ◽

Sequencing Data ◽

Target Sequencing ◽

Sequencing Technologies ◽

Cancer Mutation ◽

Detection Approach ◽

Genomic Regions ◽

Reference Samples

Abstract. Accurate detection of somatic mutations is challenging but critical to the understanding of cancer formation, progression, and treatment. We recently proposed NeuSomatic, the first deep convolutional neural network-based somatic mutation detection approach, and demonstrated performance advantages on in silico data. In this study, we used the first comprehensive and well-characterized somatic reference samples from the SEQC-II consortium to investigate best practices for utilizing a deep learning framework in cancer mutation detection. Using the high-confidence somatic mutations established for these reference samples by the consortium, we identified strategies for building robust models on multiple datasets derived from samples representing real scenarios. The proposed strategies achieved high robustness across multiple sequencing technologies, such as WGS, WES, and AmpliSeq target sequencing, for fresh and FFPE DNA input, varying tumor/normal purities, and different coverages (ranging from 10× to 2000×). NeuSomatic significantly outperformed conventional detection approaches in general, as well as in challenging situations such as low coverage, low mutation frequency, DNA damage, and difficult genomic regions.


Achieving robust somatic mutation detection with deep learning models derived from reference data sets of a cancer sample

Genome Biology ◽

10.1186/s13059-021-02592-9 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Sayed Mohammad Ebrahim Sahraeian ◽

Li Tai Fang ◽

Konstantinos Karagiannis ◽

Malcolm Moos ◽

Sean Smith ◽

...

Keyword(s):

Deep Learning ◽

Somatic Mutation ◽

Somatic Mutations ◽

Mutation Detection ◽

Reference Data ◽

Cancer Cell Line ◽

Data Sets ◽

Sequencing Technologies ◽

Multiple Data Sets ◽

Somatic Mutation Detection

Abstract. Background: Accurate detection of somatic mutations is challenging but critical in understanding cancer formation, progression, and treatment. We recently proposed NeuSomatic, the first deep convolutional neural network-based somatic mutation detection approach, and demonstrated performance advantages on in silico data. Results: In this study, we use the first comprehensive and well-characterized somatic reference data sets from the SEQC2 consortium to investigate best practices for using a deep learning framework in cancer mutation detection. Using the high-confidence somatic mutations established for a cancer cell line by the consortium, we identify the best strategy for building robust models on multiple data sets derived from samples representing real scenarios; for example, a model trained on a combination of real and spike-in mutations had the highest average performance. Conclusions: The strategy identified in our study achieved high robustness across multiple sequencing technologies for fresh and FFPE DNA input, varying tumor/normal purities, and different coverages, with significant superiority over conventional detection approaches in general, as well as in challenging situations such as low coverage, low variant allele frequency, DNA damage, and difficult genomic regions.


LRSDAY: Long-read Sequencing Data Analysis for Yeasts

10.1101/184572 ◽

2017 ◽

Author(s):

Jia-Xing Yue ◽

Gianni Liti

Keyword(s):

Genome Assembly ◽

Model Organism ◽

Sequencing Data ◽

Protein Coding ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Downstream Analysis ◽

Eukaryotic Organisms ◽

Genomic Regions

Abstract. Long-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, Saccharomyces cerevisiae, has many isolates currently being sequenced with long reads. However, analyzing long-read sequencing data to produce high-quality genome assembly and annotation remains challenging. Here we present LRSDAY, the first one-stop solution to streamline this process. LRSDAY can produce chromosome-level end-to-end genome assembly and comprehensive annotations for various genomic features (including centromeres, protein-coding genes, tRNAs, transposable elements and telomere-associated elements) that are ready for downstream analysis. Although tailored for S. cerevisiae, we designed LRSDAY to be highly modular and customizable, making it adaptable for virtually any eukaryotic organism. Applying LRSDAY to an S. cerevisiae strain takes ∼43 hrs to generate a complete and well-annotated genome from ∼100X Pacific Biosciences (PacBio) reads using four threads.


Deep learning discerns cancer mutation exclusivity

10.1101/2020.04.09.022731 ◽

2020 ◽

Author(s):

Prashant Gupta ◽

Aashi Jindal ◽

Jayadeva ◽

Debarka Sengupta

Keyword(s):

Deep Learning ◽

State Of The Art ◽

Clinical Settings ◽

Deleterious Mutations ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Exome Sequencing Data ◽

Cancer Mutation ◽

Cancer Mutations

Abstract. The exclusivity of a vast majority of cancer mutations remains poorly understood, despite the availability of large amounts of whole genome and exome sequencing data. In clinical settings, this markedly hinders the identification of the previously uncharacterized deleterious mutations due to the unavailability of matched normal samples. We employed state-of-the-art deep learning algorithms for cross-exome learning of mutational embeddings and demonstrated their utility in sequence-based detection of cancer-specific Single Nucleotide Variants (SNVs).


Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing

Nature Biotechnology ◽

10.1038/s41587-021-00993-6 ◽

2021 ◽

Vol 39 (9) ◽

pp. 1151-1160

Author(s):

Li Tai Fang ◽

Bin Zhu ◽

Yongmei Zhao ◽

Wanqiu Chen ◽

Zhaowei Yang ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Mutation Detection ◽

Whole Genome ◽

Cancer Mutation ◽

Reference Samples


Deep learning for cancer type classification and driver gene identification

BMC Bioinformatics ◽

10.1186/s12859-021-04400-4 ◽

2021 ◽

Vol 22 (S4) ◽

Author(s):

Zexian Zeng ◽

Chengsheng Mao ◽

Andy Vo ◽

Xiaoyu Li ◽

Janna Ore Nugent ◽

...

Keyword(s):

Breast Cancer ◽

Deep Learning ◽

Somatic Mutations ◽

Disease Classification ◽

Driver Gene ◽

Cancer Type ◽

Sequencing Data ◽

Germline Variants ◽

Insertions And Deletions ◽

Novel Method

Abstract. Background: Genetic information is becoming more readily available and is increasingly being used to predict patient cancer types as well as their subtypes. Most classification methods thus far utilize somatic mutations as independent features for classification and are limited by study power. We aim to develop a novel method to effectively explore the landscape of genetic variants, including germline variants, and small insertions and deletions for cancer type prediction. Results: We proposed DeepCues, a deep learning model that utilizes convolutional neural networks to unbiasedly derive features from raw cancer DNA sequencing data for disease classification and relevant gene discovery. Using raw whole-exome sequencing as features, germline variants and somatic mutations, including insertions and deletions, were interactively amalgamated for feature generation and cancer prediction. We applied DeepCues to a dataset from TCGA to classify seven different types of major cancers and obtained an overall accuracy of 77.6%. We compared DeepCues to conventional methods and demonstrated a significant overall improvement (p < 0.001). Strikingly, the top 20 breast cancer relevant genes we identified using DeepCues had a 40% overlap with the top 20 known breast cancer driver genes. Conclusion: Our results support DeepCues as a novel method to improve the representational resolution of DNA sequencing and its power in deriving features from raw sequences for cancer type prediction, as well as discovering new cancer relevant genes.
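As a rough illustration of the overlap figure quoted in this abstract, a 40% overlap between two top-20 lists corresponds to 8 shared genes. The sketch below is not taken from the DeepCues paper; the function name `top_k_overlap` and the placeholder gene names are assumptions made purely for illustration.

```python
from typing import Sequence


def top_k_overlap(ranked_a: Sequence[str], ranked_b: Sequence[str], k: int = 20) -> float:
    """Fraction of the top-k entries of one ranking that also appear in the top-k of the other."""
    if k <= 0:
        raise ValueError("k must be positive")
    shared = set(ranked_a[:k]) & set(ranked_b[:k])
    return len(shared) / k


# Placeholder gene names (hypothetical, not real results).
model_genes = [f"GENE{i}" for i in range(1, 21)]     # GENE1 .. GENE20
known_drivers = [f"GENE{i}" for i in range(13, 33)]  # GENE13 .. GENE32

# The two lists share GENE13 .. GENE20, i.e. 8 of 20 genes.
overlap = top_k_overlap(model_genes, known_drivers)
```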


SVCurator: A Crowdsourcing app to visualize evidence of structural variants for the human genome

10.1101/581264 ◽

2019 ◽

Cited By ~ 3

Author(s):

Lesley M Chapman ◽

Noah Spies ◽

Patrick Pai ◽

Chun Shen Lim ◽

Andrew Carroll ◽

...

Keyword(s):

Human Genome ◽

Reference Genome ◽

Ashkenazi Jewish ◽

Structural Variants ◽

Sequencing Data ◽

Sequencing Technologies ◽

Size Accuracy ◽

Large Indels ◽

Web Platform ◽

Reference Samples

Abstract. A high-quality benchmark for small variants encompassing 88 to 90% of the reference genome has been developed for seven Genome in a Bottle (GIAB) reference samples. However, a reliable benchmark for large indels and structural variants (SVs) is yet to be defined. In this study, we manually curated 1235 SVs, which can ultimately be used to evaluate SV callers or train machine learning models. We developed a crowdsourcing app, SVCurator, to help curators manually review large indels and SVs within the human genome and report their genotype and size accuracy. SVCurator is a Python Flask-based web platform that displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002]. We asked curators to assign labels describing SV type (deletion or insertion), size accuracy, and genotype for 1235 putative insertions and deletions sampled from different size bins between 20 and 892,149 bp. The crowdsourced results were highly concordant, with 37 of the 61 curators having at least 78% concordance with a set of ‘expert’ curators, where there was 93% concordance amongst ‘expert’ curators. This produced high confidence labels for 935 events. When compared to the heuristic-based draft benchmark SV callset from GIAB, the SVCurator crowdsourced labels were 94.5% concordant with the benchmark set. We found that curators can successfully evaluate putative SVs when given evidence from multiple sequencing technologies.
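The concordance figures in this abstract can be read as simple per-event agreement rates between two sets of labels. The sketch below shows one way such a rate might be computed; it is an assumption about the metric, not the SVCurator implementation, and the function name and example labels are hypothetical.

```python
from typing import Sequence


def concordance(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of events on which two label lists agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must cover the same events")
    if not labels_a:
        raise ValueError("no events to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical SV-type labels for five putative events (not data from the study).
curator = ["DEL", "INS", "DEL", "DEL", "INS"]
expert = ["DEL", "INS", "INS", "DEL", "INS"]

# The lists disagree only on the third event, so 4 of 5 labels match.
agreement = concordance(curator, expert)
```

Under this reading, a curator with `agreement >= 0.78` would meet the 78% threshold mentioned in the abstract.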


Adaptive Somatic Mutations Calls with Deep Learning and Semi-Simulated Data

10.1101/079087 ◽

2016 ◽

Cited By ~ 6

Author(s):

Remi Torracinta ◽

Laurent Mesnard ◽

Susan Levine ◽

Rita Shaknovich ◽

Maureen Hanson ◽

...

Keyword(s):

Deep Learning ◽

High Throughput Sequencing ◽

Probabilistic Models ◽

Somatic Mutations ◽

Simulated Data ◽

Good Representation ◽

RNA-Seq ◽

Sequencing Data ◽

Somatic Variation ◽

Feed Forward Neural Network

Abstract. A number of approaches have been developed to call somatic variation in high-throughput sequencing data. Here, we present an adaptive approach to calling somatic variations. Our approach trains a deep feed-forward neural network with semi-simulated data. Semi-simulated datasets are constructed by planting somatic mutations in real datasets where no mutations are expected. Using semi-simulated data makes it possible to train the models with millions of training examples, a usual requirement for successfully training deep learning models. We initially focus on calling variations in RNA-Seq data. We derive semi-simulated datasets from real RNA-Seq data, which offer a good representation of the data the models will be applied to. We test the models on independent semi-simulated data as well as pure simulations. On independent semi-simulated data, models achieve an AUC of 0.973. When tested on semi-simulated exome DNA datasets, we find that the models trained on RNA-Seq data remain predictive (sensitivity 0.4 and specificity 0.9 at a cutoff of P >= 0.9), albeit with lower overall performance (AUC = 0.737). Interestingly, while the models generalize across assays, training on RNA-Seq data lowers the confidence for a group of mutations. Haloplex exome-specific training was also performed, demonstrating that the approach can produce probabilistic models tuned for specific assays and protocols. We found that the method adapts to the characteristics of the experimental protocol. We further illustrate these points by training a model for a trio somatic experimental design when germline DNA of both parents is available in addition to data about the individual. These models are distributed with Goby (http://goby.campagnelab.org).
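The two ideas in this abstract, planting mutations into real data and reporting sensitivity and specificity at a probability cutoff, can be sketched as follows. This is a minimal illustration under stated assumptions, not the Goby implementation: a site is modeled as a plain dictionary of per-base read counts, and the names `plant_mutation` and `sens_spec` are hypothetical.

```python
import random
from typing import Dict, List, Tuple

BASES = "ACGT"


def plant_mutation(counts: Dict[str, int], ref: str, alt_fraction: float,
                   rng: random.Random) -> Dict[str, int]:
    """Return a copy of per-base read counts with some reference reads switched to one alt base."""
    alt = rng.choice([b for b in BASES if b != ref])
    ref_reads = counts.get(ref, 0)
    # Each reference read is switched independently with probability alt_fraction.
    moved = sum(1 for _ in range(ref_reads) if rng.random() < alt_fraction)
    planted = dict(counts)
    planted[ref] = ref_reads - moved
    planted[alt] = planted.get(alt, 0) + moved
    return planted


def sens_spec(probs: List[float], labels: List[int], cutoff: float) -> Tuple[float, float]:
    """Sensitivity and specificity when calling a mutation wherever prob >= cutoff."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= cutoff)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < cutoff)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < cutoff)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= cutoff)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical site with 40 reference reads and no variation.
rng = random.Random(0)
site = {"A": 40, "C": 0, "G": 0, "T": 0}
planted = plant_mutation(site, ref="A", alt_fraction=0.3, rng=rng)

# Hypothetical model outputs: three planted sites (label 1), two untouched sites (label 0).
probs = [0.95, 0.20, 0.91, 0.50, 0.10]
labels = [1, 1, 1, 0, 0]
sensitivity, specificity = sens_spec(probs, labels, cutoff=0.9)
```

Planting keeps the total read depth unchanged, so the only difference between the original and the planted site is the allele balance.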


Deep learning for cancer type classification

10.1101/612762 ◽

2019 ◽

Cited By ~ 1

Author(s):

Zexian Zeng ◽

Chengsheng Mao ◽

Andy Vo ◽

Janna Ore Nugent ◽

Seema A Khan ◽

...

Keyword(s):

Breast Cancer ◽

Deep Learning ◽

Somatic Mutations ◽

Disease Classification ◽

Cancer Type ◽

Cancer Genes ◽

Sequencing Data ◽

Germline Variants ◽

Cancer Types ◽

Independent Features

Abstract. Genetic information is becoming more readily available and is increasingly being used to predict patient cancer types as well as their subtypes. Most classification methods thus far utilize somatic mutations as independent features for classification and are limited by study power. To address these limitations, we propose DeepCues, a deep learning model that utilizes convolutional neural networks to derive features from DNA sequencing data for disease classification and relevant gene discovery. Using whole-exome sequencing, germline variants and somatic mutations, including insertions and deletions, are interactively amalgamated as features. In this study, we applied DeepCues to a dataset from TCGA to classify seven different types of major cancers and obtained an overall accuracy of 77.6%. We compared DeepCues to conventional methods and demonstrated a significant overall improvement (p=8.8E-25). Using DeepCues, we found that the top 20 genes associated with breast cancer have a 40% overlap with the top 20 breast cancer genes in the COSMIC database. These data support DeepCues as a novel method to improve the representational resolution of both germline variants and somatic mutations interactively and their power in predicting cancer types, as well as the genes involved in each cancer.


Characterizing Promoter and Enhancer Sequences by a Deep Learning Method

Frontiers in Genetics ◽

10.3389/fgene.2021.681259 ◽

2021 ◽

Vol 12 ◽

Author(s):

Xin Zeng ◽

Sung-Joon Park ◽

Kenta Nakai

Keyword(s):

Deep Learning ◽

High Throughput Sequencing ◽

RNA Stability ◽

Predictive Performance ◽

Regulatory Elements ◽

Critical Determinant ◽

Transcription Start Sites ◽

Sequencing Technologies ◽

Genomic Regions ◽

Order Sequence

Promoters and enhancers are well-known regulatory elements modulating gene expression. As confirmed by high-throughput sequencing technologies, these regulatory elements are bidirectionally transcribed. That is, promoters produce stable mRNA in the sense direction and unstable RNA in the antisense direction, while enhancers transcribe unstable RNA in both directions. Although it is thought that enhancers and promoters share a similar architecture of transcription start sites (TSSs), how the transcriptional machinery distinctly uses these genomic regions as promoters or enhancers remains unclear. To address this issue, we developed a deep learning (DL) method by utilizing a convolutional neural network (CNN) and the saliency algorithm. In comparison with other classifiers, our CNN presented higher predictive performance, suggesting the overarching importance of the high-order sequence features, captured by the CNN. Moreover, our method revealed that there are substantial sequence differences between the enhancers and promoters. Remarkably, the 20–120 bp downstream regions from the center of bidirectional TSSs seemed to contribute to the RNA stability. These regions in promoters tend to have a larger number of guanines and cytosines compared to those in enhancers, and this feature contributed to the classification of the regulatory elements. Our CNN-based method can capture the complex TSS architectures. We found that the genomic regions around TSSs for promoters and enhancers contribute to RNA stability and show GC-biased characteristics as a critical determinant for promoter TSSs.
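The GC-bias observation above concerns the region 20 to 120 bp downstream of the TSS center. A minimal sketch of how the GC fraction of such a window might be measured is given below; the helper names, the sense-strand coordinate convention, and the toy sequence are assumptions for illustration, not the authors' pipeline.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return sum(1 for base in seq if base in "GC") / len(seq)


def downstream_window(seq: str, center: int, start: int = 20, end: int = 120) -> str:
    """Slice covering offsets [start, end) downstream of a TSS center on the sense strand."""
    lo, hi = center + start, center + end
    if lo < 0 or hi > len(seq):
        raise ValueError("window falls outside the sequence")
    return seq[lo:hi]


# Toy sequence (hypothetical): 20 bp AT-rich, 100 bp GC-only, 20 bp AT-rich.
toy = "AT" * 10 + "GC" * 50 + "AT" * 10

# With the TSS center at position 0, the window is exactly the GC-only stretch.
window_gc = gc_fraction(downstream_window(toy, center=0))
```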


Comprehensive identification of transposable element insertions using multiple sequencing technologies

Nature Communications ◽

10.1038/s41467-021-24041-8 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Chong Chu ◽

Rebeca Borges-Monroy ◽

Vinayak V. Viswanadham ◽

Soohyun Lee ◽

Heng Li ◽

...

Keyword(s):

Transposable Element ◽

Structure And Function ◽

Endogenous Retroviruses ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Short Read ◽

Sequencing Technologies ◽

Long Read ◽

And Function

Abstract. Transposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea.
