Adaptive Somatic Mutations Calls with Deep Learning and Semi-Simulated Data

2016 ◽  
Author(s):  
Remi Torracinta ◽  
Laurent Mesnard ◽  
Susan Levine ◽  
Rita Shaknovich ◽  
Maureen Hanson ◽  
...  

ABSTRACT A number of approaches have been developed to call somatic variation in high-throughput sequencing data. Here, we present an adaptive approach to calling somatic variations. Our approach trains a deep feed-forward neural network with semi-simulated data. Semi-simulated datasets are constructed by planting somatic mutations in real datasets where no mutations are expected. Using semi-simulated data makes it possible to train the models with millions of training examples, a usual requirement for successfully training deep learning models. We initially focus on calling variations in RNA-Seq data. We derive semi-simulated datasets from real RNA-Seq data, which offer a good representation of the data the models will be applied to. We test the models on independent semi-simulated data as well as pure simulations. On independent semi-simulated data, models achieve an AUC of 0.973. When tested on semi-simulated exome DNA datasets, we find that the models trained on RNA-Seq data remain predictive (sensitivity 0.4 and specificity 0.9 at a cutoff of P ≥ 0.9), albeit with lower overall performance (AUC = 0.737). Interestingly, while the models generalize across assays, training on RNA-Seq data lowers the confidence for a group of mutations. Haloplex exome-specific training was also performed, demonstrating that the approach can produce probabilistic models tuned for specific assays and protocols. We found that the method adapts to the characteristics of the experimental protocol. We further illustrate these points by training a model for a trio somatic experimental design when germline DNA of both parents is available in addition to data about the individual. These models are distributed with Goby (http://goby.campagnelab.org).
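The idea of planting mutations in data where none are expected can be illustrated with a minimal sketch. It assumes each genomic site is summarized as per-base read counts; the function names, the allele-fraction range, and the planting rate are hypothetical choices for illustration and are not taken from Goby or the paper.

```python
import random

BASES = ("A", "C", "G", "T")

def plant_mutation(counts, ref_base, alt_base, allele_fraction, rng):
    """Return a copy of counts with some reference reads reassigned to alt_base."""
    planted = dict(counts)
    # Each reference read is independently switched to the alternate allele.
    moved = sum(1 for _ in range(counts[ref_base]) if rng.random() < allele_fraction)
    planted[ref_base] -= moved
    planted[alt_base] = planted.get(alt_base, 0) + moved
    return planted

def make_examples(sites, plant_rate, rng):
    """Build (counts, label) pairs; label 1 marks a site with a planted mutation.

    sites is a list of (ref_base, counts) taken from a sample where no
    somatic mutation is expected (an assumption of this sketch).
    """
    examples = []
    for ref_base, counts in sites:
        if rng.random() < plant_rate:
            alt_base = rng.choice([b for b in BASES if b != ref_base])
            fraction = rng.uniform(0.1, 0.5)  # hypothetical allele-fraction range
            examples.append((plant_mutation(counts, ref_base, alt_base, fraction, rng), 1))
        else:
            examples.append((dict(counts), 0))
    return examples
```

Because the labels are known by construction, a procedure of this kind can produce as many labeled examples as there are sites in the real data, which is what makes training with millions of examples feasible.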

2016 ◽  
Author(s):  
Fabien Campagne

ABSTRACT In http://dx.doi.org/10.1101/079087, we presented adaptive models for calling somatic mutations in high-throughput sequencing data. These models were developed by training deep neural networks with semi-simulated data. In this continuation, I evaluate how well such models can predict known somatic mutations in a real dataset. To address this question, I tested the approach using samples from the International Cancer Genome Consortium (ICGC) and the previously published ground-truth mutations (GoldSet). This evaluation revealed that training models with semi-simulation does produce models that exhibit strong performance in real datasets. I found a linear relationship between the performance observed on a semi-simulated validation set and on the independent ground truth in the gold set (R² = 0.952, P < 2e-16). I also found that semi-simulation can be used to pre-train models before continuing training with true labels, and that this pre-training improves model performance substantially on the real dataset compared to training models only with the real dataset. The best model pre-trained with semi-simulation achieved an AUC of 0.969 [0.957-0.982] (95% confidence interval) compared to 0.911 [0.890-0.932] when training with real labels only. These data demonstrate that semi-simulation can be a very effective approach to training probabilistic models for filtering and ranking.
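For readers unfamiliar with the AUC figures quoted above, the following sketch computes AUC from model scores and binary labels using its pairwise definition: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The function name and inputs are illustrative assumptions, and the quadratic loop is chosen for clarity, not speed.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties contribute half a win
    return wins / (len(positives) * len(negatives))
```

A value of 1.0 means every true mutation is ranked above every non-mutation, and 0.5 corresponds to random ranking.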


2015 ◽  
Vol 2015 ◽  
pp. 1-5 ◽  
Author(s):  
Yuxiang Tan ◽  
Yann Tambouret ◽  
Stefano Monti

The performance evaluation of fusion detection algorithms from high-throughput sequencing data crucially relies on the availability of data with known positive and negative cases of gene rearrangements. The use of simulated data circumvents some shortcomings of real data by generation of an unlimited number of true and false positive events, and the consequent robust estimation of accuracy measures, such as precision and recall. Although a few simulated fusion datasets from RNA Sequencing (RNA-Seq) are available, they are of limited sample size. This makes it difficult to systematically evaluate the performance of RNA-Seq based fusion-detection algorithms. Here, we present SimFuse to address this problem. SimFuse utilizes real sequencing data as the fusions’ background to closely approximate the distribution of reads from a real sequencing library and uses a reference genome as the template from which to simulate fusions’ supporting reads. To assess the supporting read-specific performance, SimFuse generates multiple datasets with various numbers of fusion supporting reads. Compared to an extant simulated dataset, SimFuse gives users control over the supporting read features and the sample size of the simulated library, based on which the performance metrics needed for the validation and comparison of alternative fusion-detection algorithms can be rigorously estimated.


2021 ◽  
Author(s):  
Jiaqi Li ◽  
Lei Wei ◽  
Xianglin Zhang ◽  
Wei Zhang ◽  
Haochen Wang ◽  
...  

ABSTRACT Detecting cancer signals in cell-free DNA (cfDNA) high-throughput sequencing data is emerging as a novel non-invasive cancer detection method. Due to the high cost of sequencing, it is crucial to make robust and precise predictions with low-depth cfDNA sequencing data. Here we propose a novel approach named DISMIR, which can provide ultrasensitive and robust cancer detection by integrating DNA sequence and methylation information in plasma cfDNA whole genome bisulfite sequencing (WGBS) data. DISMIR introduces a new feature termed “switching region” to define cancer-specific differentially methylated regions, which can enrich the cancer-related signal at read resolution. DISMIR applies a deep learning model to predict the source of every single read based on its DNA sequence and methylation state, and then predicts the risk that the plasma donor is suffering from cancer. DISMIR exhibited high accuracy and robustness in hepatocellular carcinoma detection from plasma cfDNA WGBS data, even at ultra-low sequencing depths. Analysis showed that DISMIR tends to be insensitive to alterations of single CpG sites’ methylation states, which suggests that DISMIR could resist the technical noise of WGBS. All these results show that DISMIR has the potential to be a precise and robust method for low-cost early cancer detection.


2014 ◽  
Author(s):  
Simon Anders ◽  
Paul Theodor Pyl ◽  
Wolfgang Huber

Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard work flows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data such as genomic coordinates, sequences, sequencing reads, alignments, gene model information, variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability: HTSeq is released as open-source software under the GNU General Public Licence and available from http://www-huber.embl.de/HTSeq or from the Python Package Index, https://pypi.python.org/pypi/HTSeq
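The counting step described for htseq-count (assigning each read to the gene it overlaps) can be sketched in plain Python. This is not the HTSeq API: the function names, the tuple layout, half-open coordinates, and the rule that a read overlapping several genes is set aside as ambiguous are assumptions made for this illustration.

```python
from collections import Counter

def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end

def count_reads(genes, reads):
    """Count reads per gene.

    genes maps a gene name to (chrom, start, end); reads is a list of
    (chrom, start, end). Both layouts are hypothetical.
    """
    counts = Counter()
    for chrom, start, end in reads:
        hits = {
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and overlaps(start, end, g_start, g_end)
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1          # unambiguous assignment
        elif not hits:
            counts["__no_feature"] += 1      # read overlaps no gene
        else:
            counts["__ambiguous"] += 1       # read overlaps several genes
    return counts
```

The linear scan over all genes keeps the sketch short; a library built for this task would use an indexed structure queried by genomic coordinate, as the abstract describes.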


2015 ◽  
Author(s):  
Rahul Reddy

As RNA-Seq and other high-throughput sequencing methods grow in use and remain critical for gene expression studies, technical variability in count data impedes differential expression studies, comparison of data across samples and experiments, and reproduction of results. Studies like Dillies et al. (2013) compare several between-lane normalization methods involving scaling factors, while Hansen et al. (2012) and Risso et al. (2014) propose methods that correct for sample-specific bias or use sets of control genes to isolate and remove technical variability. This paper evaluates four normalization methods in terms of reducing intra-group, technical variability and facilitating differential expression analysis or other research where the biological, inter-group variability is of interest. To this end, the four methods were evaluated in differential expression analysis between data from Pickrell et al. (2010) and Montgomery et al. (2010) and between simulated data modeled on these two datasets. Though the between-lane scaling factor methods perform worse on real datasets, they are much stronger for simulated data. We cannot reject the recommendation of Dillies et al. to use TMM and DESeq normalization, but further study of the power to detect effects of different sizes under each normalization method is merited.
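As an illustration of what a scaling-factor normalization computes, the sketch below follows the median-of-ratios idea associated with DESeq: each sample's factor is the median, over genes, of the ratio between its count and the gene's geometric mean across samples. It is a simplified pure-Python version written for this note, not the DESeq package's implementation, and the function name and input layout are assumptions.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios scaling factors.

    counts is a list of genes, each a list of per-sample read counts.
    Genes with a zero count in any sample are skipped, since their
    geometric mean is zero.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has nonzero counts in every sample")
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors
```

Dividing each sample's counts by its factor puts the samples on a common scale; for two samples where one was sequenced at exactly twice the depth of the other, the two factors differ by a factor of two.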


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Dat Thanh Nguyen ◽  
Quang Thinh Trac ◽  
Thi-Hau Nguyen ◽  
Ha-Nam Nguyen ◽  
Nir Ohad ◽  
...  

Abstract Background: Circular RNA (circRNA) is an emerging class of RNA molecules attracting researchers due to its potential for serving as markers for diagnosis, prognosis, or therapeutic targets of cancer, cardiovascular, and autoimmune diseases. Current methods for detection of circRNA from RNA sequencing (RNA-seq) focus mostly on improving mapping quality of reads supporting the back-splicing junction (BSJ) of a circRNA to eliminate false positives (FPs). We show that mapping information alone often cannot predict if a BSJ-supporting read is derived from a true circRNA or not, thus increasing the rate of FP circRNAs. Results: We have developed Circall, a novel circRNA detection method from RNA-seq. Circall controls the FPs using a robust multidimensional local false discovery rate method based on the length and expression of circRNAs. It is computationally highly efficient by using a quasi-mapping algorithm for fast and accurate RNA read alignments. We applied Circall to two simulated datasets and three experimental datasets of human cell lines. The results show that Circall achieves high sensitivity and precision in the simulated data. In the experimental datasets it performs well against current leading methods. Circall is also substantially faster than the other methods, particularly for large datasets. Conclusions: With its better performance in circRNA detection and in computational time, Circall facilitates the analysis of circRNAs in large numbers of samples. Circall is implemented in C++ and R, and available for use at https://www.meb.ki.se/sites/biostatwiki/circall and https://github.com/datngu/Circall.


2019 ◽  
Author(s):  
Ayman Yousif ◽  
Nizar Drou ◽  
Jillian Rowe ◽  
Mohammed Khalfan ◽  
Kristin C Gunsalus

Abstract Background: As high-throughput sequencing applications continue to evolve, the rapid growth in quantity and variety of sequence-based data calls for the development of new software libraries and tools for data analysis and visualization. Often, effective use of these tools requires computational skills beyond those of many researchers. To ease this computational barrier, we have created a dynamic web-based platform, NASQAR (Nucleic Acid SeQuence Analysis Resource). Results: NASQAR offers a collection of custom and publicly available open-source web applications that make extensive use of a variety of R packages to provide interactive data analysis and visualization. The platform is publicly accessible at http://nasqar.abudhabi.nyu.edu/. Open-source code is on GitHub at https://github.com/nasqar/NASQAR, and the system is also available as a Docker image at https://hub.docker.com/r/aymanm/nasqarall. NASQAR is a collaboration between the core bioinformatics teams of the NYU Abu Dhabi and NYU New York Centers for Genomics and Systems Biology. Conclusions: NASQAR empowers non-programming experts with a versatile and intuitive toolbox to easily and efficiently explore, analyze, and visualize their Transcriptomics data interactively. Popular tools for a variety of applications are currently available, including Transcriptome Data Preprocessing, RNA-seq Analysis (including Single-cell RNA-seq), Metagenomics, and Gene Enrichment.


2019 ◽  
Author(s):  
Chen Yang ◽  
Chenkai Li ◽  
Ka Ming Nip ◽  
René L Warren ◽  
Inanc Birol

Abstract: As a widespread RNA processing mechanism, alternative polyadenylation plays a crucial role in gene regulation. To help decipher its underlying mechanism and understand its impact, it is desirable to comprehensively profile 3’-untranslated region cleavage and associated polyadenylation sites. State-of-the-art polyadenylation site detection tools are known to be influenced by library preparation artefacts or manually selected features. Moreover, recently published machine learning methods have only been tested on pre-constructed datasets, thus lacking validation on experimental data. Here we present Terminitor, the first deep neural network-based profiling pipeline to make predictions from RNA-seq data. We show how Terminitor outperforms competing tools in sensitivity and precision on experimental transcriptome sequencing data, and demonstrate its use with data from short- and long-read sequencing technologies. For species without a good reference transcriptome annotation, Terminitor is still able to pass on the information learnt from a related species and make reasonable predictions. We used Terminitor to showcase how single nucleotide variations can create or destroy polyadenylated cleavage sites in human RNA-seq samples.
Author Summary: 3’ cleavage and polyadenylation of pre-mRNA is part of the RNA maturation process. One gene can be cleaved at different positions at its 3’ end, a process known as alternative polyadenylation; identifying the correct polyadenylated cleavage site (poly(A) CS) is therefore essential to unveil its role in gene regulation under different physiological and pathological conditions. The current poly(A) CS prediction tools are either heavily influenced by RNA-Seq library preparation artefacts or have only been designed and tested on ad hoc datasets, lacking association with real-world applications.
In this study, we present a deep learning model, Terminitor, that predicts the probability of a nucleotide sequence containing a poly(A) CS, and validate its performance on human and mouse data. Along with the model, we propose a poly(A) CS profiling pipeline for RNA-seq data. We benchmarked our pipeline against competing tools and achieved higher sensitivity and precision in experimental data. The usage of Terminitor is not limited to genome and transcriptome annotation, and we expect it to facilitate the identification of novel isoforms, improve the accuracy of transcript quantification and differential expression analysis, and contribute to the repertoire of reference transcriptome annotation.


2015 ◽  
Vol 9S4 ◽  
pp. BBI.S29333 ◽  
Author(s):  
Stefan E. Seemann ◽  
Christian Anthon ◽  
Oana Palasca ◽  
Jan Gorodkin

The era of high-throughput sequencing has made it relatively simple to sequence genomes and transcriptomes of individuals from many species. In order to analyze the resulting sequencing data, high-quality reference genome assemblies are required. However, this is still a major challenge, and many domesticated animal genomes still need to be sequenced more deeply in order to produce high-quality assemblies. Meanwhile, ironically, the extent to which RNA-seq and other next-generation data is produced frequently far exceeds that of the genomic sequence. Furthermore, basic comparative analysis is often affected by the lack of genomic sequence. Herein, we quantify the quality of the genome assemblies of 20 domesticated animals and related species by assessing a range of measurable parameters, and we show that there is a positive correlation between the fraction of mappable reads from RNA-seq data and genome assembly quality. We rank the genomes by their assembly quality and discuss the implications for genotype analyses.

