TNscope: Accurate Detection of Somatic Mutations with Haplotype-based Variant Candidate Detection and Machine Learning Filtering

2018 ◽  
Author(s):  
Donald Freed ◽  
Renke Pan ◽  
Rafael Aldana

Abstract: Detection of somatic mutations in tumor samples is important in the clinic, where treatment decisions are increasingly based upon molecular diagnostics. However, accurate detection of these mutations is difficult, due in part to intra-tumor heterogeneity, contamination of the tumor sample with normal tissue and pervasive structural variation. Here, we describe Sentieon TNscope, a haplotype-based somatic variant caller with increased accuracy relative to existing methods. An early engineering version of TNscope was used in our submission to the most recent ICGC-DREAM Somatic Mutation calling challenge. In that challenge, TNscope is the leader in accuracy for SNVs, indels and SVs. To further improve variant calling accuracy, we combined the improvements in the variant caller with machine learning. We benchmarked TNscope using in-silico mixtures of well-characterized Genome in a Bottle (GIAB) samples. TNscope displays higher accuracy than the other benchmarked tools and the accuracy is substantially improved by the machine learning model.

2021 ◽  
Author(s):  
Parikshit Sanyal ◽  
Sayak Paul ◽  
Avinash Das

Introduction Machine learning and artificial intelligence (AI) models have been applied in histopathology to solve specific problems such as detection of metastasis in lymph nodes and immunohistochemical scoring. We aimed to develop a machine learning model which can be trained in histopathology from the basics, i.e. identification of normal tissue. We tried to replicate the process through which a human pathologist learns to recognise normal tissue from histological sections, and to evaluate the performance of a machine learning model at this task. Materials and methods A total of 658 histologic images were anonymised and microphotographed at 10x magnification, under identical illumination conditions, with a Magnus DC5 integrated microphotography system. The images were split into two subsets, training (386 images) and validation (272 images). The images belonged to seven classes of tissue: brain, intestine, kidney, liver, lungs, muscle and skin. Archived material from the hospital was used for the study. A machine learning model using a convolutional neural network (CNN) was developed on the Keras platform, using the convolution layers of a pretrained VGG16 model. The model was trained with the training set of images over 10 epochs. After training, performance of the model was assessed on the validation set. Results The model achieved 88.24% accuracy in classifying the images of the validation set. Errors were most frequent in recognising images of kidney (14 errors, 33.33%). The commonest error was wrongly classifying kidney tissue as liver (7 errors). Analysis of the deeper layers of the neural network revealed specific patterns in images which were wrongly classified. Conclusion The results of the present study indicate that a convolutional neural network might be trained in histology similarly to a trainee pathologist. The study represents the first step towards developing a machine learning model as a generalised histopathological image classifier.
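
The reported figures can be cross-checked with simple arithmetic. The sketch below assumes that accuracy means correct predictions divided by validation images, and that the 33.33% kidney figure is the share of kidney images that were misclassified; the abstract does not spell out either definition, so both are assumptions.

```python
# Cross-check of the reported validation figures.
# Assumption: accuracy = correct / total validation images.
# Assumption: the 33.33% kidney error rate is relative to the kidney images.
VALIDATION_IMAGES = 272
REPORTED_ACCURACY = 0.8824
KIDNEY_ERRORS = 14
KIDNEY_ERROR_RATE = 0.3333

# Number of correctly classified validation images implied by the accuracy.
correct = round(VALIDATION_IMAGES * REPORTED_ACCURACY)

# Total number of misclassified validation images.
total_errors = VALIDATION_IMAGES - correct

# Number of kidney images implied by 14 errors at a 33.33% error rate.
kidney_images = round(KIDNEY_ERRORS / KIDNEY_ERROR_RATE)


def accuracy_percent(n_correct, n_total):
    """Return accuracy as a percentage rounded to two decimals."""
    return round(100.0 * n_correct / n_total, 2)
```

Under these assumptions the numbers are mutually consistent: 240 of 272 images correct reproduces the stated 88.24%, and kidney errors account for 14 of the 32 misclassifications.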


2018 ◽  
Author(s):  
Rebecca F. Halperin ◽  
Winnie S. Liang ◽  
Sidharth Kulkarni ◽  
Erica E. Tassone ◽  
Jonathan Adkins ◽  
...  

Abstract: Archival tumor samples represent a potentially rich resource of annotated specimens for translational genomics research. However, standard variant calling approaches require a matched normal sample from the same individual, which is often not available in the retrospective setting, making it difficult to distinguish between true somatic variants and germline variants that are private to the individual. Archival sections often contain adjacent normal tissue, but this normal tissue can include infiltrating tumor cells. Comparative somatic variant callers are designed to exclude variants present in the normal sample, so a novel approach is required to leverage sequencing of adjacent normal tissue for somatic variant calling. Here we present LumosVar 2.0, a software package designed to jointly analyze multiple samples from the same patient. The approach is based on the concept that the allelic fraction of somatic variants, but not germline variants, would be reduced in samples with low tumor content. LumosVar 2.0 estimates allele-specific copy number and tumor sample fractions from the data, and uses the model to determine expected allelic fractions for somatic and germline variants and classify variants accordingly. To evaluate the use of LumosVar 2.0 to jointly call somatic variants with tumor and adjacent normal samples, we used a glioblastoma dataset with matched high tumor content, low tumor content, and germline exome sequencing data (to define true somatic variants) available for each patient. We show that both sensitivity and positive predictive value are improved by analyzing the high tumor and low tumor samples jointly, compared to analyzing the samples individually or to in-silico pooling of the two samples. Finally, we applied this approach to a set of breast and prostate archival tumor samples for which normal samples were not available for germline sequencing, but tumor blocks containing adjacent normal tissue were available for sequencing. 
Joint analysis using LumosVar 2.0 detected several variants, including known cancer hotspot mutations that were not detected by standard somatic variant calling tools using the adjacent normal as a reference. Together, these results demonstrate the potential utility of leveraging paired tissue samples to improve somatic variant calling when a constitutional DNA sample is not available.
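
The core idea, that somatic allelic fractions fall with tumor content while germline fractions do not, can be illustrated with a standard mixture formula. This is a minimal sketch and not LumosVar 2.0's actual model: it assumes normal cells are diploid, a single clonal tumor population, and the function and parameter names are hypothetical.

```python
def expected_allelic_fraction(tumor_fraction, tumor_copies,
                              variant_copies_tumor, variant_copies_normal):
    """Expected variant allelic fraction in a mix of tumor and normal cells.

    Assumptions (not taken from the source): normal cells carry two copies
    of the locus, and all tumor cells share the same copy-number state.
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")
    normal_fraction = 1.0 - tumor_fraction
    variant_alleles = (tumor_fraction * variant_copies_tumor
                       + normal_fraction * variant_copies_normal)
    total_alleles = tumor_fraction * tumor_copies + normal_fraction * 2
    if total_alleles == 0:
        raise ValueError("no copies of the locus in the sample")
    return variant_alleles / total_alleles


def somatic_af(tumor_fraction, tumor_copies=2, variant_copies=1):
    """Somatic variant: present in tumor cells only."""
    return expected_allelic_fraction(tumor_fraction, tumor_copies,
                                     variant_copies, 0)


def germline_het_af(tumor_fraction, tumor_copies=2, variant_copies_tumor=1):
    """Heterozygous germline variant: one copy in every normal cell."""
    return expected_allelic_fraction(tumor_fraction, tumor_copies,
                                     variant_copies_tumor, 1)
```

In a copy-neutral region, a heterozygous somatic variant drops from 0.4 at 80% tumor content to 0.1 at 20%, while a heterozygous germline variant stays at 0.5 in both samples, which is the contrast a joint analysis can exploit.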


Blood ◽  
2018 ◽  
Vol 132 (Supplement 1) ◽  
pp. 5275-5275
Author(s):  
Stephan Hutter ◽  
Niroshan Nadarajah ◽  
Manja Meggendorfer ◽  
Wolfgang Kern ◽  
Torsten Haferlach ◽  
...  

Abstract Background: The human genome is very heterogeneous on the individual level, which challenges interpretation of whole genome sequencing (WGS) data. In order to reduce complexity in tumor genetics, WGS of a tumor is performed together with WGS of "normal" tissue from the respective patient (i.e. fingernails, skin biopsy, hair, buccal swabs), which is used as the germline sequence (tumor/matched normal approach, TMNA). This approach allows the extraction of somatic mutations acquired in the tumor through sophisticated algorithms. In routine diagnostics, especially in hematological neoplasms, "normal" tissue representing the germline sequence is usually not available, which prohibits the standard use of somatic tumor/normal variant calling tools. Aims: On the road to implementing WGS in routine diagnostics, we tested a TMNA in comparison to a tumor/unmatched normal approach (TUNA), where pooled genomic DNA (Promega, Fitchburg, WI) was used instead of a matched normal. Cohorts and Methods: 9 samples from patients with hematological neoplasms (7 AML, 2 ALL) were sequenced at diagnosis on Illumina HiSeqX machines (Illumina, San Diego, CA), along with complete remission samples to serve as matched normals for the TMNA. For comparison, a mixture of genomic DNA from multiple anonymous donors was used as "normal" for the TUNA. Read mapping and somatic variant calling were performed using the tools Isaac3 and Strelka2, respectively. Statistical differences between groups were assessed by two-sided Mann-Whitney tests. Results: The TMNA produced a median of 17,700 somatic variant calls, while the TUNA produced 419,000. This 24-fold disparity is mainly due to residual germline variants missed by the TUNA. A large fraction of TMNA variants (57%) was located in regions of known low confidence variant calling (as defined by the Genome in a Bottle Consortium) and likely contains mostly artifacts. 
After removing these regions from analysis, a median of 7,700 and 331,000 variants remained in the TMNA and TUNA datasets, respectively. In order to eliminate germline variants, the gnomAD population database was queried and any present variants were discarded. As expected, this removed over 95% of all variants from the TUNA dataset, but also 41% from the TMNA dataset. The latter might be attributed to common germline variants falsely being called as somatic by the TMNA and/or somatic mutations occurring at polymorphic sites. After this filtering step a median of 3,770 and 15,500 variants remained in the TMNA and TUNA datasets, respectively. This 4-fold disparity in variant number is most likely caused by rare germline variation remaining in the TUNA dataset. Of the remaining TMNA variants only 65% could be found within the larger TUNA dataset. A major factor governing this observation was variant allele frequency (VAF). Variants that overlapped between both datasets had on average higher VAFs than those unique to the TMNA (p < 2.2 × 10^-16). Further inspection of the VAF distribution among samples revealed a bimodal or nearly bimodal distribution for all samples. All distributions shared a sharp peak centered on a VAF of 10%, which was unexpected given that the estimated tumor fractions of the samples predict VAFs of 25% and higher. Variants in this lower part of the distribution (arbitrarily defined as VAFs < 20%) constitute on average 50% of all variants in a TMNA sample, with extremes reaching 95% in 2 samples. These low frequency variants show distinctly lower mapping qualities than variants with VAFs ≥ 20% (p < 2.2 × 10^-16), i.e. they reside in regions of elevated mapping ambiguity, which potentially leads to the creation of artifacts. Analyzing the overlap of only the higher VAF variants, we find that 97.4% of all TMNA variants can also be found in the TUNA dataset. 
Conclusions: Comparing tumor samples to matched normal material from the respective patient is the preferred approach for somatic variant calling in WGS data; however, even with modern algorithms, false positives due to technical artifacts seem to be highly abundant. A deeper understanding of the nature of these artifacts is crucial for developing appropriate filtering schemes and improving variant calling algorithms. In the absence of a matched normal, using a TUNA can uncover the vast majority (97.4%) of high-quality variants found in a TMNA; however, distinguishing true somatic variants from residual rare germline variation in a TUNA remains a major challenge. Disclosures Hutter: MLL Munich Leukemia Laboratory: Employment. Nadarajah: MLL Munich Leukemia Laboratory: Employment. Meggendorfer: MLL Munich Leukemia Laboratory: Employment. Kern: MLL Munich Leukemia Laboratory: Employment, Equity Ownership. Haferlach: MLL Munich Leukemia Laboratory: Employment, Equity Ownership.
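
The filtering steps and the overlap comparison described in this abstract can be sketched as follows. This is an illustration only: the `Variant` record, its field names and the helper functions are hypothetical, and the real analysis used Strelka2 calls, Genome in a Bottle confidence regions and gnomAD lookups rather than precomputed flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    key: str                        # hypothetical identifier, e.g. "chr1:12345:A>T"
    vaf: float                      # variant allele frequency, 0..1
    in_low_confidence_region: bool  # assumed precomputed region flag
    in_gnomad: bool                 # assumed precomputed population-database flag


def filter_variants(variants, min_vaf=0.20):
    """Drop low-confidence-region, gnomAD-listed and low-VAF variants."""
    return [
        v for v in variants
        if not v.in_low_confidence_region
        and not v.in_gnomad
        and v.vaf >= min_vaf
    ]


def overlap_fraction(tmna, tuna):
    """Fraction of TMNA variants that also appear in the TUNA call set."""
    if not tmna:
        return 0.0
    tuna_keys = {v.key for v in tuna}
    return sum(v.key in tuna_keys for v in tmna) / len(tmna)


def fold_difference(larger, smaller):
    """Ratio between two median variant counts."""
    return larger / smaller
```

Applied to the reported medians, `fold_difference(419_000, 17_700)` is about 23.7 and `fold_difference(15_500, 3_770)` is about 4.1, matching the stated 24-fold and 4-fold disparities.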


2018 ◽  
Author(s):  
Steen Lysgaard ◽  
Paul C. Jennings ◽  
Jens Strabo Hummelshøj ◽  
Thomas Bligaard ◽  
Tejs Vegge

A machine learning model is used as a surrogate fitness evaluator in a genetic algorithm (GA) optimization of the atomic distribution of Pt-Au nanoparticles. The machine learning accelerated genetic algorithm (MLaGA) yields a 50-fold reduction of required energy calculations compared to a traditional GA.
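
The general idea of a surrogate fitness evaluator can be sketched on a toy problem. This is not the authors' MLaGA: the nearest-neighbour surrogate, the bit-string representation and the toy energy function are all stand-ins chosen for brevity, and every name below is hypothetical.

```python
import random


def expensive_energy(bits):
    """Stand-in for a costly energy calculation (toy objective, lower is better)."""
    # Count adjacent equal sites: a toy proxy that favours mixed arrangements.
    return sum(1 for a, b in zip(bits, bits[1:]) if a == b)


class NearestNeighborSurrogate:
    """Cheap predictor: returns the energy of the closest evaluated candidate."""

    def __init__(self):
        self.samples = []

    def add(self, bits, energy):
        self.samples.append((tuple(bits), energy))

    def predict(self, bits):
        closest = min(
            self.samples,
            key=lambda s: sum(x != y for x, y in zip(s[0], bits)),
        )
        return closest[1]


def mutate(bits, rng):
    """Flip one randomly chosen site."""
    child = list(bits)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child


def surrogate_ga(n_bits=12, pop_size=6, offspring=30, keep=2,
                 generations=5, seed=0):
    rng = random.Random(seed)
    surrogate = NearestNeighborSurrogate()
    calls = 0
    population = []
    # The initial population is always evaluated with the expensive function.
    for _ in range(pop_size):
        bits = [rng.randint(0, 1) for _ in range(n_bits)]
        energy = expensive_energy(bits)
        calls += 1
        surrogate.add(bits, energy)
        population.append((energy, bits))
    for _ in range(generations):
        # Generate many candidates, but rank them with the cheap surrogate.
        candidates = [mutate(rng.choice(population)[1], rng)
                      for _ in range(offspring)]
        candidates.sort(key=surrogate.predict)
        # Only the most promising candidates get an expensive evaluation.
        for bits in candidates[:keep]:
            energy = expensive_energy(bits)
            calls += 1
            surrogate.add(bits, energy)
            population.append((energy, bits))
        population = sorted(population, key=lambda p: p[0])[:pop_size]
    best_energy, best_bits = min(population, key=lambda p: p[0])
    return best_energy, best_bits, calls
```

With the default settings the sketch spends 16 expensive evaluations, whereas evaluating every offspring directly would take 156; the saving comes entirely from letting the surrogate pre-screen candidates.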


2019 ◽  
Author(s):  
Siddhartha Laghuvarapu ◽  
Yashaswi Pathak ◽  
U. Deva Priyakumar

Recent advances in artificial intelligence, along with the development of large datasets of energies calculated using quantum mechanical (QM)/density functional theory (DFT) methods, have enabled prediction of accurate molecular energies at reasonably low computational cost. However, machine learning models that have been reported so far require the atomic positions obtained from geometry optimizations using high level QM/DFT methods as input in order to predict the energies, and do not allow for geometry optimization. In this paper, a transferable and molecule-size independent machine learning model (BAND NN) based on a chemically intuitive representation inspired by molecular mechanics force fields is presented. The model predicts the atomization energies of equilibrium and non-equilibrium structures as a sum of energy contributions from bonds (B), angles (A), nonbonds (N) and dihedrals (D) with remarkable accuracy. The robustness of the proposed model is further validated by calculations that span the conformational, configurational and reaction space. The transferability of this model to systems larger than the ones in the dataset is demonstrated by performing calculations on select large molecules. Importantly, employing the BAND NN model, it is possible to perform geometry optimizations starting from non-equilibrium structures along with predicting their energies.
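
The additive decomposition described above can be written compactly. The notation is an assumption made for illustration, since the abstract gives no symbols: each $E_X$ stands for the learned contribution of one bond, angle, nonbonded pair or dihedral.

```latex
E_{\text{atomization}}
  = \sum_{b \,\in\, \text{bonds}} E_B(b)
  + \sum_{a \,\in\, \text{angles}} E_A(a)
  + \sum_{n \,\in\, \text{nonbonds}} E_N(n)
  + \sum_{d \,\in\, \text{dihedrals}} E_D(d)
```

Because each term depends only on a local structural feature, the number of terms grows with the molecule while the per-term models stay fixed, which is consistent with the molecule-size independence claimed for the model.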


2018 ◽  
Vol 1 (1) ◽  
pp. 236-247
Author(s):  
Divya Srivastava ◽  
Rajitha B. ◽  
Suneeta Agarwal

Leaf diseases can cause a significant reduction in both the quality and quantity of agricultural production. If early and accurate detection of leaf diseases can be automated, the proper remedy can be applied in time. This paper presents a simple and computationally efficient approach for detecting diseases on leaves. Detecting a disease is of limited benefit without knowing its stage, so the paper also determines the stage of the disease by quantifying the affected area of the leaves using digital image processing and machine learning. Although a variety of leaf diseases exist, bacterial and fungal spots (Early Scorch, Late Scorch, and Leaf Spot) are the most prominent. With this in mind, the paper deals with the detection of Bacterial Blight and Fungal Spot at both an early stage (Early Scorch) and a late stage (Late Scorch) on a variety of leaves. The proposed approach is divided into two phases: in the first phase, it identifies one or more diseases present on the leaves; in the second phase, the amount of area affected by the disease or diseases is calculated. The experimental results showed 97% accuracy using the proposed approach.
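
The second phase, measuring how much of a leaf is affected, can be illustrated with a pixel count over two binary masks. This is a sketch under assumptions: the abstract does not describe how the leaf and lesion regions are segmented, so the masks are taken as given and the function name is hypothetical.

```python
def affected_area_percent(leaf_mask, disease_mask):
    """Percentage of leaf pixels that are also marked as diseased.

    Both masks are equally sized 2D lists of 0/1 values (assumed to come
    from an earlier segmentation step that is not shown here).
    """
    leaf_pixels = 0
    diseased_pixels = 0
    for leaf_row, disease_row in zip(leaf_mask, disease_mask):
        for is_leaf, is_diseased in zip(leaf_row, disease_row):
            if is_leaf:
                leaf_pixels += 1
                # Diseased pixels outside the leaf region are ignored.
                if is_diseased:
                    diseased_pixels += 1
    if leaf_pixels == 0:
        raise ValueError("leaf mask contains no leaf pixels")
    return 100.0 * diseased_pixels / leaf_pixels
```

A percentage of this kind could then be mapped to an early or late stage with a threshold, though the abstract does not state which cut-offs the authors used.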


2019 ◽  
Vol 15 (3) ◽  
pp. 206-211 ◽  
Author(s):  
Jihui Tang ◽  
Jie Ning ◽  
Xiaoyan Liu ◽  
Baoming Wu ◽  
Rongfeng Hu

Introduction: Machine learning is a useful tool for the prediction of cell-penetrating compounds as drug candidates. Materials and Methods: In this study, we developed a novel method for predicting the membrane-penetrating capability of Cell-Penetrating Peptides (CPPs). For this, we used orthogonal encoding to encode amino acids, with each amino acid position as one variable. IBM SPSS Modeler and a dataset of 533 CPPs were then used for model screening. Results: The results indicated that the Support Vector Machine (SVM) machine learning model was suitable for predicting membrane-penetrating capability. For improvement, the three CPPs with the longest lengths were used to predict CPPs. The penetration capability can be predicted with an accuracy close to 95%. Conclusion: The results indicate that using amino acid position as a variable can be a promising method for predicting the membrane-penetrating capability of CPPs.
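
Orthogonal encoding is commonly implemented as one-hot encoding of each residue at each position. The sketch below shows one such scheme; the fixed maximum length, zero padding for shorter peptides and the function name are assumptions, and the authors' exact variable layout in IBM SPSS Modeler may differ.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def orthogonal_encode(peptide, max_length):
    """One-hot encode a peptide, one block of 20 values per position.

    Assumption: peptides shorter than max_length are padded with all-zero
    blocks so that every peptide yields a vector of the same size.
    """
    if len(peptide) > max_length:
        raise ValueError("peptide is longer than max_length")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    block = len(AMINO_ACIDS)
    vector = [0] * (max_length * block)
    for position, residue in enumerate(peptide.upper()):
        if residue not in index:
            raise ValueError(f"unknown amino acid: {residue!r}")
        # Set the single element that marks this residue at this position.
        vector[position * block + index[residue]] = 1
    return vector
```

Vectors of this form can be passed directly to a classifier such as an SVM, since every peptide maps to the same number of input variables.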

