A unified haplotype-based method for accurate and comprehensive variant calling

Mapping Intimacies ◽

10.1101/456103 ◽

2018 ◽

Cited By ~ 3

Author(s):

Daniel P Cooke ◽

David C Wedge ◽

Gerton Lunter

Keyword(s):

De Novo ◽

Variant Calling ◽

Normal Sample ◽

Sequencing Data ◽

Somatic Variation ◽

Data Set ◽

Small Complex ◽

Physical Linkage ◽

Germline Variation ◽

Almost All

Haplotype-based variant callers, which consider physical linkage between variant sites, are currently among the best tools for germline variation discovery and genotyping from short-read sequencing data. However, almost all such tools were designed specifically for detecting common germline variation in diploid populations, and give sub-optimal results in other scenarios. Here we present Octopus, a versatile haplotype-based variant caller that uses a polymorphic Bayesian genotyping model capable of modeling sequencing data from a range of experimental designs within a unified haplotype-aware framework. We show that Octopus accurately calls de novo mutations in parent-offspring trios and germline variants in individuals, including SNVs, indels, and small complex replacements such as microinversions. In addition, using a carefully designed synthetic-tumour data set derived from clean sequencing data from a sample with known germline haplotypes, and observed mutations in a large cohort of tumour samples, we show that Octopus accurately characterizes germline and somatic variation in tumours, both with and without a paired normal sample. Sequencing reads and prior information are combined to phase called genotypes of arbitrary ploidy, including those with somatic mutations. Octopus also outputs realigned evidence BAMs to aid validation and interpretation.

Aquila_stLFR: assembly based variant calling package for stLFR and hybrid assembly for linked-reads

10.1101/742239 ◽

2019 ◽

Cited By ~ 1

Author(s):

Xin Zhou ◽

Lu Zhang ◽

Xiaodong Fang ◽

Yichen Liu ◽

David L. Dill ◽

...

Keyword(s):

De Novo ◽

Low Cost ◽

Variant Calling ◽

Hybrid Assembly ◽

Structural Variants ◽

Sequencing Data ◽

Single Tube ◽

Large Numbers ◽

Key Characteristics ◽

Hybrid Assemblies

Human diploid genome assembly enables identifying maternal and paternal genetic variations. Algorithms based on 10x linked-read sequencing have been developed for de novo assembly, variant calling and haplotyping. Another linked-read technology, single tube long fragment read (stLFR), has recently provided a low-cost single tube solution that can enable long fragment data. However, no existing software is available for human diploid assembly and variant calls. We develop Aquila stLFR to adapt to the key characteristics of stLFR. Aquila stLFR assembles near-perfect diploid assembled contigs, and the assembly-based variant calling shows that Aquila stLFR detects large numbers of structural variants which were not easily spanned by Illumina short reads. Furthermore, the hybrid assembly mode Aquila hybrid allows a hybrid assembly based on both stLFR and 10x linked-reads libraries, demonstrating that these two technologies can always be complementary to each other for assembly to improve contiguity and variant detection, regardless of the assembly quality of the library itself from a single sequencing technology. The overlapped structural variants (SVs) from two independent sequencing data sets of the same individual, and the SVs from hybrid assemblies, provide us a high-confidence profile to study them. Availability: Source code and documentation are available at https://github.com/maiziex/Aquila_stLFR.

DeNovoCNN: A deep learning approach to de novo variant calling in next generation sequencing data

10.1101/2021.09.20.461072 ◽

2021 ◽

Author(s):

Gelana Khazeeva ◽

Karolis Sablauskas ◽

Bart van der Sanden ◽

Wouter Steyaert ◽

Michael Kwint ◽

...

Keyword(s):

Exome Sequencing ◽

De Novo ◽

Genetic Disorders ◽

Variant Calling ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Accurate Identification ◽

Whole Exome ◽

De Novo Variant ◽

Generation Sequencing

De novo mutations (DNMs) are an important cause of genetic disorders. The accurate identification of DNMs from sequencing data is therefore fundamental to rare disease research and diagnostics. Unfortunately, identifying reliable DNMs remains a major challenge due to sequence errors, uneven coverage, and mapping artifacts. Here, we developed a deep convolutional neural network (CNN) DNM caller (DeNovoCNN) that encodes alignment of sequence reads for a trio as 160×164 resolution images. DeNovoCNN was trained on DNMs of whole exome sequencing (WES) of 2003 trios, achieving on average 99.2% recall and 93.8% precision. We find that DeNovoCNN has increased recall/sensitivity and precision compared to existing de novo calling approaches (GATK, DeNovoGear, Samtools) based on the Genome in a Bottle reference dataset. Sanger validations of DNMs called in both exome and genome datasets confirm that DeNovoCNN outperforms existing methods. Most importantly, we show that DeNovoCNN is robust against different exome sequencing and analysis approaches, thereby allowing it to be applied on other datasets. DeNovoCNN is freely available and can be run on existing alignment (BAM/CRAM) and variant calling (VCF) files from WES and WGS without a need for variant recalling.
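
For readers less familiar with the two metrics quoted above, the sketch below shows how recall and precision are conventionally computed from true-positive, false-positive and false-negative call counts. The function names and the counts in the example are hypothetical illustrations; they are not taken from the DeNovoCNN evaluation.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported calls that are real: TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of real variants that were reported: TP / (TP + FN)."""
    return tp / (tp + fn)


# Hypothetical counts for one caller on a truth set (illustrative only).
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.3f} recall={r:.3f}")
```

A caller that reports fewer false positives raises precision, while one that misses fewer true DNMs raises recall; the abstract claims gains on both.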

Extensive sequencing of seven human genomes to characterize benchmark reference materials

10.1101/026468 ◽

2015 ◽

Cited By ~ 9

Author(s):

Justin M Zook ◽

David Catoe ◽

Jennifer McDaniel ◽

Lindsay Vang ◽

Noah Spies ◽

...

Keyword(s):

Human Genome ◽

Reference Materials ◽

De Novo ◽

Variant Calling ◽

Genome Project ◽

Genome Comparison ◽

Personal Genome ◽

Sequencing Data ◽

Sequencing Technologies ◽

Human Genomes

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.

ntHits: de novo repeat identification of genomics data using a streaming approach

10.1101/2020.11.02.365809 ◽

2020 ◽

Author(s):

Hamid Mohamadi ◽

Justin Chu ◽

Lauren Coombe ◽

Rene Warren ◽

Inanc Birol

Keyword(s):

Dna Sequencing ◽

Large Scale ◽

De Novo ◽

Software Tool ◽

Segmental Duplications ◽

Sequencing Data ◽

Exact Methods ◽

Data Set ◽

Repeat Elements ◽

Streaming Algorithm

Motivation: Repeat elements such as satellites, transposons, high number of gene copies, and segmental duplications are abundant in eukaryotic genomes. They often induce many local alignments, complicating sequence assembly and comparisons between genomes and analysis of large-scale duplications and rearrangements. Hence, identification and classification of repeats is a fundamental step in many genomics applications and their downstream analysis tools. Results: In this work, we present an efficient streaming algorithm and software tool, ntHits, for de novo repeat identification based on the statistical analysis of the k-mer content profile of large-scale DNA sequencing data. In the proposed algorithm, we first obtain the k-mer coverage histograms of input datasets using the ntCard algorithm, an efficient streaming algorithm for estimating the k-mer coverage histograms. From the obtained k-mer coverage histogram, the repetitive k-mers would present a long tail to the distribution of k-mer coverage profile. Experimental results show that ntHits can efficiently and accurately identify the repeat content in large-scale DNA sequencing data. For example, ntHits accurately identifies the repeat k-mers in the white spruce sequencing data set with 96× sequencing coverage in about 12 hours and using less than 150GB of memory, while using the exact methods for reporting the repeated k-mers takes several days and terabytes of memory and disk space. Availability: ntHits is written in C++ and is released under the MIT License. It is freely available at https://github.com/bcgsc/[email protected]
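
The idea of reading repeats off the long tail of a k-mer coverage histogram can be illustrated with a toy sketch. Note the assumptions: this uses exact in-memory counting with `collections.Counter`, which is the kind of exact method the abstract contrasts ntHits against, not the ntCard streaming estimate; the function names, the toy reads, and the fixed `threshold` are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Exact k-mer counts over all reads (toy stand-in for a streaming estimator)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_histogram(counts):
    """Map each coverage value to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


def repeat_kmers(counts, threshold):
    """Report k-mers in the high-coverage tail, i.e. seen at least `threshold` times."""
    return {kmer for kmer, c in counts.items() if c >= threshold}


# Toy reads; "ACGT" and its neighbours recur and end up in the histogram tail.
reads = ["ACGTACGTAC", "TTGACGTACG", "GGCATTACGT"]
counts = count_kmers(reads, k=4)
hist = coverage_histogram(counts)
repeats = repeat_kmers(counts, threshold=3)
print(sorted(hist.items()))
print(sorted(repeats))
```

In real data the threshold would be derived from the shape of the histogram rather than fixed by hand, and exact counting would not scale to the data sizes the abstract describes.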

Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project

Wellcome Open Research ◽

10.12688/wellcomeopenres.15126.2 ◽

2019 ◽

Vol 4 ◽

pp. 50 ◽

Cited By ~ 7

Author(s):

Ernesto Lowy-Gallego ◽

Susan Fairley ◽

Xiangqun Zheng-Bradley ◽

Magali Ruffier ◽

Laura Clarke ◽

...

Keyword(s):

De Novo ◽

Variant Calling ◽

Final Phase ◽

1000 Genomes Project ◽

Data Set ◽

1000 Genomes ◽

Project Data

We present a set of biallelic SNVs and INDELs, from 2,548 samples spanning 26 populations from the 1000 Genomes Project, called de novo on GRCh38. We believe this will be a useful reference resource for those using GRCh38. It represents an improvement over the “lift-overs” of the 1000 Genomes Project data that have been available to date by encompassing all of the GRCh38 primary assembly autosomes and pseudo-autosomal regions, including novel, medically relevant loci. Here, we describe how the data set was created and benchmark our call set against that produced by the final phase of the 1000 Genomes Project on GRCh37 and the lift-over of that data to GRCh38.

UVC: universality-based calling of small variants using pseudo-neural networks

10.1101/2020.08.23.263749 ◽

2020 ◽

Author(s):

Xiaofei Zhao ◽

Allison Hu ◽

Sizhen Wang ◽

Xiaoyue Wang

Keyword(s):

Neural Network ◽

State Of The Art ◽

Variant Calling ◽

The State ◽

Training Data ◽

Normal Sample ◽

Sequencing Data ◽

Damage Repair ◽

Biological Insight ◽

Sensitivity Specificity

We describe UVC (https://github.com/genetronhealth/uvc), an open-source method for calling small somatic variants. UVC is aware of both unique molecular identifiers (UMIs) and the tumor-matched normal sample. UVC utilizes the following power-law universality that we discovered: allele fraction is inversely proportional to the cubic root of variant-calling error rate. Moreover, UVC utilizes a pseudo-neural network (PNN). A PNN is similar to a deep neural network but does not require any training data. UVC outperformed Mageri and smCounter2, the state-of-the-art UMI-aware variant callers, on the tumor-only datasets used for publishing these two variant callers. Also, UVC outperformed Mutect2 and Strelka2, the state-of-the-art variant callers for tumor-normal pairs, on the Genome-in-a-Bottle somatic truth sets. UVC outperformed Mutect2 and Strelka2 on 21 in silico mixtures simulating 21 combinations of tumor purity and normal purity. Performance is measured by using the sensitivity-specificity trade-off for all called variants. The improved variant calls generated by UVC from previously published UMI-based sequencing data are able to provide additional biological insight about DNA damage repair. The versatility and robustness of UVC make it a useful tool for variant calling in clinical settings.
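
The power-law relation stated in this abstract can be written out explicitly. This is only a restatement of that one sentence: the symbols $f$ (allele fraction), $\varepsilon$ (variant-calling error rate) and the proportionality constant $c$ are notation introduced here, not taken from the paper.

```latex
f = \frac{c}{\varepsilon^{1/3}}
\quad\Longleftrightarrow\quad
\varepsilon = \left(\frac{c}{f}\right)^{3}
```

Under this reading, doubling the allele fraction corresponds to an eightfold lower error rate, since $2^{3} = 8$.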

Analysis of a small outbreak of Shiga toxin-producing Escherichia coli O157:H7 using long-read sequencing

Microbial Genomics ◽

10.1099/mgen.0.000545 ◽

2021 ◽

Vol 7 (3) ◽

Author(s):

David R. Greig ◽

Claire Jenkins ◽

Saheer E. Gharbia ◽

Timothy J. Dallman

Keyword(s):

Reference Genome ◽

Genetic Relatedness ◽

De Novo ◽

Methodological Approach ◽

Foodborne Pathogen ◽

Variant Calling ◽

Sequencing Data ◽

Deletion Event ◽

Base Calling ◽

Long Read

Compared to short-read sequencing data, long-read sequencing facilitates single contiguous de novo assemblies and characterization of the prophage region of the genome. Here, we describe our methodological approach to using Oxford Nanopore Technology (ONT) sequencing data to quantify genetic relatedness and to look for microevolutionary events in the core and accessory genomes to assess the within-outbreak variation of four genetically and epidemiologically linked isolates. Analysis of both Illumina and ONT sequencing data detected one SNP between the four sequences of the outbreak isolates. The variant calling procedure highlighted the importance of masking homologous sequences in the reference genome regardless of the sequencing technology used. Variant calling also highlighted the systemic errors in ONT base-calling and ambiguous mapping of Illumina reads that results in variations in the genetic distance when comparing one technology to the other. The prophage component of the outbreak strain was analysed, and nine of the 16 prophages showed some similarity to the prophage in the Sakai reference genome, including the stx2a-encoding phage. Prophage comparison between the outbreak isolates identified minor genome rearrangements in one of the isolates, including an inversion and a deletion event. The ability to characterize the accessory genome in this way is the first step to understanding the significance of these microevolutionary events and their impact on the evolutionary history, virulence and potentially the likely source and transmission of this zoonotic, foodborne pathogen.

Custom workflows to improve joint variant calling from multiple related tumour samples: FreeBayesSomatic and Strelka2Pass

Bioinformatics ◽

10.1093/bioinformatics/btab606 ◽

2021 ◽

Author(s):

S Hollizeck ◽

S Q Wong ◽

B Solomon ◽

D Chandranada ◽

S-J Dawson

Keyword(s):

Somatic Mutations ◽

Source Code ◽

Variant Calling ◽

Supplementary Information ◽

Normal Sample ◽

Supplementary Data ◽

Sequencing Data ◽

Normal Pair ◽

Single Tumour ◽

Matched Normal Sample

Summary: This work describes two novel workflows for variant calling that extend the widely used algorithms of Strelka2 and FreeBayes to call somatic mutations from multiple related tumour samples and one matched normal sample. We show that these workflows offer higher precision and recall than their single tumour-normal pair equivalents in both simulated and clinical sequencing data. Availability and Implementation: Source code freely available at the following link: https://atlassian.petermac.org.au/bitbucket/projects/DAW/repos/multisamplevariantcalling and executable through Janis (https://github.com/PMCC-BioinformaticsCore/janis) under the GPLv3 licence. Supplementary information: Supplementary data are available at Bioinformatics online.

A validated lineage-derived somatic truth data set enables benchmarking in cancer genome analysis

Communications Biology ◽

10.1038/s42003-020-01460-9 ◽

2020 ◽

Vol 3 (1) ◽

Author(s):

Megan Shand ◽

Jose Soto ◽

Lee Lichtenstein ◽

David Benjamin ◽

Yossi Farjoun ◽

...

Keyword(s):

Confidence Region ◽

Cell Lineage ◽

Cancer Cell Line ◽

Colon Cancer Cell Line ◽

Synthetic Methods ◽

Data Sets ◽

Sequencing Data ◽

Somatic Variation ◽

Data Set ◽

Benchmark Data

Existing cancer benchmark data sets for human sequencing data use germline variants, synthetic methods, or expensive validations, none of which are satisfactory for providing a large collection of true somatic variation across a whole genome. Here we propose a data set, Lineage derived Somatic Truth (LinST), of short somatic mutations in the HT115 colon cancer cell-line, that are validated using a known cell lineage that includes thousands of mutations and a high confidence region covering 2.7 gigabases per sample.

Reanalysis of deep-sequencing data from Austria points towards a small SARS-COV-2 transmission bottleneck on the order of one to three virions

10.1101/2021.02.22.432096 ◽

2021 ◽

Author(s):

Michael A. Martin ◽

Katia Koelle

Keyword(s):

Genetic Variation ◽

Deep Sequencing ◽

De Novo ◽

Low Frequency ◽

Variant Calling ◽

Population Level ◽

Sequencing Data ◽

Deep Sequencing Data ◽

Computational Analyses ◽

Transmission Bottleneck

An early analysis of SARS-CoV-2 deep-sequencing data that combined epidemiological and genetic data to characterize the transmission dynamics of the virus in and beyond Austria concluded that the size of the virus’s transmission bottleneck was large – on the order of 1000 virions. We performed new computational analyses using these deep-sequenced samples from Austria. Our analyses included characterization of transmission bottleneck sizes across a range of variant calling thresholds and examination of patterns of shared low-frequency variants between transmission pairs in cases where de novo genetic variation was present in the recipient. From these analyses, among others, we found that SARS-CoV-2 transmission bottlenecks are instead likely to be very tight, on the order of 1-3 virions. These findings have important consequences for understanding how SARS-CoV-2 evolves between hosts and the processes shaping genetic variation observed at the population level.
