PBHoover and CigarRoller: a method for confident haploid variant calling on Pacific Biosciences data and its application to heterogeneous population analysis

Mapping Intimacies ◽

10.1101/360370 ◽

2018 ◽

Author(s):

Sarah Ramirez-Busby ◽

Afif Elghraoui ◽

Yeon Bin Kim ◽

Kellie Kim ◽

Faramarz Valafar

Keyword(s):

Single Molecule ◽

Error Rate ◽

De Novo ◽

Population Analysis ◽

Variant Calling ◽

High Sensitivity ◽

Sequencing Depth ◽

Smrt Sequencing ◽

Link Type ◽

Low Coverage

Abstract. Motivation. Single Molecule Real-Time (SMRT) sequencing has important and underutilized advantages that amplification-based platforms lack. These include a lack of systematic error (e.g. GC-bias) and complete de novo assembly (including large repetitive regions) without scaffolding. SMRT sequencing, however, suffers from a high random error rate and low sequencing depth (older chemistries). Here, we introduce PBHoover, software that uses a heuristic calling algorithm to make base calls with high certainty in low-coverage regions. This software is also capable of mixed population detection with high sensitivity. PBHoover’s CigarRoller attachment improves sequencing depth in low-coverage regions through CIGAR-string correction. Results. We tested both modules on 348 M. tuberculosis clinical isolates sequenced on C1 or C2 chemistries. On average, CigarRoller improved the percentage of usable read count from 68.9% to 99.98% in C1 runs and from 50% to 99% in C2 runs. Using the greater depth provided by CigarRoller, PBHoover was able to make base and variant calls 99.95% concordant with Sanger calls (QV33). PBHoover also detected antibiotic-resistant subpopulations that went undetected by Sanger. Using C1 chemistry, subpopulations as small as 9% of the total colony can be detected by PBHoover. This provides the most sensitive amplification-free molecular method for heterogeneity analysis and is in line with phenotypic methods’ sensitivity. This sensitivity significantly improves with the greater depth and lower error rate of the newer chemistries. Availability and Implementation. Executables are freely available under GNU GPL v3+ at http://www.gitlab.com/LPCDRP/pbhoover and http://www.gitlab.com/LPCDRP/CigarRoller. PBHoover is also available on bioconda: https://anaconda.org/bioconda/[email protected]
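The pairing of "99.95% concordant" with "QV33" in the abstract above follows from the standard Phred scale, QV = -10 log10(error rate). The short sketch below only illustrates that arithmetic; the function names are hypothetical and are not part of PBHoover or CigarRoller.

```python
import math


def concordance_to_qv(concordance: float) -> float:
    """Convert a fraction of concordant calls into a Phred-scaled quality value."""
    error_rate = 1.0 - concordance
    if error_rate <= 0.0:
        raise ValueError("perfect concordance has no finite Phred score")
    return -10.0 * math.log10(error_rate)


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value back into an error probability."""
    return 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    # 99.95% concordance corresponds to an error rate of 0.0005.
    print(f"QV for 99.95% concordance: {concordance_to_qv(0.9995):.1f}")
    print(f"Error rate at QV33: {qv_to_error_rate(33.0):.6f}")
```

Run as a script, this reproduces the rounded QV33 figure from a concordance of 0.9995.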

AsmMix: A pipeline for high quality diploid de novo assembly

10.1101/2021.01.15.426893 ◽

2021 ◽

Author(s):

Pei Wu ◽

Chao Liu ◽

Ou Wang ◽

Xia Zhao ◽

Fang Chen ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Variant Calling ◽

The Other ◽

Second Step ◽

Small Scale ◽

Mixing Process ◽

High Quality ◽

Single Molecule Sequencing ◽

Long Read

Abstract. In this paper, we report a pipeline, AsmMix, which is capable of producing both contiguous and high-quality diploid genomes. The pipeline consists of two steps. In the first step, two sets of assemblies are generated: one is based on co-barcoded reads, which are highly accurate and haplotype-resolved but contain many gaps; the other is based on single-molecule sequencing reads, which is contiguous but error-prone. In the second step, those two sets of assemblies are compared and integrated into a haplotype-resolved assembly with fewer errors. We test our pipeline using a dataset of human genome NA24385, perform variant calling from those assemblies, and then compare against the GIAB Benchmark. We show that the AsmMix pipeline can produce highly contiguous, accurate, and haplotype-resolved assemblies. In particular, the assembly mixing process can effectively reduce small-scale errors in the long-read assembly.

Sparc: a sparsity-based consensus algorithm for long erroneous sequencing reads

PeerJ ◽

10.7717/peerj.2016 ◽

2016 ◽

Vol 4 ◽

pp. e2016 ◽

Cited By ~ 18

Author(s):

Chengxi Ye ◽

Zhanshan (Sam) Ma

Keyword(s):

Error Rate ◽

Genome Assembly ◽

De Novo ◽

Consensus Sequence ◽

Variant Calling ◽

Error Rates ◽

Consensus Algorithm ◽

High Quality ◽

Oxford Nanopore ◽

Generation Sequencing

Motivation. The third generation sequencing (3GS) technology generates long sequences of thousands of bases. However, its current error rates are estimated in the range of 15–40%, significantly higher than those of the prevalent next generation sequencing (NGS) technologies (less than 1%). Fundamental bioinformatics tasks such as de novo genome assembly and variant calling require high-quality sequences that need to be extracted from these long but erroneous 3GS sequences. Results. We describe a versatile and efficient linear complexity consensus algorithm Sparc to facilitate de novo genome assembly. Sparc builds a sparse k-mer graph using a collection of sequences from a targeted genomic region. The heaviest path which approximates the most likely genome sequence is searched through a sparsity-induced reweighted graph as the consensus sequence. Sparc supports using NGS and 3GS data together, which leads to significant improvements in both cost efficiency and computational efficiency. Experiments with Sparc show that our algorithm can efficiently provide high-quality consensus sequences using both PacBio and Oxford Nanopore sequencing technologies. With only 30× PacBio data, Sparc can reach a consensus with error rate <0.5%. With the more challenging Oxford Nanopore data, Sparc can also achieve a similar error rate when combined with NGS data. Compared with the existing approaches, Sparc calculates the consensus with higher accuracy and uses approximately 80% less memory and time. The source code is available for download at https://github.com/yechengxi/Sparc.

Efficient long single molecule sequencing for cost effective and accurate sequencing, haplotyping, and de novo assembly

10.1101/324392 ◽

2018 ◽

Author(s):

Ou Wang ◽

Robert Chin ◽

Xiaofang Cheng ◽

Michelle Ka Wu ◽

Qing Mao ◽

...

Keyword(s):

Single Molecule ◽

Genome Assembly ◽

De Novo ◽

Low Cost ◽

Variant Calling ◽

Cost Effective ◽

High Quality ◽

Single Molecule Sequencing ◽

Single Tube ◽

Complex Structural

Obtaining accurate sequences from long DNA molecules is very important for genome assembly and other applications. Here we describe single tube long fragment read (stLFR), a technology that enables this at a low cost. It is based on adding the same barcode sequence to sub-fragments of the original long DNA molecule (DNA co-barcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically non-redundant co-barcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique co-barcoding of over 8 million 20-300 kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high quality variant calling and phasing into contigs up to N50 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

10.1101/008003 ◽

2014 ◽

Cited By ~ 13

Author(s):

Konstantin Berlin ◽

Sergey Koren ◽

Chen-Shan Chin ◽

James Drake ◽

Jane M Landolin ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Locality Sensitive Hashing ◽

Model Organisms ◽

Smrt Sequencing ◽

High Coverage ◽

Celera Assembler ◽

Single Molecule Sequencing ◽

Long Reads ◽

Long Read

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
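The abstract above names MinHash, a locality-sensitive hashing scheme, as the basis for finding overlaps between noisy long reads. As a rough illustration of the general idea only (not MHAP's actual implementation), the sketch below estimates the Jaccard similarity of two sequences' k-mer sets from fixed-size MinHash sketches. The k-mer size, the number of hash functions, the choice of BLAKE2b as the hash, and all function names are assumptions made for this example.

```python
import hashlib


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _seeded_hash(kmer: str, seed: int) -> int:
    """Hash a k-mer under a given seed; each seed acts as a separate hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq: str, k: int = 5, num_hashes: int = 128) -> list[int]:
    """Keep, for each hash function, the minimum hash value over all k-mers."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(_seeded_hash(km, seed) for km in kmer_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sketch_a: list[int], sketch_b: list[int]) -> float:
    """Fraction of matching sketch positions; approximates k-mer set Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    read_a = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
    read_b = "ACGTACGTTAGCCGATAGGCTTAACGGTTCCA"  # one substitution relative to read_a
    print(estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b)))
```

Because each sketch has a fixed length regardless of read length, candidate overlaps can be screened by comparing sketches rather than aligning every pair of reads, which is the property that makes this family of methods attractive for large genomes.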

New synthetic-diploid benchmark for accurate variant calling evaluation

10.1101/223297 ◽

2017 ◽

Cited By ~ 9

Author(s):

Heng Li ◽

Jonathan M Bloom ◽

Yossi Farjoun ◽

Mark Fleharty ◽

Laura Gauthier ◽

...

Keyword(s):

Cell Lines ◽

Human Cell ◽

Error Rate ◽

De Novo ◽

Variant Calling ◽

Benchmark Dataset ◽

Whole Genome ◽

Human Cell Lines ◽

Short Read ◽

Benchmark Datasets

Constructed from the consensus of multiple variant callers based on short-read data, existing benchmark datasets for evaluating variant calling accuracy are biased toward easy regions accessible by known algorithms. We derived a new benchmark dataset from the de novo PacBio assemblies of two human cell lines that are homozygous across the whole genome. This benchmark provides a more accurate and less biased estimate of the error rate of small variant calls in a realistic context.

A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer

10.1101/009613 ◽

2014 ◽

Author(s):

Josh Quick ◽

Aaron Quinlan ◽

Nicholas Loman

Keyword(s):

Single Molecule ◽

De Novo ◽

Sequence Data ◽

Bacterial Genome ◽

Model Organism ◽

Variant Calling ◽

Laptop Computer ◽

Early Access ◽

Dna Strands ◽

K 12

Background: The MinION™ is a new, portable single-molecule sequencer developed by Oxford Nanopore Technologies. It measures four inches in length and is powered from the USB 3.0 port of a laptop computer. By measuring the change in current produced when DNA strands translocate through and interact with a charged protein nanopore the device is able to deduce the underlying nucleotide sequence. Findings: We present a read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 generated on a MinION™ device during the early-access MinION Access Program (MAP). Sequencing runs of the MinION™ are presented, one generated using R7 chemistry (released in July 2014) and one using R7.3 (released in September 2014). Conclusions: Base-called sequence data are provided to demonstrate the nature of data produced by the MinION™ platform and to encourage the development of customised methods for alignment, consensus and variant calling, de novo assembly and scaffolding. FAST5 files containing event data within the HDF5 container format are provided to assist with the development of improved base-calling methods. Datasets are provided through the GigaDB database at http://gigadb.org/dataset/100102

Detection of cytosine methylation in Burkholderia cenocepacia by single-molecule real-time sequencing and whole-genome bisulfite sequencing

Microbiology ◽

10.1099/mic.0.001027 ◽

2021 ◽

Author(s):

Ian Vandenbussche ◽

Andrea Sass ◽

Filip Van Nieuwerburgh ◽

Marta Pinto-Carbó ◽

Olga Mannweiler ◽

...

Keyword(s):

Gene Expression ◽

Single Molecule ◽

Type Species ◽

Cytosine Methylation ◽

Regulation Of Gene Expression ◽

Burkholderia Cenocepacia ◽

Smrt Sequencing ◽

Content Type ◽

Link Type ◽

Genome Bisulfite Sequencing

Research on prokaryotic epigenetics, the study of heritable changes in gene expression independent of sequence changes, led to the identification of DNA methylation as a versatile regulator of diverse cellular processes. Methylation of adenine bases is often linked to regulation of gene expression in bacteria, but cytosine methylation is also frequently observed. In this study, we present a complete overview of the cytosine methylome in Burkholderia cenocepacia, an opportunistic respiratory pathogen in cystic fibrosis patients. Single-molecule real-time (SMRT) sequencing was used to map all 4mC-modified cytosines, as analysis of the predicted MTases in the B. cenocepacia genome revealed the presence of a 4mC-specific phage MTase, M.BceJII, targeting GGCC sequences. Methylation motif GCGGCCGC was identified, and out of 6850 motifs detected across the genome, 2051 (29.9 %) were methylated at the fifth position. Whole-genome bisulfite sequencing (WGBS) was performed to map 5mC methylation and 1635 5mC-modified cytosines were identified in CpG motifs. A comparison of the genomic positions of the modified bases called by each method revealed no overlap, which confirmed the authenticity of the detected 4mC and 5mC methylation by SMRT sequencing and WGBS, respectively. Large inter-strain variation of the 4mC-methylated cytosines was observed when B. cenocepacia strains J2315 and K56-2 were compared, which suggests that GGCC methylation patterns in B. cenocepacia are strain-specific. It seems likely that 4mC methylation of GGCC is not involved in regulation of gene expression but rather is a remnant of bacteriophage invasion, in which methylation of the phage genome was crucial for protection against restriction-modification systems of B. cenocepacia.

Complete Genome Sequence of Pseudomonas aeruginosa Phage-Resistant Variant PA1RG

Genome Announcements ◽

10.1128/genomea.01761-15 ◽

2016 ◽

Vol 4 (1) ◽

Cited By ~ 4

Author(s):

Gang Li ◽

Shuguang Lu ◽

Mengyu Shen ◽

Shuai Le ◽

Yinling Tan ◽

...

Keyword(s):

Pseudomonas Aeruginosa ◽

Genome Sequence ◽

Single Molecule ◽

Complete Genome Sequence ◽

Complete Genome ◽

De Novo ◽

Smrt Sequencing ◽

Sequence Coverage ◽

Resistant Variant ◽

Defense Systems

Bacteria have evolved several defense systems against phage predation. Here, we report the 6,500,439-bp complete genome sequence of the Pseudomonas aeruginosa phage-resistant variant PA1RG. Single-molecule real-time (SMRT) sequencing and de novo assembly revealed a single contig with 320-fold sequence coverage.

Assessment of an organ-specific de novo transcriptome of the nematode trap-crop, Solanum sisymbriifolium

10.1101/256065 ◽

2018 ◽

Author(s):

Alexander Q Wixom ◽

N Carol Casavant ◽

Joseph C Kuhl ◽

Fangming Xiao ◽

Louise-Marie Dandurand ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Globodera Pallida ◽

Smrt Sequencing ◽

Trap Crop ◽

Site Analysis ◽

Organ Specific ◽

Disproportionate Number ◽

Single Time Point ◽

Solanum Sisymbriifolium

Abstract. Solanum sisymbriifolium, also known as “Litchi Tomato” or “Sticky Nightshade,” is an undomesticated and poorly researched plant related to potato and tomato. Unlike the latter species, S. sisymbriifolium induces eggs of the cyst nematode, Globodera pallida, to hatch and migrate into its roots, but then arrests further nematode maturation. In order to provide researchers with a partial blueprint of its genetic make-up so that the mechanism of this response might be identified, we used single molecule real time (SMRT) sequencing to compile a high quality de novo transcriptome of 41,189 unigenes drawn from individually sequenced bud, root, stem, and leaf RNA populations. Functional annotation and BUSCO analysis showed that this transcriptome was surprisingly complete, even though it represented genes expressed at a single time point. By sequencing the 4 organ libraries separately, we found we could get a reliable snapshot of transcript distributions in each organ. A divergent site analysis of the merged transcriptome indicated that this species might have undergone a recent genome duplication and re-diploidization. Further analysis indicated that the plant then retained a disproportionate number of genes associated with photosynthesis and amino acid metabolism in comparison to genes with characteristics of R-proteins or involved in secondary metabolism. The former processes may have given S. sisymbriifolium a bigger competitive advantage than the latter did.

Virtual Genome Walking: Generating gene models for the salamander Ambystoma mexicanum

10.1101/185157 ◽

2017 ◽

Cited By ~ 1

Author(s):

Teri Evans ◽

Andrew Johnson ◽

Matt Loose

Keyword(s):

De Novo ◽

Genomic Sequence ◽

Artificial Chromosome ◽

Ambystoma Mexicanum ◽

Genome Walking ◽

Model Systems ◽

Link Type ◽

Repeat Content ◽

Gene Models ◽

Low Coverage

Abstract. Large repeat-rich genomes present challenges for assembly and identification of gene models with short-read technologies. Here we present a method we call Virtual Genome Walking, which uses an iterative assembly approach to first identify exons from de novo assembled transcripts and then assemble whole genome reads against each exon. This process is iterated, allowing the extension of exons. These linked assemblies are refined to generate gene models including upstream and downstream genomic sequence as well as intronic sequence. We test this method using a 20X genomic read set for the axolotl, the genome of which is estimated to be 30 Gb in size. These reads were previously reported to be effectively impossible to assemble. Here we provide almost 1 Gb of assembled sequence describing over 19,000 gene models for the axolotl. Gene models stop assembling either due to localised low coverage in the genomic reads or due to the presence of repeats. We validate our observations by comparison with previously published axolotl bacterial artificial chromosome (BAC) sequences. In addition, we analysed axolotl intron length, intron-exon structure, repeat content and synteny. These gene models, sequences and annotations are freely available for download from https://tinyurl.com/y8gydc6n. The software pipeline, including a docker image, is available from https://github.com/LooseLab/iterassemble. These methods will increase the value of low coverage sequencing of understudied model systems.
