SVCurator: A Crowdsourcing app to visualize evidence of structural variants for the human genome

Mapping Intimacies ◽

10.1101/581264 ◽

2019 ◽

Cited By ~ 3

Author(s):

Lesley M Chapman ◽

Noah Spies ◽

Patrick Pai ◽

Chun Shen Lim ◽

Andrew Carroll ◽

...

Keyword(s):

Human Genome ◽

Reference Genome ◽

Ashkenazi Jewish ◽

Structural Variants ◽

Sequencing Data ◽

Sequencing Technologies ◽

Size Accuracy ◽

Large Indels ◽

Web Platform ◽

Reference Samples

Abstract A high-quality benchmark for small variants encompassing 88 to 90% of the reference genome has been developed for seven Genome in a Bottle (GIAB) reference samples. However, a reliable benchmark for large indels and structural variants (SVs) is yet to be defined. In this study, we manually curated 1235 SVs, which can ultimately be used to evaluate SV callers or train machine learning models. We developed a crowdsourcing app, SVCurator, to help curators manually review large indels and SVs within the human genome and report their genotype and size accuracy. SVCurator is a Python Flask-based web platform that displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002]. We asked curators to assign labels describing SV type (deletion or insertion), size accuracy, and genotype for 1235 putative insertions and deletions sampled from different size bins between 20 and 892,149 bp. The crowdsourced results were highly concordant, with 37 out of the 61 curators having at least 78% concordance with a set of ‘expert’ curators, where there was 93% concordance amongst ‘expert’ curators. This produced high confidence labels for 935 events. When compared to the heuristic-based draft benchmark SV callset from GIAB, the SVCurator crowdsourced labels were 94.5% concordant with the benchmark set. We found that curators can successfully evaluate putative SVs when given evidence from multiple sequencing technologies.
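
The concordance figures in this abstract can be read as the fraction of events on which two label sets agree. The sketch below is one simple way to compute such a score. It is not SVCurator's code: the function name, event IDs, and genotype labels are hypothetical, and the abstract does not say how the published percentages were calculated.

```python
from typing import Dict


def concordance(curator: Dict[str, str], expert: Dict[str, str]) -> float:
    """Fraction of shared events where the curator label equals the expert label."""
    shared = curator.keys() & expert.keys()
    if not shared:
        raise ValueError("no events in common")
    agree = sum(1 for event in shared if curator[event] == expert[event])
    return agree / len(shared)


# Hypothetical event IDs and genotype labels, for illustration only.
expert_labels = {"sv1": "het", "sv2": "hom_var", "sv3": "hom_ref", "sv4": "het"}
curator_labels = {"sv1": "het", "sv2": "hom_var", "sv3": "het", "sv4": "het"}

score = concordance(curator_labels, expert_labels)
print(f"{score:.0%}")  # 3 of 4 labels agree
```

A curator would then pass a threshold such as the 78% mentioned above if `score >= 0.78`.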

The design and construction of reference pangenome graphs with minigraph

Genome Biology ◽

10.1186/s13059-020-02168-z ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 7

Author(s):

Heng Li ◽

Xiaowen Feng ◽

Chong Chu

Keyword(s):

Data Model ◽

Reference Genome ◽

Structural Variants ◽

Current Reference ◽

Sequencing Technologies ◽

Recent Advances ◽

Multiple Genomes ◽

Design And Construction

Abstract The recent advances in sequencing technologies enable the assembly of individual genomes to the quality of the reference genome. How to integrate multiple genomes from the same species and make the integrated representation accessible to biologists remains an open challenge. Here, we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinate of the linear reference genome. We implement our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.
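
To illustrate what "preserving the coordinate of the linear reference genome" can look like in practice, the sketch below parses one segment line of a reference GFA (rGFA) file. The abstract does not name the format, so this rests on an assumption: that each segment carries the tags `SN` (stable sequence name), `SO` (offset on that sequence), and `SR` (rank), as I understand the rGFA format distributed with minigraph. The `Segment` class, `parse_segment` function, and example line are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    seg_id: str
    sequence: str
    stable_name: str    # SN tag: name of the sequence the segment comes from
    stable_offset: int  # SO tag: 0-based offset on that sequence
    rank: int           # SR tag: 0 is assumed to mean the linear reference


def parse_segment(line: str) -> Segment:
    """Parse a tab-separated 'S' line with SN, SO, and SR tags."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "S":
        raise ValueError("not a segment line")
    tags = {}
    for field in fields[3:]:
        key, _type, value = field.split(":", 2)
        tags[key] = value
    return Segment(fields[1], fields[2], tags["SN"], int(tags["SO"]), int(tags["SR"]))


# Made-up example line; the values are illustrative only.
line = "S\ts1\tACGTAC\tSN:Z:chr1\tSO:i:1000\tSR:i:0"
seg = parse_segment(line)
start, end = seg.stable_offset, seg.stable_offset + len(seg.sequence)
print(seg.stable_name, start, end)
```

Because every segment records where it sits on a named sequence, a position in the graph can be reported in the familiar linear coordinates.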

NextSV: a meta-caller for structural variants from low-coverage long-read sequencing data

10.1101/092544 ◽

2016 ◽

Author(s):

Li Fang ◽

Jiang Hu ◽

Depeng Wang ◽

Kai Wang

Keyword(s):

Whole Genome ◽

Ashkenazi Jewish ◽

Structural Variants ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Human Genomes ◽

Long Read ◽

Personal Genomes ◽

Low Coverage

Abstract Background Structural variants (SVs) in human genomes are implicated in a variety of human diseases. Long-read sequencing delivers much longer read lengths than short-read sequencing and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, it is unclear what coverage is needed and how to optimally use the aligners and SV callers. Results In this study, we developed NextSV, a meta-caller to perform SV calling from low coverage long-read sequencing data. NextSV integrates three aligners and three SV callers and generates two integrated call sets (sensitive/stringent) for different analysis purposes. We evaluated SV calling performance of NextSV under different PacBio coverages on two personal genomes, NA12878 and HX1. Our results showed that, compared with running any single SV caller, NextSV stringent call set had higher precision and balanced accuracy (F1 score) while NextSV sensitive call set had a higher recall. At 10X coverage, the recall of NextSV sensitive call set was 93.5% to 94.1% for deletions and 87.9% to 93.2% for insertions, indicating that ~10X coverage might be an optimal coverage to use in practice, considering the balance between the sequencing costs and the recall rates. We further evaluated the Mendelian errors on an Ashkenazi Jewish trio dataset. Conclusions Our results provide useful guidelines for SV detection from low coverage whole-genome PacBio data and we expect that NextSV will facilitate the analysis of SVs on long-read sequencing data.
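
The trade-off between the sensitive and stringent call sets can be illustrated with precision, recall, and F1. The sketch below makes two assumptions the abstract does not confirm: the sensitive set is modeled as the union of two callers' calls and the stringent set as their intersection. It also treats calls as exact-match IDs, whereas real SV comparison needs breakpoint tolerance. All call names are hypothetical.

```python
from typing import Set, Tuple


def precision_recall_f1(calls: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for a call set against a truth set."""
    true_pos = len(calls & truth)
    precision = true_pos / len(calls) if calls else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical truth set and caller outputs, for illustration only.
truth = {"del1", "del2", "del3", "ins1"}
caller_a = {"del1", "del2", "ins1", "fp1"}
caller_b = {"del1", "del2", "del3", "fp2"}

sensitive = caller_a | caller_b  # assumed: union of callers
stringent = caller_a & caller_b  # assumed: intersection of callers

sens_p, sens_r, sens_f1 = precision_recall_f1(sensitive, truth)
strict_p, strict_r, strict_f1 = precision_recall_f1(stringent, truth)
print(f"sensitive: P={sens_p:.2f} R={sens_r:.2f} F1={sens_f1:.2f}")
print(f"stringent: P={strict_p:.2f} R={strict_r:.2f} F1={strict_f1:.2f}")
```

In this toy data the union recovers every true event at the cost of two false positives, while the intersection has no false positives but misses half the truth set.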

Extensive sequencing of seven human genomes to characterize benchmark reference materials

10.1101/026468 ◽

2015 ◽

Cited By ~ 9

Author(s):

Justin M Zook ◽

David Catoe ◽

Jennifer McDaniel ◽

Lindsay Vang ◽

Noah Spies ◽

...

Keyword(s):

Human Genome ◽

Reference Materials ◽

De Novo ◽

Variant Calling ◽

Genome Project ◽

Genome Comparison ◽

Personal Genome ◽

Sequencing Data ◽

Sequencing Technologies ◽

Human Genomes

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.

2-kupl: mapping-free variant detection from DNA-seq data of matched samples

BMC Bioinformatics ◽

10.1186/s12859-021-04185-6 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yunfeng Wang ◽

Haoliang Xue ◽

Christine Pourcel ◽

Yang Du ◽

Daniel Gautheret

Keyword(s):

Dna Sequences ◽

Reference Genome ◽

Point Mutations ◽

Variant Calling ◽

Low Complexity ◽

Structural Variants ◽

Sequencing Data ◽

Bacterial Strains ◽

Two Samples ◽

Variant Detection

Abstract Background The detection of genome variants, including point mutations, indels and structural variants, is a fundamental and challenging computational problem. We address here the problem of variant detection between two deep-sequencing (DNA-seq) samples, such as two human samples from an individual patient, or two samples from distinct bacterial strains. The preferred strategy in such a case is to align each sample to a common reference genome, collect all variants and compare these variants between samples. Such mapping-based protocols have several limitations. DNA sequences with large indels, aggregated mutations and structural variants are hard to map to the reference. Furthermore, DNA sequences cannot be mapped reliably to genomic low complexity regions and repeats. Results We introduce 2-kupl, a k-mer based, mapping-free protocol to detect variants between two DNA-seq samples. On simulated and actual data, 2-kupl achieves higher accuracy than other mapping-free protocols. Applying 2-kupl to prostate cancer whole exome sequencing data, we identify a number of candidate variants in hard-to-map regions and propose potential novel recurrent variants in this disease. Conclusions We developed a mapping-free protocol for variant calling between matched DNA-seq samples. Our protocol is suitable for variant detection in unmappable genome regions or in the absence of a reference genome.
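
The core idea of a k-mer based, mapping-free comparison can be shown in a few lines: collect the k-mers of each sample and keep those found in one but not the other. This is a toy illustration, not the 2-kupl algorithm. It ignores k-mer counts, reverse complements, sequencing-error filtering, and any later assembly of k-mers into variant contigs, and the function names and reads are hypothetical.

```python
from typing import Iterable, Set


def kmers(reads: Iterable[str], k: int) -> Set[str]:
    """Collect every substring of length k from the given reads."""
    found = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            found.add(read[i:i + k])
    return found


def sample_specific_kmers(case_reads: Iterable[str],
                          control_reads: Iterable[str],
                          k: int) -> Set[str]:
    """k-mers present in the case sample and absent from the control."""
    return kmers(case_reads, k) - kmers(control_reads, k)


# Hypothetical reads: the case read differs from the control by one substitution.
control = ["ACGTACGTAC"]
case = ["ACGTTCGTAC"]

specific = sample_specific_kmers(case, control, k=4)
print(sorted(specific))
```

A single substitution produces up to k case-specific k-mers, all overlapping the changed base, and no reference genome is consulted at any point.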

Tools and best practices for retrotransposon analysis using high-throughput sequencing data

Mobile DNA ◽

10.1186/s13100-019-0192-1 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 4

Author(s):

Aurélie Teissandier ◽

Nicolas Servant ◽

Emmanuel Barillot ◽

Deborah Bourc’his

Keyword(s):

Transposable Elements ◽

Transposable Element ◽

Molecular Mechanisms ◽

High Throughput Sequencing ◽

Reference Genome ◽

Repetitive Sequences ◽

Simulated Data ◽

Sequencing Data ◽

Sequencing Technologies ◽

Human Genomes

Abstract Background Sequencing technologies give access to a precise picture of the molecular mechanisms acting upon genome regulation. One of the biggest technical challenges with sequencing data is to map millions of reads to a reference genome. This problem is exacerbated when dealing with repetitive sequences such as transposable elements that occupy half of the mammalian genome mass. Sequenced reads coming from these regions introduce ambiguities in the mapping step. Therefore, applying dedicated parameters and algorithms has to be taken into consideration when transposable element regulation is investigated with sequencing datasets. Results Here, we used simulated reads on the mouse and human genomes to define the best parameters for aligning transposable element-derived reads on a reference genome. The efficiency of the most commonly used aligners was compared and we further evaluated how transposable element representation should be estimated using available methods. The mappability of the different transposon families in the mouse and the human genomes was calculated, giving an overview of their evolution. Conclusions Based on simulated data, we provided recommendations on the alignment and the quantification steps to be performed when transposon expression or regulation is studied, and identified the limits in detecting specific young transposon families of the mouse and human genomes. These principles may help the community to adopt standard procedures and raise awareness of the difficulties encountered in the study of transposable elements.

Robust Cancer Mutation Detection with Deep Learning Models Derived from Tumor-Normal Sequencing Data

10.1101/667261 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sayed Mohammad Ebrahim Sahraeian ◽

Li Tai Fang ◽

Marghoob Mohiyuddin ◽

Huixiao Hong ◽

Wenming Xiao

Keyword(s):

Deep Learning ◽

Somatic Mutations ◽

Mutation Detection ◽

Sequencing Data ◽

Target Sequencing ◽

Sequencing Technologies ◽

Cancer Mutation ◽

Detection Approach ◽

Genomic Regions ◽

Reference Samples

Abstract Accurate detection of somatic mutations is challenging but critical to the understanding of cancer formation, progression, and treatment. We recently proposed NeuSomatic, the first deep convolutional neural network-based somatic mutation detection approach, and demonstrated performance advantages on in silico data. In this study, we used the first comprehensive and well-characterized somatic reference samples from the SEQC-II consortium to investigate best practices for utilizing a deep learning framework in cancer mutation detection. Using the high-confidence somatic mutations established for these reference samples by the consortium, we identified strategies for building robust models on multiple datasets derived from samples representing real scenarios. The proposed strategies achieved high robustness across multiple sequencing technologies such as WGS, WES, AmpliSeq target sequencing for fresh and FFPE DNA input, varying tumor/normal purities, and different coverages (ranging from 10× to 2000×). NeuSomatic significantly outperformed conventional detection approaches in general, as well as in challenging situations such as low coverage, low mutation frequency, DNA damage, and difficult genomic regions.

Chromosome-level assembly of the mustache toad genome using third-generation DNA sequencing and Hi-C analysis

GigaScience ◽

10.1093/gigascience/giz114 ◽

2019 ◽

Vol 8 (9) ◽

Cited By ~ 7

Author(s):

Yongxin Li ◽

Yandong Ren ◽

Dongru Zhang ◽

Hui Jiang ◽

Zhongkai Wang ◽

...

Keyword(s):

Breeding Season ◽

Reference Genome ◽

Gene Families ◽

Sequencing Data ◽

High Quality ◽

Chromosome Conformation ◽

Functional Studies ◽

Sequencing Technologies ◽

A Genome ◽

Chromosome Level

Abstract Background The mustache toad, Vibrissaphora ailaonica, is endemic to China and belongs to the Megophryidae family. Like other mustache toad species, V. ailaonica males temporarily develop keratinized nuptial spines on their upper jaw during each breeding season, which fall off at the end of the breeding season. This feature is likely the result of the reversal of sexual dimorphism in body size, with males being larger than females. A high-quality reference genome for the mustache toad would be invaluable to investigate the genetic mechanism underlying these repeatedly developing keratinized spines. Findings To construct the mustache toad genome, we generated 225 Gb of short reads and 277 Gb of long reads using Illumina and Pacific Biosciences (PacBio) sequencing technologies, respectively. Sequencing data were assembled into a 3.53-Gb genome assembly, with a contig N50 length of 821 kb. We also used high-throughput chromosome conformation capture (Hi-C) technology to identify contacts between contigs, then assembled contigs into scaffolds and assembled a genome with 13 chromosomes and a scaffold N50 length of 412.42 Mb. Based on the 26,227 protein-coding genes annotated in the genome, we analyzed phylogenetic relationships between the mustache toad and other chordate species. The mustache toad has a relatively higher evolutionary rate and separated from a common ancestor of the marine toad, bullfrog, and Tibetan frog 206.1 million years ago. Furthermore, we identified 201 expanded gene families in the mustache toad, which were mainly enriched in immune pathway, keratin filament, and metabolic processes. Conclusions Using Illumina, PacBio, and Hi-C technologies, we constructed the first high-quality chromosome-level mustache toad genome. This work not only offers a valuable reference genome for functional studies of mustache toad traits but also provides important chromosomal information for wider genome comparisons.
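
The contig and scaffold N50 values quoted in this abstract follow the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of that calculation follows. The function name and the toy lengths are illustrative, not data from the paper.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the total."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy contig lengths (arbitrary units), for illustration only.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))  # total is 290; 80 + 70 = 150 first reaches half (145)
```

Joining contigs into longer scaffolds, as the Hi-C step does, raises N50 without adding any sequence, which is why the scaffold N50 can be far larger than the contig N50.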

Chromosome-scale assembly comparison of the Korean Reference Genome KOREF from PromethION and PacBio with Hi-C mapping information

10.1101/674804 ◽

2019 ◽

Cited By ~ 2

Author(s):

Hui-Su Kim ◽

Sungwon Jeon ◽

Changjae Kim ◽

Yeon Kyung Kim ◽

Yun Sung Cho ◽

...

Keyword(s):

Human Genome ◽

Single Molecule ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Low Cost ◽

Cost Effective ◽

Sequencing Data ◽

Smrt Sequencing ◽

Human Genome Assembly

Abstract Background Long DNA reads produced by single molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short read DNA fragments. For de novo assembly, PacBio and Oxford Nanopore Technologies (ONT) are favorite options. However, PacBio’s SMRT sequencing is expensive for a full human genome assembly and costs over 40,000 USD for 30x coverage as of 2019. ONT PromethION sequencing, on the other hand, is one-twelfth the price of PacBio for the same coverage. This study aimed to compare the cost-effectiveness of ONT PromethION and PacBio’s SMRT sequencing in relation to the quality. Findings We performed whole genome de novo assemblies and comparison to construct an improved version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64x coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with N50s of 16.7 Mbp and a total genome length of 2.8 Gbp. It was comparable to a KOREF assembly constructed using PacBio at 62x coverage (188 Gbp, 2,695 contigs and N50s of 17.9 Mbp). When we applied Hi-C-derived long-range mapping data, an even higher quality assembly for the 64x coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mbp. Conclusion The pore-based PromethION approach provides a good quality chromosome-scale human genome assembly at a low cost with long maximum contig and scaffold lengths and is more cost-effective than PacBio at comparable quality measurements.
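
The coverage and cost figures in this abstract can be sanity-checked with simple arithmetic, assuming coverage is defined as total sequenced bases divided by genome size. The sketch below uses only numbers from the abstract. The function name is hypothetical, and because the PacBio price is stated as "over 40,000 USD", the derived ONT figure is a rough lower bound rather than a quoted price.

```python
def implied_genome_size_gb(total_bases_gb: float, coverage: float) -> float:
    """Genome size implied by coverage = total bases / genome size."""
    return total_bases_gb / coverage


promethion_size = implied_genome_size_gb(193, 64)  # 193 Gb at 64x
pacbio_size = implied_genome_size_gb(188, 62)      # 188 Gb at 62x

# PacBio: over 40,000 USD for 30x; ONT is stated as one-twelfth of that price.
pacbio_cost_30x = 40_000
ont_cost_30x = pacbio_cost_30x / 12

print(f"PromethION implies {promethion_size:.2f} Gb, PacBio implies {pacbio_size:.2f} Gb")
print(f"ONT at 30x: roughly {ont_cost_30x:,.0f} USD or more")
```

Both datasets imply a genome of about 3 Gb, consistent with a human genome and slightly above the 2.8 Gbp assembled length reported for the PromethION assembly.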

The genomic landscape of polymorphic human nuclear mitochondrial insertions

10.1101/008144 ◽

2014 ◽

Author(s):

Gargi Dayama ◽

Sarah B Emery ◽

Jeffrey M Kidd ◽

Ryan E Mills

Keyword(s):

High Throughput Sequencing ◽

Reference Genome ◽

Current Knowledge ◽

Genetic Material ◽

Purifying Selection ◽

Genomic Variation ◽

Human Populations ◽

Sequencing Data ◽

Sequencing Technologies ◽

D Loop

The transfer of mitochondrial genetic material into the nuclear genomes of eukaryotes is a well-established phenomenon. Many studies over the past decade have utilized reference genome sequences of numerous species to characterize the prevalence and contribution of nuclear mitochondrial insertions to human diseases. The recent advancement of high throughput sequencing technologies has enabled the interrogation of genomic variation at a much finer scale, and now allows for an exploration into the diversity of polymorphic nuclear mitochondrial insertions (NumtS) in human populations. We have developed an approach to discover and genotype previously undiscovered Numt insertions using whole genome, paired-end sequencing data. We have applied this method to almost a thousand individuals in twenty populations from the 1000 Genomes Project and other data sets and identified 138 novel sites of Numt insertions, extending our current knowledge of existing Numt locations in the human genome by almost 20%. Most of the newly identified NumtS were found in less than 1% of the samples we examined, suggesting that they occur infrequently in nature or have been rapidly removed by purifying selection. We find that recent Numt insertions are derived from throughout the mitochondrial genome, including the D-loop, and have integration biases consistent with previous studies on older, fixed NumtS in the reference genome. We have further determined the complete inserted sequence for a subset of these events to define their age and origin of insertion as well as their potential impact on studies of mitochondrial heteroplasmy.

Chromosome-scale assembly comparison of the Korean Reference Genome KOREF from PromethION and PacBio with Hi-C mapping information

GigaScience ◽

10.1093/gigascience/giz125 ◽

2019 ◽

Vol 8 (12) ◽

Cited By ~ 6

Author(s):

Hui-Su Kim ◽

Sungwon Jeon ◽

Changjae Kim ◽

Yeon Kyung Kim ◽

Yun Sung Cho ◽

...

Keyword(s):

Human Genome ◽

Single Molecule ◽

Genome Assembly ◽

Reference Genome ◽

De Novo ◽

Low Cost ◽

Cost Effective ◽

Sequencing Data ◽

Smrt Sequencing ◽

Human Genome Assembly

Abstract Background Long DNA reads produced by single-molecule and pore-based sequencers are more suitable for assembly and structural variation discovery than short-read DNA fragments. For de novo assembly, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are the favorite options. However, PacBio's SMRT sequencing is expensive for a full human genome assembly and costs more than $40,000 US for 30× coverage as of 2019. ONT PromethION sequencing, on the other hand, is 1/12 the price of PacBio for the same coverage. This study aimed to compare the cost-effectiveness of ONT PromethION and PacBio's SMRT sequencing in relation to the quality. Findings We performed whole-genome de novo assemblies and comparison to construct an improved version of KOREF, the Korean reference genome, using sequencing data produced by PromethION and PacBio. With PromethION, an assembly using sequenced reads with 64× coverage (193 Gb, 3 flowcell sequencing) resulted in 3,725 contigs with N50s of 16.7 Mb and a total genome length of 2.8 Gb. It was comparable to a KOREF assembly constructed using PacBio at 62× coverage (188 Gb, 2,695 contigs, and N50s of 17.9 Mb). When we applied Hi-C–derived long-range mapping data, an even higher quality assembly for the 64× coverage was achieved, resulting in 3,179 scaffolds with an N50 of 56.4 Mb. Conclusion The pore-based PromethION approach provided a high-quality chromosome-scale human genome assembly at a low cost with long maximum contig and scaffold lengths and was more cost-effective than PacBio at comparable quality measurements.
