MoMI-G: Modular Multi-scale Integrated Genome Graph Browser

DOI: 10.1101/540120, 2019, Cited By ~ 1

Author(s): Toshiyuki T. Yokoyama, Yoshitaka Sakamoto, Masahide Seki, Yutaka Suzuki, Masahiro Kasahara

Keyword(s): Whole Genome, Structural Variants, Nucleotide Level, Web Based, Multi Scale, Manual Inspection, Long Reads, Long Read, Visualization Tools, Genome Graph

Abstract: Long-read sequencing allows more sensitive and accurate discovery of structural variants (SVs). While more and more SVs are being identified, a number of them are difficult to visualize using existing SV visualization tools. Therefore, methods to visualize SVs such as nested SVs or large SVs of over a megabase pair need to be developed. To this end, we developed the MOdular Multi-scale Integrated Genome graph browser, MoMI-G, a web-based genome browser that visualizes SVs, genes, repeats, and other annotations as a variation graph with paths. This browser allows more intuitive recognition of large, nested, and potentially more complex SVs. MoMI-G has view modules for different scales, which allow users to view the whole genome down to nucleotide-level alignments of long reads. Alignments spanning reference alleles and those spanning alternative alleles are shown in the same view. Users can customize the view if they are not satisfied with the preset views. In addition, MoMI-G has Interval Card Deck, a feature for rapid manual inspection of hundreds of SVs. Herein, we describe the utility of MoMI-G by using representative examples of large and nested SVs found in two cell lines, LC-2/ad and CHM1. MoMI-G is freely available at https://github.com/MoMI-G/MoMI-G under the MIT license.


MoMI-G: modular multi-scale integrated genome graph browser

BMC Bioinformatics, DOI: 10.1186/s12859-019-3145-2, 2019, Vol 20 (1), Cited By ~ 3

Author(s): Toshiyuki T. Yokoyama, Yoshitaka Sakamoto, Masahide Seki, Yutaka Suzuki, Masahiro Kasahara

Keyword(s): Human Cancer, Read Depth, Structural Variants, Structural Variations, Multi Scale, Long Reads, A Genome, Long Read, Complex Structural, Genome Graph

Abstract

Background: The genome graph is an emerging approach for representing structural variants on genomes with branches. For example, representing structural variants of cancer genomes as a genome graph is more natural than representing such genomes as differences from the linear reference genome. While more and more structural variants are being identified by long-read sequencing, many of them are difficult to visualize using existing structural variant visualization tools. A visualization method for large genome graphs, such as human cancer genome graphs, is therefore needed.

Results: We developed the MOdular Multi-scale Integrated Genome graph browser, MoMI-G, a web-based genome graph browser that can visualize genome graphs with structural variants and supporting evidence such as read alignments, read depth, and annotations. This browser allows more intuitive recognition of large, nested, and potentially more complex structural variations. MoMI-G has view modules for different scales, which allow users to view the whole genome down to nucleotide-level alignments of long reads. Alignments spanning reference alleles and those spanning alternative alleles are shown in the same view. Users can customize the view if they are not satisfied with the preset views. In addition, MoMI-G has Interval Card Deck, a feature for rapid manual inspection of hundreds of structural variants. Herein, we describe the utility of MoMI-G by using representative examples of large and nested structural variations found in two cell lines, LC-2/ad and CHM1.

Conclusions: Users can inspect complex and large structural variations found by long-read analysis in large genomes such as human genomes more smoothly and more intuitively. In addition, users can easily filter out false positives by manually inspecting hundreds of identified structural variants with supporting long-read alignments and annotations in a short time.

Software availability: MoMI-G is freely available at https://github.com/MoMI-G/MoMI-G under the MIT license.
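To make the idea of a genome graph with paths concrete, here is a minimal Python sketch of a toy variation graph. It is not MoMI-G's data model or file format; the class name `VariationGraph`, its fields, and the three short sequences are all hypothetical, chosen only to show how a deletion appears as an alternative path that skips a node.

```python
from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    """Toy variation graph: sequence nodes, directed edges, and named paths."""
    nodes: dict[int, str] = field(default_factory=dict)        # node id -> sequence
    edges: set[tuple[int, int]] = field(default_factory=set)   # (from id, to id)
    paths: dict[str, list[int]] = field(default_factory=dict)  # path name -> node ids

    def add_path(self, name: str, node_ids: list[int]) -> None:
        # Every consecutive pair of nodes on a path implies an edge.
        for a, b in zip(node_ids, node_ids[1:]):
            self.edges.add((a, b))
        self.paths[name] = node_ids

    def sequence(self, name: str) -> str:
        # Spell out the sequence obtained by walking the named path.
        return "".join(self.nodes[n] for n in self.paths[name])


# Hypothetical example: node 2 is deleted in the alternative allele.
graph = VariationGraph(nodes={1: "ACGT", 2: "TTAGG", 3: "CCA"})
graph.add_path("reference", [1, 2, 3])
graph.add_path("deletion_allele", [1, 3])

print(graph.sequence("reference"))        # ACGTTTAGGCCA
print(graph.sequence("deletion_allele"))  # ACGTCCA
```

In this representation the reference and the variant allele share nodes 1 and 3, so reads supporting either allele can be drawn against the same graph, which is the property the abstract describes as showing both kinds of alignment in one view.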


Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap

DOI: 10.1101/020719, 2015, Cited By ~ 1

Author(s): Ivan Sovic, Mile Sikic, Andreas Wilm, Shannon Nicole Fenlon, Swaine Chen, ...

Keyword(s): Human Genome, Variant Calling, Error Rates, Nanopore Sequencing, Structural Variants, Specific Identification, Long Reads, Long Read, Specific Error, Very High

Exploiting the power of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. We present the first nanopore read mapper (GraphMap) that uses a read-funneling paradigm to robustly handle variable error rates and fast graph traversal to align long reads with speed and very high precision (>95%). Evaluation on MinION sequencing datasets against short and long-read mappers indicates that GraphMap increases mapping sensitivity by at least 15-80%. GraphMap alignments are the first to demonstrate consensus calling with <1 error in 100,000 bases, variant calling on the human genome with 76% improvement in sensitivity over the next best mapper (BWA-MEM), precise detection of structural variants from 100bp to 4kbp in length and species and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.


StrVCTVRE: A supervised learning method to predict the pathogenicity of human structural variants

DOI: 10.1101/2020.05.15.097048, 2020

Author(s): Andrew G. Sharo, Zhiqiang Hu, Steven E. Brenner

Keyword(s): Whole Genome Sequencing, Genome Sequencing, Diagnostic Methods, Training Dataset, Disease Genes, Whole Genome, Structural Variants, Coding Region, Diagnostic Potential, Long Read

Abstract: Whole genome sequencing resolves clinical cases where standard diagnostic methods have failed. However, preliminary studies show that at least half of these cases still remain unresolved, even after whole genome sequencing. Structural variants (genomic variants larger than 50 base pairs) of uncertain significance may be the genetic cause of a portion of these unresolved cases. Historically, structural variants (SVs) have been difficult to detect with confidence from short-read sequencing. As both detection algorithms and long-read/linked-read sequencing methods become more accessible, clinical researchers will have access to thousands of reliable SVs of unknown disease relevance. Filtering these SVs by overlap with cataloged SVs is an imperfect solution. Innovative methods to predict the pathogenicity of these SVs will be needed to realize the full diagnostic potential of long-read sequencing. To address this emerging need, we developed StrVCTVRE (Structural Variant Classifier Trained on Variants Rare and Exonic), a classifier that can be used to distinguish pathogenic SVs from benign SVs that overlap exons. We made use of features that capture gene importance, coding region, conservation, expression, and exon structure in a random forest classifier. We found that some features, such as expression and conservation, are important but are absent from SV classification guidelines. Although databases of SVs reflect size biases from sequencing techniques, we leveraged multiple databases to construct a size-matched training set of rare, putatively benign and pathogenic SVs. In independent test sets, we found our method performs accurately across a wide SV size range, which will allow clinical researchers to eliminate nearly 60% of SVs from consideration at an elevated sensitivity of 90%. However, our method and its assessment are still constrained by a small training dataset and acquisition bias in databases of pathogenic variants.
StrVCTVRE fills an empty niche in the clinical evaluation of SVs of unknown significance. We anticipate researchers will use it to prioritize SVs in patients where no variant is immediately compelling, empowering deeper investigation into novel SVs and disease genes to resolve cases.


NextSV: a meta-caller for structural variants from low-coverage long-read sequencing data

DOI: 10.1101/092544, 2016

Author(s): Li Fang, Jiang Hu, Depeng Wang, Kai Wang

Keyword(s): Whole Genome, Ashkenazi Jewish, Structural Variants, Sequencing Data, Short Read, Short Read Sequencing, Human Genomes, Long Read, Personal Genomes, Low Coverage

Abstract

Background: Structural variants (SVs) in human genomes are implicated in a variety of human diseases. Long-read sequencing delivers much longer read lengths than short-read sequencing and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, it is unclear what coverage is needed and how to optimally use the aligners and SV callers.

Results: In this study, we developed NextSV, a meta-caller to perform SV calling from low coverage long-read sequencing data. NextSV integrates three aligners and three SV callers and generates two integrated call sets (sensitive/stringent) for different analysis purposes. We evaluated SV calling performance of NextSV under different PacBio coverages on two personal genomes, NA12878 and HX1. Our results showed that, compared with running any single SV caller, the NextSV stringent call set had higher precision and balanced accuracy (F1 score) while the NextSV sensitive call set had a higher recall. At 10X coverage, the recall of the NextSV sensitive call set was 93.5% to 94.1% for deletions and 87.9% to 93.2% for insertions, indicating that ~10X coverage might be an optimal coverage to use in practice, considering the balance between the sequencing costs and the recall rates. We further evaluated the Mendelian errors on an Ashkenazi Jewish trio dataset.

Conclusions: Our results provide useful guidelines for SV detection from low coverage whole-genome PacBio data and we expect that NextSV will facilitate the analysis of SVs on long-read sequencing data.
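The precision, recall, and F1 figures quoted for the call sets follow the standard definitions, which the short Python sketch below spells out. The function name and the counts in the example are made up for illustration and are not taken from the NextSV evaluation.

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall, F1) for a call set scored against a truth set."""
    if true_pos == 0:
        # No correct calls: all three metrics are zero by convention here.
        return 0.0, 0.0, 0.0
    precision = true_pos / (true_pos + false_pos)  # fraction of calls that are real
    recall = true_pos / (true_pos + false_neg)     # fraction of real SVs that were called
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Hypothetical counts: 90 correct calls, 10 spurious calls, 30 missed SVs.
precision, recall, f1 = precision_recall_f1(true_pos=90, false_pos=10, false_neg=30)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

A stringent call set trades recall for precision and a sensitive one does the reverse, which is why the abstract reports the two sets separately.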


Highly-accurate long-read sequencing improves variant detection and assembly of a human genome

DOI: 10.1101/519025, 2019, Cited By ~ 27

Author(s): Aaron M. Wenger, Paul Peluso, William J. Rowell, Pi-Chuan Chang, Richard J. Hall, ...

Keyword(s): Single Molecule, De Novo, Structural Variants, Short Reads, Sequencing Technologies, Long Reads, Long Read, Variant Detection, High Quality Genome, Circular Consensus Sequencing

Abstract: The major DNA sequencing technologies in use today produce either highly-accurate short reads or noisy long reads. We developed a protocol based on single-molecule, circular consensus sequencing (CCS) to generate highly-accurate (99.8%) long reads averaging 13.5 kb and applied it to sequence the well-characterized human HG002/NA24385. We optimized existing tools to comprehensively detect variants, achieving precision and recall above 99.91% for SNVs, 95.98% for indels, and 95.99% for structural variants. We estimate that 2,434 discordances are correctable mistakes in the high-quality Genome in a Bottle benchmark. Nearly all (99.64%) variants are phased into haplotypes, which further improves variant detection. De novo assembly produces a highly contiguous and accurate genome with contig N50 above 15 Mb and concordance of 99.998%. CCS reads match short reads for small variant detection, while enabling structural variant detection and de novo assembly at similar contiguity and markedly higher concordance than noisy long reads.
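Contig N50, the contiguity measure cited above, is the length of the contig at which the running total of contigs, sorted from longest to shortest, first reaches half of the assembly size. The sketch below uses a hypothetical function name and invented contig lengths, not values from this assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly: total length 290, so half is 145.
# 80 + 70 = 150 reaches the halfway point, giving an N50 of 70.
example_n50 = n50([80, 70, 50, 40, 30, 20])
print(example_n50)
```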


Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing

Nature Communications, DOI: 10.1038/s41467-019-12493-y, 2019, Vol 10 (1), Cited By ~ 26

Author(s): Peter Edge, Vikas Bansal

Keyword(s): Single Molecule, Variant Calling, Small Scale, Whole Genome, Limited Information, Single Nucleotide Variants, Pacific Biosciences, Sequencing Technologies, Long Reads, Long Read

Abstract: Whole-genome sequencing using sequencing technologies such as Illumina enables the accurate detection of small-scale variants but provides limited information about haplotypes and variants in repetitive regions of the human genome. Single-molecule sequencing (SMS) technologies such as Pacific Biosciences and Oxford Nanopore generate long reads that can potentially address the limitations of short-read sequencing. However, the high error rate of SMS reads makes it challenging to detect small-scale variants in diploid genomes. We introduce a variant calling method, Longshot, which leverages the haplotype information present in SMS reads to accurately detect and phase single-nucleotide variants (SNVs) in diploid genomes. We demonstrate that Longshot achieves very high accuracy for SNV detection using whole-genome Pacific Biosciences data, outperforms existing variant calling methods, and enables variant detection in duplicated regions of the genome that cannot be mapped using short reads.


Whole-genome sequencing with long reads reveals complex structure and origin of structural variation in human genetic variations and somatic mutations in cancer

Genome Medicine, DOI: 10.1186/s13073-021-00883-1, 2021, Vol 13 (1)

Author(s): Akihiro Fujimoto, Jing Hao Wong, Yukiko Yoshii, Shintaro Akiyama, Azusa Tanaka, ...

Keyword(s): Whole Genome Sequencing, Genome Sequencing, Somatic Mutations, Complex Structure, Whole Genome, Structural Variations, Link Type, Long Reads, Liver Cancers, Long Read

Abstract

Background: Identification of germline variation and somatic mutations is a major issue in human genetics. However, due to the limitations of DNA sequencing technologies and computational algorithms, our understanding of genetic variation and somatic mutations is far from complete.

Methods: In the present study, we performed whole-genome sequencing using long-read sequencing technology (Oxford Nanopore) for 11 Japanese liver cancers and matched normal samples which were previously sequenced for the International Cancer Genome Consortium (ICGC). We constructed an analysis pipeline for the long-read data and identified germline and somatic structural variations (SVs).

Results: In polymorphic germline SVs, our analysis identified 8004 insertions, 6389 deletions, 27 inversions, and 32 intra-chromosomal translocations. By comparing to the chimpanzee genome, we correctly inferred events that caused insertions and deletions and found that most insertions were caused by transposons, with Alu as the predominant source, while other types of insertions, such as tandem duplications and processed pseudogenes, are rare. We inferred mechanisms of deletion generation and found that most non-allelic homologous recombination (NAHR) events were caused by recombination errors in SINEs. Analysis of somatic mutations in liver cancers showed that long reads could detect larger numbers of SVs than a previous short-read study and that the mechanisms of cancer SV generation differed from those of germline deletions.

Conclusions: Our analysis provides a comprehensive catalog of polymorphic and somatic SVs, as well as their possible causes. Our software is available at https://github.com/afujimoto/CAMPHOR and https://github.com/afujimoto/CAMPHORsomatic.


Robust Benchmark Structural Variant Calls of An Asian Using the State-of-Art Long Fragment Sequencing Technologies

DOI: 10.1101/2020.08.10.245308, 2020

Author(s): Xiao Du, Lili Li, Fan Liang, Sanyang Liu, Wenxin Zhang, ...

Keyword(s): B Lymphocyte, De Novo, Structural Variants, High Confidence, False Negatives, Sequencing Technologies, Long Reads, Oxford Nanopore, Long Read, Circular Consensus Sequencing

Abstract: The importance of structural variants (SVs) for phenotypes and human diseases is now recognized. Although a variety of SV detection platforms and strategies that vary in sensitivity and specificity have been developed, few benchmarking procedures are available to confidently assess their performance in biological and clinical research. To facilitate the validation and application of those approaches, our work established an Asian reference material comprising identified benchmark regions and high-confidence SV calls. We established a high-confidence SV callset with 8,938 SVs in an EBV immortalized B lymphocyte line, by integrating four alignment-based SV callers [from 109× PacBio continuous long read (CLR), 22× PacBio circular consensus sequencing (CCS) reads, 104× Oxford Nanopore long reads, and 114× optical mapping platform (Bionano)] and one de novo assembly-based SV caller using CCS reads. A total of 544 randomly selected SVs were validated by PCR and Sanger sequencing, confirming the robustness of our SV calls. Combining trio-binning based haplotype assemblies, we established an SV benchmark for identification of false negatives and false positives by constructing the continuous high confident regions (CHCRs), which cover 1.46 Gb and 6,882 SVs supported by at least one diploid haplotype assembly. Establishing high-confidence SV calls for a benchmark sample that has been characterized by multiple technologies provides a valuable resource for investigating SVs in human biology, disease, and clinical diagnosis.


Jasmine: Population-scale structural variant comparison and analysis

DOI: 10.1101/2021.05.27.445886, 2021

Author(s): Melanie Kirsche, Gautam Prabhu, Rachel Sherman, Bohan Ni, Sergey Aganezov, ...

Keyword(s): De Novo, Population Analysis, Accurate Method, Structural Variants, Sequencing Data, 1000 Genomes, Dna And Rna, Long Reads, Long Read, Proximity Graph

The increasing availability of long-reads is revolutionizing studies of structural variants (SVs). However, because SVs vary across individuals and are discovered through imprecise read technologies and methods, they can be difficult to compare. Addressing this, we present Jasmine (https://github.com/mkirsche/Jasmine), a fast and accurate method for SV refinement, comparison, and population analysis. Using an SV proximity graph, Jasmine outperforms five widely-used comparison methods, including reducing the rate of Mendelian discordance in trio datasets by more than five-fold, and reveals a set of high confidence de novo SVs confirmed by multiple long-read technologies. We also present a harmonized callset of 205,192 SVs from 31 samples of diverse ancestry sequenced with long reads. We genotype these SVs in 444 short read samples from the 1000 Genomes Project with both DNA and RNA sequencing data and assess their widespread impact on gene expression, including within several medically relevant genes.
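Mendelian discordance, the trio metric mentioned above, can be illustrated with a small Python sketch. It assumes diploid autosomal genotypes written as pairs of alleles (0 for reference, 1 for the variant) and counts a site as discordant when the child's genotype cannot be formed from one allele of each parent. This is a generic illustration with hypothetical function names and invented genotypes, not Jasmine's implementation.

```python
def is_mendelian_consistent(child, father, mother):
    """True if the child's two alleles can be drawn one from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def discordance_rate(trios):
    """Fraction of (child, father, mother) genotype triples that violate Mendelian inheritance."""
    if not trios:
        raise ValueError("trios must not be empty")
    discordant = sum(1 for c, f, m in trios if not is_mendelian_consistent(c, f, m))
    return discordant / len(trios)


# Hypothetical genotypes at four SV sites, given as (child, father, mother).
trios = [
    ((0, 1), (0, 0), (0, 1)),  # consistent: variant allele inherited from the mother
    ((1, 1), (0, 1), (0, 1)),  # consistent: one variant allele from each parent
    ((0, 1), (0, 0), (0, 0)),  # discordant: candidate de novo SV or a calling error
    ((0, 0), (1, 1), (0, 1)),  # discordant: father can only transmit the variant allele
]
rate = discordance_rate(trios)
print(rate)
```

Discordant sites mix true de novo variants with merging and genotyping errors, so a lower rate after SV comparison is usually read as evidence of cleaner merging.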


Complex Structural Variants Resolved by Short-Read and Long-Read Whole Genome Sequencing in Mendelian Disorders

DOI: 10.1101/281683, 2018, Cited By ~ 2

Author(s): Alba Sanchis-Juan, Jonathan Stephens, Courtney E French, Nicholas Gleadall, Karyn Mégy, ...

Keyword(s): Whole Genome Sequencing, Genome Sequencing, De Novo, Genomic Variation, Mendelian Disease, Whole Genome, Structural Variants, Short Read, Long Read, Complex Structural

Abstract: Complex structural variants (cxSVs) are genomic rearrangements comprising multiple structural variants, typically involving three or more breakpoint junctions. They contribute to human genomic variation and can cause Mendelian disease; however, they are not typically considered during genetic testing. Here, we investigate the role of cxSVs in Mendelian disease using short-read whole genome sequencing (WGS) data from 1,324 individuals with neurodevelopmental or retinal disorders from the NIHR BioResource project. We present four cases of individuals with a cxSV affecting Mendelian disease-associated genes. Three of the cxSVs are pathogenic: a de novo duplication-inversion-inversion-deletion affecting ARID1B in an individual with Coffin-Siris syndrome, a deletion-inversion-duplication affecting HNRNPU in an individual with intellectual disability and seizures, and a homozygous deletion-inversion-deletion affecting CEP78 in an individual with cone-rod dystrophy. Additionally, we identified a de novo duplication-inversion-duplication overlapping CDKL5 in an individual with neonatal hypoxic-ischaemic encephalopathy. Long-read sequencing technology used to resolve the breakpoints demonstrated the presence of both a disrupted and an intact copy of CDKL5 on the same allele; therefore, it was classified as a variant of uncertain significance. Analysis of sequence flanking all breakpoint junctions in all the cxSVs revealed both microhomology and longer repetitive sequences, suggesting both replication and homology based processes. Accurate resolution of cxSVs is essential for clinical interpretation, and here we demonstrate that long-read WGS is a powerful technology by which to achieve this. Our results show cxSVs are an important although rare cause of Mendelian disease, and we therefore recommend their consideration during research and clinical investigations.
