Long-read whole genome sequencing identifies causal structural variation in a Mendelian disease

10.1101/090985 ◽

2016 ◽

Cited By ~ 7

Author(s):

Jason D. Merker ◽

Aaron M. Wenger ◽

Tam Sneddon ◽

Megan Grove ◽

Daryl Waggott ◽

...

Keyword(s):

Large Scale ◽

Structural Variation ◽

Tandem Repeats ◽

Diagnostic Yield ◽

Carney Complex ◽

Mendelian Disease ◽

Whole Genome ◽

Structural Variants ◽

Unrelated Control ◽

Long Read

Abstract: Current clinical genomics assays primarily utilize short-read sequencing (SRS), which offers high throughput, high base accuracy, and low cost per base. SRS has, however, limited ability to evaluate tandem repeats, regions with high [GC] or [AT] content, highly polymorphic regions, highly paralogous regions, and large-scale structural variants. Long-read sequencing (LRS) has complementary strengths and offers a means to discover overlooked genetic variation in patients undiagnosed by SRS. To evaluate LRS, we selected a patient who presented with multiple neoplasia and cardiac myxomata suggestive of Carney complex for whom targeted clinical gene testing and whole genome SRS were negative. Low-coverage whole genome LRS was performed on the PacBio Sequel system and structural variants were called, yielding 6,971 deletions and 6,821 insertions > 50 bp. Filtering for variants that are absent in an unrelated control and that overlap a coding exon of a disease gene identified three deletions and three insertions. One of these, a heterozygous 2,184 bp deletion, overlaps the first coding exon of PRKAR1A, which is implicated in autosomal dominant Carney complex. This variant was confirmed by Sanger sequencing and was classified as pathogenic using standard criteria for the interpretation of sequence variants. This first successful application of whole genome LRS to identify a pathogenic variant suggests that LRS has significant potential to identify disease-causing structural variation. We recommend larger studies to evaluate the diagnostic yield of LRS, and the development of a comprehensive catalog of common human structural variation to support future studies.
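The filtering step described in this abstract (keep calls that are absent in the control and that overlap a coding exon of a disease gene) can be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the `Interval` type, the `filter_candidates` helper, the toy coordinates, and the use of simple interval overlap as the test for "present in the control" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interval:
    """A genomic interval; half-open, 0-based coordinates (assumed convention)."""
    chrom: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both intervals are on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def filter_candidates(
    patient_svs: List[Interval],
    control_svs: List[Interval],
    disease_exons: List[Interval],
) -> List[Interval]:
    """Keep patient SVs with no overlapping control SV that hit a disease-gene coding exon.

    Treating any overlap as "present in the control" is a simplification
    chosen for this sketch; a real comparison would also match SV type and size.
    """
    return [
        sv
        for sv in patient_svs
        if not any(overlaps(sv, c) for c in control_svs)
        and any(overlaps(sv, e) for e in disease_exons)
    ]


# Toy data with made-up coordinates (hypothetical, not from the study).
patient = [
    Interval("chr17", 100, 2284),  # overlaps an exon, absent in control
    Interval("chr1", 500, 600),    # overlaps an exon, but also seen in control
    Interval("chr2", 10, 80),      # absent in control, but hits no exon
]
control = [Interval("chr1", 550, 650)]
exons = [Interval("chr17", 2000, 2100), Interval("chr1", 520, 580)]

candidates = filter_candidates(patient, control, exons)
print(candidates)
```

With the toy data above, only the chr17 call survives both filters.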

Expectations and blind spots for structural variation detection from short-read alignment and long-read assembly

10.1101/2020.07.03.168831 ◽

2020 ◽

Cited By ~ 3

Author(s):

Xuefang Zhao ◽

Ryan L. Collins ◽

Wan-Ping Lee ◽

Alexandra M. Weber ◽

Yukyung Jun ◽

...

Keyword(s):

Single Molecule ◽

Large Scale ◽

Structural Variation ◽

Human Genetics ◽

Clinical Diagnostics ◽

Added Value ◽

Mendelian Disease ◽

Segmental Duplications ◽

Genomic Context ◽

Long Read

Abstract: Virtually all genome sequencing efforts in national biobanks, complex and Mendelian disease programs, and emerging clinical diagnostic approaches utilize short-reads (srWGS), which present constraints for genome-wide discovery of structural variants (SVs). Alternative long-read single molecule technologies (lrWGS) offer significant advantages for genome assembly and SV detection, while these technologies are currently cost prohibitive for large-scale disease studies and clinical diagnostics (∼5-12X higher cost than comparable coverage srWGS). Moreover, only dozens of such genomes are currently publicly accessible by comparison to millions of srWGS genomes that have been commissioned for international initiatives. Given this ubiquitous reliance on srWGS in human genetics and genomics, we sought to characterize and quantify the properties of SVs accessible to both srWGS and lrWGS to establish benchmarks and expectations in ongoing medical and population genetic studies, and to project the added value of SVs uniquely accessible to each technology. In analyses of three trios with matched srWGS and lrWGS from the Human Genome Structural Variation Consortium (HGSVC), srWGS captured ∼11,000 SVs per genome using reference-based algorithms, while haplotype-resolved assembly from lrWGS identified ∼25,000 SVs per genome. Detection power and precision for SV discovery varied dramatically by genomic context and variant class: 9.7% of the current GRCh38 reference is defined by segmental duplications (SD) and simple repeats (SR), yet 91.4% of deletions that were specifically discovered by lrWGS localized to these regions. Across the remaining 90.3% of the human reference, we observed extremely high concordance (93.8%) for deletions discovered by srWGS and lrWGS after error correction using the raw lrWGS reads. Conversely, lrWGS was superior for detection of insertions across all genomic contexts.
Given that the non-SD/SR sequences span 90.3% of the GRCh38 reference, and encompass 95.9% of coding exons in currently annotated disease associated genes, improved sensitivity from lrWGS to discover novel and interpretable pathogenic deletions not already accessible to srWGS is likely to be incremental. However, these analyses highlight the added value of assembly-based lrWGS to create new catalogues of functional insertions and transposable elements, as well as disease associated repeat expansions in genomic regions previously recalcitrant to routine assessment.
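One way to read the SD/SR figures in this abstract is as a fold enrichment: the share of lrWGS-specific deletions falling in SD/SR regions divided by the share of the reference those regions occupy. The sketch below only restates that arithmetic; the `fold_enrichment` helper is a hypothetical name, and the abstract itself does not report this ratio.

```python
def fold_enrichment(share_of_variants: float, share_of_genome: float) -> float:
    """Ratio of the observed variant share to the share expected from region size alone."""
    if not 0 < share_of_genome <= 1:
        raise ValueError("share_of_genome must be in (0, 1]")
    if not 0 <= share_of_variants <= 1:
        raise ValueError("share_of_variants must be in [0, 1]")
    return share_of_variants / share_of_genome


# Figures taken from the abstract: 91.4% of lrWGS-specific deletions
# fall in the 9.7% of GRCh38 annotated as SD or SR.
sd_sr_enrichment = fold_enrichment(0.914, 0.097)
print(f"{sd_sr_enrichment:.1f}x")  # prints "9.4x"
```

Under this reading, lrWGS-specific deletions are roughly nine times more concentrated in SD/SR sequence than region size alone would predict.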

Complex Structural Variants Resolved by Short-Read and Long-Read Whole Genome Sequencing in Mendelian Disorders

10.1101/281683 ◽

2018 ◽

Cited By ~ 2

Author(s):

Alba Sanchis-Juan ◽

Jonathan Stephens ◽

Courtney E French ◽

Nicholas Gleadall ◽

Karyn Mégy ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

De Novo ◽

Genomic Variation ◽

Mendelian Disease ◽

Whole Genome ◽

Structural Variants ◽

Short Read ◽

Long Read ◽

Complex Structural

Abstract: Complex structural variants (cxSVs) are genomic rearrangements comprising multiple structural variants, typically involving three or more breakpoint junctions. They contribute to human genomic variation and can cause Mendelian disease; however, they are not typically considered during genetic testing. Here, we investigate the role of cxSVs in Mendelian disease using short-read whole genome sequencing (WGS) data from 1,324 individuals with neurodevelopmental or retinal disorders from the NIHR BioResource project. We present four cases of individuals with a cxSV affecting Mendelian disease-associated genes. Three of the cxSVs are pathogenic: a de novo duplication-inversion-inversion-deletion affecting ARID1B in an individual with Coffin-Siris syndrome, a deletion-inversion-duplication affecting HNRNPU in an individual with intellectual disability and seizures, and a homozygous deletion-inversion-deletion affecting CEP78 in an individual with cone-rod dystrophy. Additionally, we identified a de novo duplication-inversion-duplication overlapping CDKL5 in an individual with neonatal hypoxic-ischaemic encephalopathy. Long-read sequencing technology used to resolve the breakpoints demonstrated the presence of both a disrupted and an intact copy of CDKL5 on the same allele; therefore, it was classified as a variant of uncertain significance. Analysis of sequence flanking all breakpoint junctions in all the cxSVs revealed both microhomology and longer repetitive sequences, suggesting both replication-based and homology-based processes. Accurate resolution of cxSVs is essential for clinical interpretation, and here we demonstrate that long-read WGS is a powerful technology by which to achieve this. Our results show cxSVs are an important although rare cause of Mendelian disease, and we therefore recommend their consideration during research and clinical investigations.

NanoSatellite: accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION

Genome Biology ◽

10.1186/s13059-019-1856-3 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 10

Author(s):

Arne De Roeck ◽

Wouter De Coster ◽

Liene Bossaerts ◽

Rita Cacace ◽

Tim De Pooter ◽

...

Keyword(s):

Tandem Repeat ◽

Large Scale ◽

Tandem Repeats ◽

Current Data ◽

Flip Flop ◽

Base Calling ◽

Oxford Nanopore ◽

Long Read ◽

Technological Limitations ◽

Repeat Assessment

Abstract: Technological limitations have hindered the large-scale genetic investigation of tandem repeats in disease. We show that long-read sequencing with a single Oxford Nanopore Technologies PromethION flow cell per individual achieves 30× human genome coverage and enables accurate assessment of tandem repeats including the 10,000-bp Alzheimer’s disease-associated ABCA7 VNTR. The Guppy “flip-flop” base caller and tandem-genotypes tandem repeat caller are efficient for large-scale tandem repeat assessment, but base calling and alignment challenges persist. We present NanoSatellite, which analyzes tandem repeats directly on electric current data and improves calling of GC-rich tandem repeats, expanded alleles, and motif interruptions.

Population-level genome-wide STR typing in Plasmodium species reveals higher resolution population structure and genetic diversity relative to SNP typing

10.1101/2021.05.19.444768 ◽

2021 ◽

Author(s):

Jiru Han ◽

Jacob E Munro ◽

Anthony Kocoski ◽

Alyssa E Barry ◽

Melanie Bahlo

Keyword(s):

Genetic Diversity ◽

Large Scale ◽

Tandem Repeats ◽

Plasmodium Species ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Genome Wide ◽

Field Samples

Short tandem repeats (STRs) are highly informative genetic markers that have been used extensively in population genetics analysis. They are an important source of genetic diversity and can also have functional impact. Despite the availability of bioinformatic methods that permit large-scale genome-wide genotyping of STRs from whole genome sequencing data, they have not previously been applied to sequencing data from large collections of malaria parasite field samples. Here, we have genotyped STRs using HipSTR in more than 3,000 Plasmodium falciparum and 174 Plasmodium vivax published whole-genome sequence data from samples collected across the globe. High levels of noise and variability in the resultant callset necessitated the development of a novel method for quality control of STR genotype calls. A set of high-quality STR loci (6,768 from P. falciparum and 3,496 from P. vivax) were used to study Plasmodium genetic diversity, population structures and genomic signatures of selection and these were compared to genome-wide single nucleotide polymorphism (SNP) genotyping data. In addition, the genome-wide information about genetic variation and other characteristics of STRs in P. falciparum and P. vivax have been made available in an interactive web-based R Shiny application PlasmoSTR (https://github.com/bahlolab/PlasmoSTR).

High-throughput multiplexed tandem repeat genotyping using targeted long-read sequencing

10.1101/673251 ◽

2019 ◽

Cited By ~ 1

Author(s):

Devika Ganesamoorthy ◽

Mengjia Yan ◽

Valentine Murigneux ◽

Chenxi Zhou ◽

Minh Duc Cao ◽

...

Keyword(s):

Large Scale ◽

Tandem Repeats ◽

Gc Content ◽

Genomic Variation ◽

Population Variation ◽

Small Subset ◽

Sequence Coverage ◽

Copy Numbers ◽

Long Read ◽

Highly Correlated

Abstract: Tandem repeats (TRs) are highly prone to variation in copy numbers due to their repetitive and unstable nature, which makes them a major source of genomic variation between individuals. However, population variation of TRs has not been widely explored due to the limitations of existing tools, which are either low-throughput or restricted to a small subset of TRs. Here, we used a SureSelect targeted sequencing approach combined with Nanopore sequencing to overcome these limitations. We achieved an average of 3062-fold target enrichment on a panel of 142 TR loci, generating an average of 97X sequence coverage on 7 samples utilizing 2 MinION flow-cells with 200 ng of input DNA per sample. We identified a subset of 110 TR loci with length less than 2 kb and GC content greater than 25%, for which we achieved an average genotyping rate of 75%, increasing to 91% for the highest-coverage sample. Alleles estimated from targeted long-read sequencing were concordant with gold standard PCR sizing analysis and, moreover, highly correlated with alleles estimated from whole genome long-read sequencing. We demonstrate a targeted long-read sequencing approach that enables simultaneous analysis of hundreds of TRs with accuracy comparable to PCR sizing analysis. Our approach is feasible to scale to more targets and more samples, facilitating large-scale analysis of TRs.

StrVCTVRE: A supervised learning method to predict the pathogenicity of human structural variants

10.1101/2020.05.15.097048 ◽

2020 ◽

Author(s):

Andrew G. Sharo ◽

Zhiqiang Hu ◽

Steven E. Brenner

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Diagnostic Methods ◽

Training Dataset ◽

Disease Genes ◽

Whole Genome ◽

Structural Variants ◽

Coding Region ◽

Diagnostic Potential ◽

Long Read

Abstract: Whole genome sequencing resolves clinical cases where standard diagnostic methods have failed. However, preliminary studies show that at least half of these cases still remain unresolved, even after whole genome sequencing. Structural variants (genomic variants larger than 50 base pairs) of uncertain significance may be the genetic cause of a portion of these unresolved cases. Historically, structural variants (SVs) have been difficult to detect with confidence from short-read sequencing. As both detection algorithms and long-read/linked-read sequencing methods become more accessible, clinical researchers will have access to thousands of reliable SVs of unknown disease relevance. Filtering these SVs by overlap with cataloged SVs is an imperfect solution. Innovative methods to predict the pathogenicity of these SVs will be needed to realize the full diagnostic potential of long-read sequencing. To address this emerging need, we developed StrVCTVRE (Structural Variant Classifier Trained on Variants Rare and Exonic), a classifier that can be used to distinguish pathogenic SVs from benign SVs that overlap exons. We made use of features that capture gene importance, coding region, conservation, expression, and exon structure in a random forest classifier. We found that some features, such as expression and conservation, are important but are absent from SV classification guidelines. Although databases of SVs reflect size biases from sequencing techniques, we leveraged multiple databases to construct a size-matched training set of rare, putatively benign and pathogenic SVs. In independent test sets, we found our method performs accurately across a wide SV size range, which will allow clinical researchers to eliminate nearly 60% of SVs from consideration at an elevated sensitivity of 90%. However, our method and its assessment are still constrained by a small training dataset and acquisition bias in databases of pathogenic variants.
StrVCTVRE fills an empty niche in the clinical evaluation of SVs of unknown significance. We anticipate researchers will use it to prioritize SVs in patients where no variant is immediately compelling, empowering deeper investigation into novel SVs and disease genes to resolve cases.

NextSV: a meta-caller for structural variants from low-coverage long-read sequencing data

10.1101/092544 ◽

2016 ◽

Author(s):

Li Fang ◽

Jiang Hu ◽

Depeng Wang ◽

Kai Wang

Keyword(s):

Whole Genome ◽

Ashkenazi Jewish ◽

Structural Variants ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Human Genomes ◽

Long Read ◽

Personal Genomes ◽

Low Coverage

Abstract

Background: Structural variants (SVs) in human genomes are implicated in a variety of human diseases. Long-read sequencing delivers much longer read lengths than short-read sequencing and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, it is unclear what coverage is needed and how to optimally use the aligners and SV callers.

Results: In this study, we developed NextSV, a meta-caller to perform SV calling from low coverage long-read sequencing data. NextSV integrates three aligners and three SV callers and generates two integrated call sets (sensitive/stringent) for different analysis purposes. We evaluated SV calling performance of NextSV under different PacBio coverages on two personal genomes, NA12878 and HX1. Our results showed that, compared with running any single SV caller, NextSV stringent call set had higher precision and balanced accuracy (F1 score) while NextSV sensitive call set had a higher recall. At 10X coverage, the recall of NextSV sensitive call set was 93.5% to 94.1% for deletions and 87.9% to 93.2% for insertions, indicating that ~10X coverage might be an optimal coverage to use in practice, considering the balance between the sequencing costs and the recall rates. We further evaluated the Mendelian errors on an Ashkenazi Jewish trio dataset.

Conclusions: Our results provide useful guidelines for SV detection from low coverage whole-genome PacBio data and we expect that NextSV will facilitate the analysis of SVs on long-read sequencing data.
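For readers unfamiliar with the metrics named in this abstract, the sketch below shows the standard definitions of precision, recall, and F1 (the harmonic mean of the two) computed from counts of true positives, false positives, and false negatives. The function name and the example counts are hypothetical; they are not NextSV results, and how NextSV matches calls to a truth set is not described here.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from call-set comparison counts.

    tp: calls that match a truth-set SV
    fp: calls with no matching truth-set SV
    fn: truth-set SVs with no matching call
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts for illustration only.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

A stringent call set typically trades some recall for higher precision, which is why the two NextSV call sets are reported against different metrics.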

Long-read genome sequencing identifies causal structural variation in a Mendelian disease

Genetics in Medicine ◽

10.1038/gim.2017.86 ◽

2017 ◽

Vol 20 (1) ◽

pp. 159-163 ◽

Cited By ~ 81

Author(s):

Jason D Merker ◽

Aaron M Wenger ◽

Tam Sneddon ◽

Megan Grove ◽

Zachary Zappala ◽

...

Keyword(s):

Genome Sequencing ◽

Structural Variation ◽

Mendelian Disease ◽

Long Read

A 12-kb structural variation in progressive myoclonic epilepsy was newly identified by long-read whole-genome sequencing

Journal of Human Genetics ◽

10.1038/s10038-019-0569-5 ◽

2019 ◽

Vol 64 (5) ◽

pp. 359-368 ◽

Cited By ~ 14

Author(s):

Takeshi Mizuguchi ◽

Takeshi Suzuki ◽

Chihiro Abe ◽

Ayako Umemura ◽

Katsushi Tokunaga ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Structural Variation ◽

Myoclonic Epilepsy ◽

Whole Genome ◽

Long Read ◽

Progressive Myoclonic Epilepsy

Diagnostic Yield of Whole Genome Sequencing After Nondiagnostic Exome Sequencing or Gene Panel in Developmental and Epileptic Encephalopathies

Neurology ◽

10.1212/wnl.0000000000011655 ◽

2021 ◽

Vol 96 (13) ◽

pp. e1770-e1782

Author(s):

Elizabeth Emma Palmer ◽

Rani Sachdev ◽

Rebecca Macintosh ◽

Uirá Souto Melo ◽

Stefan Mundlos ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Exome Sequencing ◽

Genome Sequencing ◽

Diagnostic Yield ◽

Chromosomal Microarray ◽

Whole Genome ◽

Structural Variants ◽

Epileptic Encephalopathies ◽

Complex Structural ◽

Made In

Objective: To assess the benefits and limitations of whole genome sequencing (WGS) compared to exome sequencing (ES) or multigene panel (MGP) in the molecular diagnosis of developmental and epileptic encephalopathies (DEE).

Methods: We performed WGS of 30 comprehensively phenotyped DEE patient trios that were undiagnosed after first-tier testing, including chromosomal microarray and either research ES (n = 15) or diagnostic MGP (n = 15).

Results: Eight diagnoses were made in the 15 individuals who received prior ES (53%): 3 individuals had complex structural variants; 5 had ES-detectable variants, which now had additional evidence for pathogenicity. Eleven diagnoses were made in the 15 MGP-negative individuals (73%); the majority (n = 10) involved genes not included in the panel, particularly in individuals with postneonatal onset of seizures and those with more complex presentations including movement disorders, dysmorphic features, or multiorgan involvement. A total of 42% of diagnoses were autosomal recessive or X-chromosome linked.

Conclusion: WGS was able to improve diagnostic yield over ES primarily through the detection of complex structural variants (n = 3). The higher diagnostic yield was otherwise better attributed to the power of re-analysis rather than inherent advantages of the WGS platform. Additional research is required to assist in the assessment of pathogenicity of novel noncoding and complex structural variants and further improve diagnostic yield for patients with DEE and other neurogenetic disorders.
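Diagnostic yield here is simply diagnoses divided by individuals tested. The sketch below recomputes it from the counts given in the Results (8 of 15 after ES, 11 of 15 after MGP). The `diagnostic_yield` helper is a hypothetical name, and the combined yield across both groups is derived here for illustration only; it is not reported in the abstract.

```python
def diagnostic_yield(diagnosed: int, tested: int) -> float:
    """Percentage of tested individuals who received a molecular diagnosis."""
    if tested <= 0 or not 0 <= diagnosed <= tested:
        raise ValueError("need 0 <= diagnosed <= tested and tested > 0")
    return 100.0 * diagnosed / tested


# (diagnosed, tested) counts as stated in the abstract's Results.
cohorts = {"prior ES": (8, 15), "prior MGP": (11, 15)}

yields = {name: diagnostic_yield(d, n) for name, (d, n) in cohorts.items()}

# Combined yield across both groups (derived, not reported in the abstract).
total_diagnosed = sum(d for d, _ in cohorts.values())
total_tested = sum(n for _, n in cohorts.values())
overall = diagnostic_yield(total_diagnosed, total_tested)

for name, pct in yields.items():
    print(f"{name}: {pct:.1f}%")
print(f"overall: {overall:.1f}%")
```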
