Structural variants selected during yak domestication inferred from long-read whole-genome sequencing

Molecular Biology and Evolution ◽

10.1093/molbev/msab134 ◽

2021 ◽

Author(s):

Shangzhe Zhang ◽

Wenyu Liu ◽

Xinfeng Liu ◽

Xin Du ◽

Ke Zhang ◽

...

Keyword(s):

Artificial Selection ◽

Reference Genome ◽

Genetic Resource ◽

Repetitive Sequences ◽

Geographic Range ◽

Whole Genome ◽

System Behavior ◽

Structural Variants ◽

Animal Domestication ◽

Long Read

Abstract Structural variants (SVs) represent an important genetic resource for both natural and artificial selection. Here we present a chromosome-scale reference genome for domestic yak (Bos grunniens) that has longer contigs and scaffolds (N50 of 44.72 Mb and 114.39 Mb, respectively) than reported for any other ruminant genome. We further obtained long-read resequencing data for 6 wild and 23 domestic yaks and constructed a genetic SV map of 37,220 SVs that covers the geographic range of the yaks. The majority of the SVs contain repetitive sequences, and several are in or near genes. By comparing SVs in domestic and wild yaks, we identified genes that are predominantly related to the nervous system, behavior, immunity and reproduction and may have been targeted by artificial selection during yak domestication. These findings provide new insights into the domestication of animals living at high altitude and highlight the importance of SVs in animal domestication.
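
The contig and scaffold N50 values quoted above follow the standard definition: the length L such that sequences of length L or longer contain at least half of the assembled bases. A minimal sketch of that calculation is given here for orientation; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (made-up lengths, not yak data): the total is 32,
# and the two longest sequences already cover half of it.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_lengths)
```

In practice the lengths would be read from the assembly FASTA index; that step is omitted because the file layout is not described in the abstract.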

StrVCTVRE: A supervised learning method to predict the pathogenicity of human structural variants

10.1101/2020.05.15.097048 ◽

2020 ◽

Author(s):

Andrew G. Sharo ◽

Zhiqiang Hu ◽

Steven E. Brenner

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Diagnostic Methods ◽

Training Dataset ◽

Disease Genes ◽

Whole Genome ◽

Structural Variants ◽

Coding Region ◽

Diagnostic Potential ◽

Long Read

Abstract Whole genome sequencing resolves clinical cases where standard diagnostic methods have failed. However, preliminary studies show that at least half of these cases still remain unresolved, even after whole genome sequencing. Structural variants (genomic variants larger than 50 base pairs) of uncertain significance may be the genetic cause of a portion of these unresolved cases. Historically, structural variants (SVs) have been difficult to detect with confidence from short-read sequencing. As both detection algorithms and long-read/linked-read sequencing methods become more accessible, clinical researchers will have access to thousands of reliable SVs of unknown disease relevance. Filtering these SVs by overlap with cataloged SVs is an imperfect solution. Innovative methods to predict the pathogenicity of these SVs will be needed to realize the full diagnostic potential of long-read sequencing. To address this emerging need, we developed StrVCTVRE (Structural Variant Classifier Trained on Variants Rare and Exonic), a classifier that can be used to distinguish pathogenic SVs from benign SVs that overlap exons. We made use of features that capture gene importance, coding region, conservation, expression, and exon structure in a random forest classifier. We found that some features, such as expression and conservation, are important but are absent from SV classification guidelines. Although databases of SVs reflect size biases from sequencing techniques, we leveraged multiple databases to construct a size-matched training set of rare, putatively benign and pathogenic SVs. In independent test sets, we found our method performs accurately across a wide SV size range, which will allow clinical researchers to eliminate nearly 60% of SVs from consideration at an elevated sensitivity of 90%. However, our method and its assessment are still constrained by a small training dataset and acquisition bias in databases of pathogenic variants. StrVCTVRE fills an empty niche in the clinical evaluation of SVs of unknown significance. We anticipate researchers will use it to prioritize SVs in patients where no variant is immediately compelling, empowering deeper investigation into novel SVs and disease genes to resolve cases.
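
The trade-off stated in the abstract (removing a share of SVs while keeping 90% sensitivity) can be illustrated with a generic score threshold. The sketch below is not StrVCTVRE's code or API; the function names, the toy scores, and the choice to count eliminated SVs over all scored SVs are assumptions made for illustration.

```python
import math


def threshold_at_sensitivity(scores, labels, target=0.9):
    """Largest score cutoff that still retains `target` of the pathogenic SVs.

    scores: classifier scores, higher meaning more likely pathogenic.
    labels: True for pathogenic, False for benign (same order as scores).
    """
    pathogenic = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not pathogenic:
        raise ValueError("need at least one pathogenic example")
    keep = math.ceil(target * len(pathogenic))
    return pathogenic[keep - 1]


def fraction_eliminated(scores, cutoff):
    """Fraction of all SVs scoring below the cutoff (dropped from review)."""
    return sum(1 for s in scores if s < cutoff) / len(scores)


# Toy data (invented): 10 pathogenic SVs followed by 5 benign SVs.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.95, 0.85, 0.75, 0.65, 0.2,
              0.1, 0.3, 0.4, 0.55, 0.6]
toy_labels = [True] * 10 + [False] * 5

cutoff = threshold_at_sensitivity(toy_scores, toy_labels, target=0.9)
dropped = fraction_eliminated(toy_scores, cutoff)
```

On real data the cutoff would be chosen on a held-out set, since picking it on the same SVs used for evaluation overstates performance.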

NextSV: a meta-caller for structural variants from low-coverage long-read sequencing data

10.1101/092544 ◽

2016 ◽

Author(s):

Li Fang ◽

Jiang Hu ◽

Depeng Wang ◽

Kai Wang

Keyword(s):

Whole Genome ◽

Ashkenazi Jewish ◽

Structural Variants ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Human Genomes ◽

Long Read ◽

Personal Genomes ◽

Low Coverage

Abstract Background: Structural variants (SVs) in human genomes are implicated in a variety of human diseases. Long-read sequencing delivers much longer read lengths than short-read sequencing and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, it is unclear what coverage is needed and how to optimally use the aligners and SV callers. Results: In this study, we developed NextSV, a meta-caller to perform SV calling from low coverage long-read sequencing data. NextSV integrates three aligners and three SV callers and generates two integrated call sets (sensitive/stringent) for different analysis purposes. We evaluated SV calling performance of NextSV under different PacBio coverages on two personal genomes, NA12878 and HX1. Our results showed that, compared with running any single SV caller, NextSV stringent call set had higher precision and balanced accuracy (F1 score) while NextSV sensitive call set had a higher recall. At 10X coverage, the recall of NextSV sensitive call set was 93.5% to 94.1% for deletions and 87.9% to 93.2% for insertions, indicating that ~10X coverage might be an optimal coverage to use in practice, considering the balance between the sequencing costs and the recall rates. We further evaluated the Mendelian errors on an Ashkenazi Jewish trio dataset. Conclusions: Our results provide useful guidelines for SV detection from low coverage whole-genome PacBio data and we expect that NextSV will facilitate the analysis of SVs on long-read sequencing data.
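
The precision, recall, and F1 figures reported above are standard call-set metrics. As a reminder of how they relate, here is a minimal sketch computed from counts of true positives, false positives, and false negatives; the function name and the example counts are hypothetical, and how a call is matched to a truth-set SV (for example, by reciprocal overlap) is left out because the abstract does not specify it.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from call-set counts.

    tp: calls that match a truth-set SV
    fp: calls with no truth-set match
    fn: truth-set SVs that no call matched
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, not taken from the NextSV evaluation.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

A stringent call set typically lowers `fp` at the cost of a higher `fn`, which is why it can gain precision while a sensitive call set gains recall.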

Progress towards a reference genome for sunflower

Botany ◽

10.1139/b11-032 ◽

2011 ◽

Vol 89 (7) ◽

pp. 429-437 ◽

Cited By ~ 60

Author(s):

N.C. Kane ◽

N. Gill ◽

M.G. King ◽

J.E. Bowers ◽

H. Berges ◽

...

Keyword(s):

Reference Genome ◽

Physical Map ◽

Repetitive Sequences ◽

Whole Genome Shotgun ◽

Whole Genome ◽

Horticultural Crops ◽

Physical Maps ◽

Linear Assembly ◽

Sequencing Strategy ◽

Sunflower Genome

The Compositae is one of the largest and most economically important families of flowering plants and includes a diverse array of food crops, horticultural crops, medicinals, and noxious weeds. Despite its size and economic importance, there is no reference genome sequence for the Compositae, which impedes research and improvement efforts. We report on progress toward sequencing the 3.5 Gb genome of cultivated sunflower ( Helianthus annuus ), the most important crop in the family. Our sequencing strategy combines whole-genome shotgun sequencing using the Solexa and 454 platforms with the generation of high-density genetic and physical maps that serve as scaffolds for the linear assembly of whole-genome shotgun sequences. The performance of this approach is enhanced by the construction of a sequence-based physical map, which provides unique sequence-based tags every 5–6 kb across the genome. Thus far, our physical map covers ∼85% of the sunflower genome, and we have generated ∼80× genome coverage with Solexa reads and 15.5× with 454 reads. Preliminary analyses indicated that ∼78% of the sunflower genome consists of repetitive sequences. Nonetheless, ∼76% of contigs >5 kb in size can be assigned to either the physical or genetic map or to both, suggesting that our approach is likely to deliver a highly accurate and contiguous reference genome for sunflower.

Long-Read Genome Assembly of Saccharomyces uvarum Strain CBS 7001

Microbiology Resource Announcements ◽

10.1128/mra.00972-21 ◽

2022 ◽

Author(s):

Jingxuan Chen ◽

David J. Garfinkel ◽

Casey M. Bergman

Keyword(s):

Genome Assembly ◽

Reference Genome ◽

Sequence Data ◽

Sensu Stricto ◽

Whole Genome ◽

Saccharomyces Uvarum ◽

Content Type ◽

Whole Genome Shotgun Sequence ◽

Long Read ◽

Genome Shotgun Sequence

Here, we report a long-read genome assembly for Saccharomyces uvarum strain CBS 7001 based on PacBio whole-genome shotgun sequence data. Our assembly provides an improved reference genome for an important yeast in the Saccharomyces sensu stricto clade.

Fast and accurate reference-guided scaffolding of draft genomes

10.1101/519637 ◽

2019 ◽

Cited By ~ 13

Author(s):

Michael Alonge ◽

Sebastian Soyk ◽

Srividya Ramakrishnan ◽

Xingang Wang ◽

Sara Goodwin ◽

...

Keyword(s):

Open Source ◽

Genome Analysis ◽

Reference Genome ◽

De Novo ◽

Genetic Maps ◽

Alternative Methods ◽

Structural Variants ◽

Pan Genome ◽

Long Read ◽

Increasing Demand

Abstract Background: As the number of new genome assemblies continues to grow, there is increasing demand for methods to coalesce contigs from draft assemblies into pseudomolecules. Most current methods use genetic maps, optical maps, chromatin conformation (Hi-C), or other long-range linking data; however, these data are expensive and analysis methods often fail to accurately order and orient a high percentage of assembly contigs. Other approaches utilize alignments to a reference genome for ordering and orienting; however, these tools rely on slow aligners and are not robust to repetitive contigs. Results: We present RaGOO, an open-source reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in just minutes. With the pseudomolecules constructed, RaGOO identifies structural variants, including those spanning sequencing gaps that are not reported by alternative methods. We show that RaGOO accurately orders and orients contigs into nearly complete chromosomes based on de novo assemblies of Oxford Nanopore long-read sequencing from three wild and domesticated tomato genotypes, including the widely used M82 reference cultivar. We then demonstrate the scalability and utility of RaGOO with a pan-genome analysis of 103 Arabidopsis thaliana accessions by examining the structural variants detected in the newly assembled pseudomolecules. RaGOO is available open-source with an MIT license at https://github.com/malonge/RaGOO. Conclusions: We demonstrate that with a highly contiguous assembly and a structurally accurate reference genome, reference-guided scaffolding with RaGOO outperforms error-prone reference-free methods and enables rapid pan-genome analysis.

Complex Structural Variants Resolved by Short-Read and Long-Read Whole Genome Sequencing in Mendelian Disorders

10.1101/281683 ◽

2018 ◽

Cited By ~ 2

Author(s):

Alba Sanchis-Juan ◽

Jonathan Stephens ◽

Courtney E French ◽

Nicholas Gleadall ◽

Karyn Mégy ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

De Novo ◽

Genomic Variation ◽

Mendelian Disease ◽

Whole Genome ◽

Structural Variants ◽

Short Read ◽

Long Read ◽

Complex Structural

Abstract Complex structural variants (cxSVs) are genomic rearrangements comprising multiple structural variants, typically involving three or more breakpoint junctions. They contribute to human genomic variation and can cause Mendelian disease; however, they are not typically considered during genetic testing. Here, we investigate the role of cxSVs in Mendelian disease using short-read whole genome sequencing (WGS) data from 1,324 individuals with neurodevelopmental or retinal disorders from the NIHR BioResource project. We present four cases of individuals with a cxSV affecting Mendelian disease-associated genes. Three of the cxSVs are pathogenic: a de novo duplication-inversion-inversion-deletion affecting ARID1B in an individual with Coffin-Siris syndrome, a deletion-inversion-duplication affecting HNRNPU in an individual with intellectual disability and seizures, and a homozygous deletion-inversion-deletion affecting CEP78 in an individual with cone-rod dystrophy. Additionally, we identified a de novo duplication-inversion-duplication overlapping CDKL5 in an individual with neonatal hypoxic-ischaemic encephalopathy. Long-read sequencing technology used to resolve the breakpoints demonstrated the presence of both a disrupted and an intact copy of CDKL5 on the same allele; therefore, it was classified as a variant of uncertain significance. Analysis of sequence flanking all breakpoint junctions in all the cxSVs revealed both microhomology and longer repetitive sequences, suggesting both replication- and homology-based processes. Accurate resolution of cxSVs is essential for clinical interpretation, and here we demonstrate that long-read WGS is a powerful technology by which to achieve this. Our results show cxSVs are an important although rare cause of Mendelian disease, and we therefore recommend their consideration during research and clinical investigations.

MoMI-G: Modular Multi-scale Integrated Genome Graph Browser

10.1101/540120 ◽

2019 ◽

Cited By ~ 1

Author(s):

Toshiyuki T. Yokoyama ◽

Yoshitaka Sakamoto ◽

Masahide Seki ◽

Yutaka Suzuki ◽

Masahiro Kasahara

Keyword(s):

Whole Genome ◽

Structural Variants ◽

Nucleotide Level ◽

Web Based ◽

Multi Scale ◽

Manual Inspection ◽

Long Reads ◽

Long Read ◽

Visualization Tools ◽

Genome Graph

Abstract Long-read sequencing allows more sensitive and accurate discovery of structural variants (SVs). While more and more SVs are being identified, a number of them are difficult to visualize using existing SV visualization tools. Therefore, methods to visualize SVs such as nested or large SVs of over a megabase pair need to be developed. To this end, we developed MOdular Multi-scale Integrated Genome graph browser, MoMI-G, a web-based genome browser to visualize SVs, genes, repeats, and other annotations as a variation graph with paths. This browser allows more intuitive recognition of large, nested, and potentially more complex SVs. MoMI-G has view modules for different scales, which allow users to view the whole genome down to nucleotide-level alignments of long reads. Alignments spanning reference alleles and those spanning alternative alleles are shown in the same view. Users can customize the view if they are not satisfied with the preset views. In addition, MoMI-G has Interval Card Deck, a feature for rapid manual inspection of hundreds of SVs. Herein, we describe the utility of MoMI-G by using representative examples of large and nested SVs found in two cell lines, LC-2/ad and CHM1. MoMI-G is freely available at https://github.com/MoMI-G/MoMI-G under the MIT license.

svviz: a read viewer for validating structural variants

10.1101/016063 ◽

2015 ◽

Cited By ~ 1

Author(s):

Noah Spies ◽

Justin M Zook ◽

Marc Salit ◽

Arend Sidow

Keyword(s):

Open Source ◽

High Throughput Sequencing ◽

Reference Genome ◽

Structural Variants ◽

Insert Size ◽

Reference Allele ◽

Sequencing Platform ◽

Mate Pair ◽

Oxford Nanopore ◽

Long Read

Visualizing read alignments is the most effective way to validate candidate SVs with existing data. We present svviz, a sequencing read visualizer for structural variants (SVs) that sorts and displays only reads relevant to a candidate SV. svviz works by searching input BAM file(s) for potentially relevant reads, realigning them against the inferred sequence of the putative variant allele as well as the reference allele, and identifying reads that match one allele better than the other. Reads are assigned to the proper allele based on alignment score, read pair orientation and insert size. Separate views of the two alleles are then displayed in a scrollable web browser view, enabling a more intuitive visualization of each allele, compared to the single reference genome-based view common to most current read browsers. The web view facilitates examining the evidence for or against a putative variant, estimating zygosity, visualizing affected genomic annotations, and manual refinement of breakpoints. An optional command-line-only interface allows summary statistics and graphics to be exported directly to standard graphics file formats. svviz requires as input only structural variant coordinates (called using any other software package), reads in BAM format, and a reference genome. Reads from any high-throughput sequencing platform are supported, including Illumina short-read, mate-pair, synthetic long-read (assembled), Pacific Biosciences, and Oxford Nanopore. svviz is open source and freely available from https://github.com/svviz/svviz.
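
The allele-assignment step described above (realign each read to both alleles, then keep the better match) can be sketched in a few lines. This is an illustration of the idea only, not svviz's implementation: the function name, the `min_margin` parameter, and the example scores are assumptions, and the sketch uses alignment score alone, whereas the abstract states that read pair orientation and insert size are also considered.

```python
def assign_allele(ref_score, alt_score, min_margin=1):
    """Assign a read to the allele it aligns to better.

    ref_score, alt_score: alignment scores of the same read against the
    reference allele and the putative variant allele (higher is better).
    min_margin: smallest score difference treated as decisive (assumed).
    """
    if alt_score - ref_score >= min_margin:
        return "alt"
    if ref_score - alt_score >= min_margin:
        return "ref"
    return "ambiguous"


# Hypothetical (ref_score, alt_score) pairs for three reads.
read_scores = [(50, 60), (72, 40), (55, 55)]
assignments = [assign_allele(ref, alt) for ref, alt in read_scores]
```

Counting the "ref" and "alt" assignments at a site gives the kind of per-allele evidence that the two separate views are meant to display.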

Identification of High-Confidence Structural Variants in Domesticated Rainbow Trout Using Whole-Genome Sequencing

Frontiers in Genetics ◽

10.3389/fgene.2021.639355 ◽

2021 ◽

Vol 12 ◽

Author(s):

Sixin Liu ◽

Guangtu Gao ◽

Ryan M. Layer ◽

Gary H. Thorgaard ◽

Gregory D. Wiens ◽

...

Keyword(s):

Rainbow Trout ◽

Repetitive Dna ◽

Dna Content ◽

Repetitive Sequences ◽

Principal Component ◽

Whole Genome ◽

Structural Variants ◽

High Confidence ◽

Breeding Populations ◽

Or Gene

Genomic structural variants (SVs) are a major source of genetic and phenotypic variation but have not been investigated systematically in rainbow trout (Oncorhynchus mykiss), an important aquaculture species of cold freshwater. The objectives of this study were 1) to identify and validate high-confidence SVs in rainbow trout using whole-genome re-sequencing; and 2) to examine the contribution of transposable elements (TEs) to SVs in rainbow trout. A total of 96 rainbow trout, including 11 homozygous lines and 85 outbred fish from three breeding populations, were whole-genome sequenced with an average genome coverage of 17.2×. Putative SVs were identified using the program Smoove which integrates LUMPY and other associated tools into one package. After rigorous filtering, 13,863 high-confidence SVs were identified. Pacific Biosciences long-reads of Arlee, one of the homozygous lines used for SV detection, validated 98% (3,948 of 4,030) of the high-confidence SVs identified in the Arlee homozygous line. Based on principal component analysis, the 85 outbred fish clustered into three groups consistent with their populations of origin, further indicating that the high-confidence SVs identified in this study are robust. The repetitive DNA content of the high-confidence SV sequences was 86.5%, which is much higher than the 57.1% repetitive DNA content of the reference genome, and is also higher than the repetitive DNA content of Atlantic salmon SVs reported previously. TEs thus contribute substantially to SVs in rainbow trout as TEs make up the majority of repetitive sequences. Hundreds of the high-confidence SVs were annotated as exon-loss or gene-fusion variants, and may have phenotypic effects. The high-confidence SVs reported in this study provide a foundation for further rainbow trout SV studies.

Whole genome sequencing of sporadic Burkitt lymphoma in HIV-infected and uninfected patients.

Journal of Clinical Oncology ◽

10.1200/jco.2013.31.15_suppl.8577 ◽

2013 ◽

Vol 31 (15_suppl) ◽

pp. 8577-8577

Author(s):

Deborah Ritter ◽

Kimberly Walker ◽

Myoung Kwon ◽

Premal Lulla ◽

Catherine M. Bollard ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Dna Sequences ◽

Burkitt Lymphoma ◽

Structural Variation ◽

Reference Genome ◽

Statistical Significance ◽

Whole Genome ◽

Average Insert Size ◽

Structural Variants

Background: Burkitt Lymphoma is defined by canonical translocations between MYC and immunoglobulin IgH, IgK or IgL (8:14, 8:2, 8:22, respectively), and is commonly associated with HIV. The identification of HIV from sequenced samples is critical to understanding HIV-associated Burkitt Lymphoma. While recent novel gene mutations (ID3 and TCF3) have been implicated in functional roles, concomitant genomic structural variants and the interaction of HIV with structural variation are less well defined. Methods: We sequenced the whole genomes of 15 patients with 100bp paired-end reads on the Illumina Hi-Seq platform, resulting in an average insert size of 278 (+/- 63) and coverage of 60X tumor and 30X normal. We included 7 HIV-negative and 8 HIV-positive subjects. Sequencing reads were mapped to the reference genome using BWA. Large-scale structural variation was detected by the BreakDancer and Crest programs. Functional annotation was used to prioritize structural variants for validation. Single nucleotide variants and small insertions and deletions were detected by CARNAC, a somatic variation discovery pipeline. The subset of WGS reads that failed to align to the human reference genome was tested for the presence of HIV sequences by comparing the unmapped reads to a database of viral DNA sequences which included the common subtypes of HIV defined by Los Alamos. Reads matching HIV or EBV with an expectation value of <10^-4 were analyzed to determine virus coverage and viral integration sites. Results: Canonical MYC-IgH translocations were identified in 9/15 (60%) tumor samples, with 2 additional subjects harboring either a deletion or an inversion near exon1 of MYC; 4 had no MYC rearrangement. MYC translocations occurred equally in both groups. TP53 and SMARC4 point mutations were observed recurrently in the HIV uninfected group but not in the HIV infected patients. Variable levels of HIV DNA sequence were observed in normal tissue of all HIV infected patients. Conclusions: Whole genome sequencing has identified known somatic variants in HIV infected and uninfected patients. Two genes, TP53 and SMARC4, appear to be differentially mutated, but additional samples are needed to achieve statistical significance.
