Haplotype-resolved diverse human genomes and integrated analysis of structural variation

Peter Ebert; Peter A. Audano; Qihui Zhu; Bernardo Rodriguez-Martin; David Porubsky; Marc Jan Bonder; Arvis Sulovari; Jana Ebler; Weichen Zhou; Rebecca Serra Mari; Feyza Yilmaz; Xuefang Zhao; PingHsun Hsieh; Joyce Lee; Sushant Kumar; Jiadong Lin; Tobias Rausch; Yu Chen; Jingwen Ren; Martin Santamarina; Wolfram Höps; Hufsah Ashraf; Nelson T. Chuang; Xiaofei Yang; Katherine M. Munson; Alexandra P. Lewis; Susan Fairley; Luke J. Tallon; Wayne E. Clarke; Anna O. Basile; Marta Byrska-Bishop; André Corvelo; Uday S. Evani; Tsung-Yu Lu; Mark J. P. Chaisson; Junjie Chen; Chong Li; Harrison Brand; Aaron M. Wenger; Maryam Ghareghani; William T. Harvey; Benjamin Raeder; Patrick Hasenfeld; Allison A. Regier; Haley J. Abel; Ira M. Hall; Paul Flicek; Oliver Stegle; Mark B. Gerstein; Jose M. C. Tubio; Zepeng Mu; Yang I. Li; Xinghua Shi; Alex R. Hastie; Kai Ye; Zechen Chong; Ashley D. Sanders; Michael C. Zody; Michael E. Talkowski; Ryan E. Mills; Scott E. Devine; Charles Lee; Jan O. Korbel; Tobias Marschall; Evan E. Eichler

doi:10.1126/science.abf7117

Science ◽

10.1126/science.abf7117 ◽

2021 ◽

Vol 372 (6537) ◽

pp. eabf7117 ◽

Cited By ~ 4

Author(s):

Peter Ebert ◽

Peter A. Audano ◽

Qihui Zhu ◽

Bernardo Rodriguez-Martin ◽

David Porubsky ◽

...

Keyword(s):

Structural Variation ◽

De Novo ◽

Mobile Element ◽

Integrated Analysis ◽

Base Pairs ◽

Adaptive Selection ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Human Genomes ◽

Long Read

Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent-child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average minimum contig length needed to cover 50% of the genome: 26 million base pairs) integrate all forms of genetic variation, even across complex loci. We identified 107,590 structural variants (SVs), of which 68% were not discovered with short-read sequencing, and 278 SV hotspots (spanning megabases of gene-rich sequence). We characterized 130 of the most active mobile element source elements and found that 63% of all SVs arise through homology-mediated mechanisms. This resource enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1526 expression quantitative trait loci as well as SV candidates for adaptive selection within the human population.
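The contiguity figure quoted in this abstract is the contig N50. As a minimal sketch (the function name and the toy lengths are illustrative assumptions, not taken from the paper), the statistic can be computed by sorting contigs from longest to shortest and walking down the list until half of the assembled bases are covered:

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length of the contig at which the running total of
    lengths, taken from longest to shortest, first reaches half of
    the total assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in arbitrary units).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Note that this sketch normalizes by the assembly's own total length; a statistic normalized by an estimated genome size is usually reported as NG50 instead.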

De novo assembly of 64 haplotype-resolved human genomes of diverse ancestry and integrated analysis of structural variation

10.1101/2020.12.16.423102 ◽

2020 ◽

Author(s):

Peter Ebert ◽

Peter A. Audano ◽

Qihui Zhu ◽

Bernardo Rodriguez-Martin ◽

David Porubsky ◽

...

Keyword(s):

Structural Variation ◽

De Novo ◽

Mobile Element ◽

Integrated Analysis ◽

Adaptive Selection ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Histocompatibility Complex ◽

Human Genomes ◽

Long Read

Abstract Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent–child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average contig N50: 26 Mbp) integrate all forms of genetic variation across even complex loci such as the major histocompatibility complex. We focus on 107,590 structural variants (SVs), of which 68% are inaccessible by short-read sequencing. We identify new SV hotspots (spanning megabases of gene-rich sequence), characterize 130 of the most active mobile element source elements, and find that 63% of all SVs arise by homology-mediated mechanisms, a twofold increase from previous studies. Our resource now enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1,525 expression quantitative trait loci (SV-eQTLs) as well as SV candidates for adaptive selection within the human population.

Comparative Annotation Toolkit (CAT) - simultaneous clade and personal genome annotation

10.1101/231118 ◽

2017 ◽

Cited By ~ 6

Author(s):

Ian T. Fiddes ◽

Joel Armstrong ◽

Mark Diekhans ◽

Stefanie Nachtweide ◽

Zev N. Kronenberg ◽

...

Keyword(s):

Genome Annotation ◽

De Novo ◽

Low Cost ◽

Great Apes ◽

Personal Genome ◽

Sequencing Technologies ◽

Human Genomes ◽

Long Read ◽

Genome Assemblies ◽

Rat Genome

Abstract The recent introductions of low-cost, long-read, and read-cloud sequencing technologies coupled with intense efforts to develop efficient algorithms have made affordable, high-quality de novo sequence assembly a realistic proposition. The result is an explosion of new, ultra-contiguous genome assemblies. To compare these genomes we need robust methods for genome annotation. We describe the fully open source Comparative Annotation Toolkit (CAT), which provides a flexible way to simultaneously annotate entire clades and identify orthology relationships. We show that CAT can be used to improve annotations on the rat genome, annotate the great apes, annotate a diverse set of mammals, and annotate personal, diploid human genomes. We demonstrate the resulting discovery of novel genes, isoforms and structural variants, even in genomes as well studied as rat and the great apes, and how these annotations improve cross-species RNA expression experiments.

Cas9 targeted enrichment of mobile elements using nanopore sequencing

10.1101/2021.02.10.430605 ◽

2021 ◽

Author(s):

Torrin L. McDonald ◽

Weichen Zhou ◽

Christopher Castro ◽

Camille Mumm ◽

Jessica A. Switzenberg ◽

...

Keyword(s):

Genetic Disorders ◽

Mobile Element ◽

Flow Cell ◽

Read Length ◽

Nanopore Sequencing ◽

Short Read Sequencing ◽

Human Genomes ◽

Long Read ◽

Genomic Regions ◽

Targeted Enrichment

Abstract Mobile element insertions (MEIs) are highly repetitive genomic sequences that contribute to inter- and intra-individual genetic variation and can lead to genetic disorders. Targeted and whole-genome approaches using short-read sequencing have been developed to identify reference and non-reference MEIs; however, the read length hampers detection of these elements in complex genomic regions. Here, we pair Cas9 targeted nanopore sequencing with computational methodologies to capture active MEIs in human genomes. We demonstrate parallel enrichment for distinct classes of MEIs, averaging 44% of reads on targeted signals. We show an individual flow cell can recover a remarkable fraction of MEIs (97% L1Hs, 93% AluYb, 51% AluYa, 99% SVA_F, and 65% SVA_E). We identify twenty-one non-reference MEIs in GM12878 overlooked by modern, long-read analysis pipelines, primarily in repetitive genomic regions. This work introduces the utility of nanopore sequencing for MEI enrichment and lays the foundation for rapid discovery of elusive, repetitive genetic elements.

WENGAN: Efficient and high quality hybrid de novo assembly of human genomes

10.1101/840447 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alex Di Genova ◽

Elena Buena-Atienza ◽

Stephan Ossowski ◽

Marie-France Sagot

Keyword(s):

De Novo ◽

Computational Cost ◽

Sequence Information ◽

Sequencing Data ◽

High Quality ◽

Sequencing Technologies ◽

Human Genomes ◽

Long Reads ◽

Long Read ◽

Genome Assemblies

The continuous improvement of long-read sequencing technologies along with the development of ad hoc algorithms has launched a new de novo assembly era that promises high-quality genomes. However, it has proven difficult to use only long reads to generate accurate genome assemblies of large, repeat-rich human genomes. To date, most of the human genomes assembled from long error-prone reads add accurate short reads to further polish the consensus quality. Here, we report the development of a novel algorithm for hybrid assembly, WENGAN, and the de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms that exploit the sequence information of short and long reads to tackle assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: 16.67-62.06 Mb), few assembly errors (contig NGA50: 10.9-45.91 Mb), good consensus quality (QV: 27.79-33.61), and high gene completeness (BUSCO complete: 94.6-95.1%), while consuming low computational resources (CPU hours: 153-1027). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 62.06 Mb (NGA50: 45.91 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb). Providing the highest quality at low computational cost, WENGAN is an important step towards the democratization of the de novo assembly of human genomes. The WENGAN assembler is available at https://github.com/adigenova/wengan
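Two of the metrics in this abstract follow simple formulas. The sketch below is illustrative only: the function names and toy numbers are assumptions, and the QV helper assumes the usual Phred scaling (QV = -10 log10 of the per-base error rate), which the abstract does not spell out. NG50 differs from N50 in that the half-way threshold comes from an estimated genome size rather than from the assembly's own total length.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50, or None if the contigs cover less than half
    of the given genome size.

    Contigs are visited from longest to shortest until their combined
    length reaches half of genome_size (not half of the assembly).
    """
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return None


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


# Hypothetical toy values (lengths in arbitrary units).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(ng50(toy_contigs, genome_size=400))
print(qv_to_error_rate(30))
```

NGA50, also reported above, additionally breaks contigs at misassemblies detected by alignment to a reference, so it cannot be derived from contig lengths alone and is not sketched here.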

Cas9 targeted enrichment of mobile elements using nanopore sequencing

Nature Communications ◽

10.1038/s41467-021-23918-y ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Torrin L. McDonald ◽

Weichen Zhou ◽

Christopher P. Castro ◽

Camille Mumm ◽

Jessica A. Switzenberg ◽

...

Keyword(s):

Genetic Disorders ◽

Mobile Element ◽

Read Length ◽

Whole Genome ◽

Nanopore Sequencing ◽

Short Read Sequencing ◽

Human Genomes ◽

Long Read ◽

Genomic Regions ◽

Targeted Enrichment

Abstract Mobile element insertions (MEIs) are repetitive genomic sequences that contribute to genetic variation and can lead to genetic disorders. Targeted and whole-genome approaches using short-read sequencing have been developed to identify reference and non-reference MEIs; however, the read length hampers detection of these elements in complex genomic regions. Here, we pair Cas9-targeted nanopore sequencing with computational methodologies to capture active MEIs in human genomes. We demonstrate parallel enrichment for distinct classes of MEIs, averaging 44% of reads on targeted signals and exhibiting a 13.4-54x enrichment over whole-genome approaches. We show an individual flow cell can recover most MEIs (97% L1Hs, 93% AluYb, 51% AluYa, 99% SVA_F, and 65% SVA_E). We identify seventeen non-reference MEIs in GM12878 overlooked by modern, long-read analysis pipelines, primarily in repetitive genomic regions. This work introduces the utility of nanopore sequencing for MEI enrichment and lays the foundation for rapid discovery of elusive, repetitive genetic elements.

MiniScrub: de novo long read scrubbing using approximate alignment and deep learning

10.1101/433573 ◽

2018 ◽

Cited By ~ 1

Author(s):

Nathan LaPierre ◽

Rob Egan ◽

Wei Wang ◽

Zhong Wang

Keyword(s):

Open Source Software ◽

Genome Assembly ◽

Large Scale ◽

Structural Variation ◽

De Novo ◽

Error Rates ◽

Sequencing Technologies ◽

Oxford Nanopore ◽

Long Read ◽

Reference Genomes

Abstract Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. Many methods for resolving these errors require access to reference genomes or high-fidelity short reads, which are often not available. De novo error correction modules are available, often as part of assembly tools, but large-scale errors still remain in resulting assemblies, motivating further innovation in this area. We developed a novel Convolutional Neural Network (CNN) based method, called MiniScrub, for de novo identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments. MiniScrub first generates read-to-read alignments by MiniMap, then encodes the alignments into images, and finally builds CNN models to predict low-quality segments that could be scrubbed based on a customized quality cutoff. Applying MiniScrub to real world control datasets under several different parameters, we show that it robustly improves read quality. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub
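The final "scrubbing" step described above can be illustrated with a short sketch. This is not MiniScrub's code: the function name, the fixed segment length, and the per-segment scores (stand-ins for what a CNN might predict) are all assumptions made for illustration. The idea is to drop segments whose predicted quality falls below a cutoff and to split the read at each dropped segment.

```python
def scrub_read(read, scores, segment_len, cutoff, min_keep=1):
    """Remove low-quality segments from a read.

    read        -- the read sequence as a string
    scores      -- one predicted quality score per consecutive segment
                   (hypothetical stand-in for model output)
    segment_len -- number of bases covered by each score
    cutoff      -- segments scoring below this value are removed
    min_keep    -- discard surviving pieces shorter than this

    Returns the list of retained sub-reads, in order.
    """
    kept, current = [], []
    for i, score in enumerate(scores):
        segment = read[i * segment_len:(i + 1) * segment_len]
        if score >= cutoff:
            current.append(segment)
        elif current:
            # A low-quality segment ends the current retained piece.
            kept.append("".join(current))
            current = []
    if current:
        kept.append("".join(current))
    return [piece for piece in kept if len(piece) >= min_keep]


# Hypothetical read with one low-scoring segment in the middle.
pieces = scrub_read("AAAACCCCGGGGTTTT", [0.9, 0.2, 0.8, 0.95],
                    segment_len=4, cutoff=0.5)
print(pieces)
```

Raising the cutoff removes more sequence but leaves cleaner pieces; the abstract describes this threshold as user-customizable.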

Deep repeat resolution—the assembly of the Drosophila Histone Complex

Nucleic Acids Research ◽

10.1093/nar/gky1194 ◽

2018 ◽

Vol 47 (3) ◽

pp. e18-e18 ◽

Cited By ~ 2

Author(s):

Philipp Bongartz ◽

Siegfried Schloissnig

Keyword(s):

De Novo ◽

Machine Learning Algorithms ◽

Single Nucleotide Variants ◽

Major Step ◽

Base Pairs ◽

Sequencing Technologies ◽

Long Reads ◽

Wide Range ◽

Long Read ◽

Genome Assemblies

Abstract Though the advent of long-read sequencing technologies has led to a leap in contiguity of de novo genome assemblies, current reference genomes of higher organisms still do not provide unbroken sequences of complete chromosomes. Despite reads in excess of 30 000 base pairs, there are still repetitive structures that cannot be resolved by current state-of-the-art assemblers. The most challenging of these structures are tandemly arrayed repeats, which occur in the genomes of all eukaryotes. Untangling tandem repeat clusters is exceptionally difficult, since the rare differences between repeat copies are obscured by the high error rate of long reads. Solving this problem would constitute a major step towards computing fully assembled genomes. Here, we demonstrate by example of the Drosophila Histone Complex that via machine learning algorithms, it is possible to exploit the underlying distinguishing patterns of single nucleotide variants of repeats from very noisy data to resolve a large and highly conserved repeat cluster. The ideas explored in this paper are a first step towards the automated assembly of complex repeat structures and promise to be applicable to a wide range of eukaryotic genomes.

MUM&Co: accurate detection of all SV types through whole-genome alignment

Bioinformatics ◽

10.1093/bioinformatics/btaa115 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3242-3243 ◽

Cited By ~ 2

Author(s):

Samuel O’Donnell ◽

Gilles Fischer

Keyword(s):

De Novo ◽

Supplementary Information ◽

Genome Alignment ◽

Whole Genome ◽

Structural Variations ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Human Genomes ◽

Whole Genome Alignment ◽

Primary Output

Abstract Summary: MUM&Co is a single bash script to detect structural variations (SVs) utilizing whole-genome alignment (WGA). Using MUMmer’s nucmer alignment, MUM&Co can detect insertions, deletions, tandem duplications, inversions and translocations greater than 50 bp. Its versatility depends upon the WGA and therefore benefits from contiguous de novo assemblies generated by third generation sequencing technologies. Benchmarked against five WGA SV-calling tools, MUM&Co outperforms all tools on simulated SVs in yeast, plant and human genomes and performs similarly in two real human datasets. Additionally, MUM&Co is notable for its ability to find inversions in both simulated and real datasets. Lastly, MUM&Co’s primary output is an intuitive tabulated file containing a list of SVs with only necessary genomic details. Availability and implementation: https://github.com/SAMtoBAM/MUMandCo. Supplementary information: Supplementary data are available at Bioinformatics online.
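To make the idea of alignment-based SV calling concrete, here is a deliberately simplified sketch. It is not MUM&Co's logic (which is a bash script built on nucmer output); the block tuple layout and function name are assumptions. Given consecutive, colinear, same-strand alignment blocks, a surplus of unaligned reference sequence between two blocks suggests a deletion in the query, and a surplus of unaligned query sequence suggests an insertion. Only events longer than 50 bp are reported, matching the size threshold stated in the abstract.

```python
MIN_SV_LEN = 50  # report only events strictly longer than this (bp)


def call_indels(blocks, min_len=MIN_SV_LEN):
    """Call insertions and deletions from gaps between alignment blocks.

    blocks -- list of (ref_start, ref_end, qry_start, qry_end) tuples,
              sorted by position, colinear and on the same strand
              (a hypothetical, simplified input format)

    Returns a list of (sv_type, ref_position, size) tuples.
    """
    calls = []
    for prev, nxt in zip(blocks, blocks[1:]):
        ref_gap = nxt[0] - prev[1]  # unaligned reference bases
        qry_gap = nxt[2] - prev[3]  # unaligned query bases
        diff = qry_gap - ref_gap
        if diff > min_len:
            calls.append(("insertion", prev[1], diff))
        elif -diff > min_len:
            calls.append(("deletion", prev[1], -diff))
    return calls


# Hypothetical blocks: a 200 bp deletion, a 100 bp insertion,
# and a balanced 10 bp gap that is ignored.
toy_blocks = [
    (0, 1000, 0, 1000),
    (1200, 2000, 1000, 1800),
    (2000, 3000, 1900, 2900),
    (3010, 4000, 2910, 3900),
]
print(call_indels(toy_blocks))
```

Duplications, inversions and translocations need strand and ordering information across blocks and are outside the scope of this sketch.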

Mapping and phasing of structural variation in patient genomes using nanopore sequencing

10.1101/129379 ◽

2017 ◽

Cited By ~ 4

Author(s):

Mircea Cretu Stancu ◽

Markus J. van Roosmalen ◽

Ivo Renkens ◽

Marleen Nieboer ◽

Sjors Middelkamp ◽

...

Keyword(s):

Single Molecule ◽

De Novo ◽

Structural Variants ◽

Human Genetic Disease ◽

Structural Genomic ◽

Short Read ◽

Sequencing Technologies ◽

Genome Wide ◽

Long Read ◽

Complex Structural

Abstract Structural genomic variants form a common type of genetic alteration underlying human genetic disease and phenotypic variation. Despite major improvements in genome sequencing technology and data analysis, the detection of structural variants still poses challenges, particularly when variants are of high complexity. Emerging long-read single-molecule sequencing technologies provide new opportunities for detection of structural variants. Here, we demonstrate sequencing of the genomes of two patients with congenital abnormalities using the ONT MinION at 11x and 16x mean coverage, respectively. We developed a bioinformatic pipeline, NanoSV, to efficiently map genomic structural variants (SVs) from the long-read data. We demonstrate that the nanopore data are superior to corresponding short-read data with regard to detection of de novo rearrangements originating from complex chromothripsis events in the patients. Additionally, genome-wide surveillance of SVs revealed 3,253 (33%) novel variants that were missed in short-read data of the same sample, the majority of which are duplications < 200 bp in size. Long sequencing reads enabled efficient phasing of genetic variations, allowing the construction of genome-wide maps of phased SVs and SNVs. We employed read-based phasing to show that all de novo chromothripsis breakpoints occurred on paternal chromosomes and we resolved the long-range structure of the chromothripsis. This work demonstrates the value of long-read sequencing for screening whole genomes of patients for complex structural variants.

De novo assembly of a Tibetan genome and identification of novel structural variants associated with high-altitude adaptation

National Science Review ◽

10.1093/nsr/nwz160 ◽

2019 ◽

Vol 7 (2) ◽

pp. 391-402 ◽

Cited By ~ 3

Author(s):

Yaoxi He ◽

Haiyi Lou ◽

Chaoying Cui ◽

Lian Deng ◽

Yang Gao ◽

...

Keyword(s):

High Altitude ◽

De Novo Assembly ◽

De Novo ◽

Population Analysis ◽

Extreme Environments ◽

Structural Variants ◽

Base Pairs ◽

High Quality ◽

Long Read ◽

Altitude Adaptation

Abstract Structural variants (SVs) may play important roles in human adaptation to extreme environments such as high altitude but have been under-investigated. Here, combining long-read sequencing with multiple scaffolding techniques, we assembled a high-quality Tibetan genome (ZF1), with a contig N50 length of 24.57 mega-base pairs (Mb) and a scaffold N50 length of 58.80 Mb. The ZF1 assembly filled 80 remaining N-gaps (0.25 Mb in total length) in the reference human genome (GRCh38). Markedly, we detected 17 900 SVs, among which the ZF1-specific SVs are enriched in GTPase activity that is required for activation of the hypoxic pathway. Further population analysis uncovered a 163-bp intronic deletion in the MKL1 gene showing large divergence between highland Tibetans and lowland Han Chinese. This deletion is significantly associated with lower systolic pulmonary arterial pressure, one of the key adaptive physiological traits in Tibetans. Moreover, with the use of the high-quality de novo assembly, we observed a much higher rate of genome-wide archaic hominid (Altai Neanderthal and Denisovan) shared non-reference sequences in ZF1 (1.32%–1.53%) compared to other East Asian genomes (0.70%–0.98%), reflecting a unique genomic composition of Tibetans. One such archaic hominid shared sequence—a 662-bp intronic insertion in the SCUBE2 gene—is enriched and associated with better lung function (the FEV1/FVC ratio) in Tibetans. Collectively, we generated the first high-resolution Tibetan reference genome, and the identified SVs may serve as valuable resources for future evolutionary and medical studies.
