Resolving the Full Spectrum of Human Genome Variation using Linked-Reads

Mapping Intimacies ◽

10.1101/230946 ◽

2017 ◽

Cited By ~ 8

Author(s):

Patrick Marks ◽

Sarah Garcia ◽

Alvaro Martinez Barrio ◽

Kamila Belhocine ◽

Jorge Bernate ◽

...

Keyword(s):

Human Genome ◽

Large Scale ◽

De Novo ◽

Simultaneous Detection ◽

Whole Genome ◽

Structural Variations ◽

Full Spectrum ◽

Short Read ◽

Short Reads ◽

A Genome

Abstract Large-scale population based analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short read whole genome sequencing. However, standard short-read approaches, used primarily due to accuracy, throughput and costs, fail to give a complete picture of a genome. They struggle to identify large, balanced structural events, cannot access repetitive regions of the genome and fail to resolve the human genome into its two haplotypes. Here we describe an approach that retains long range information while harnessing the advantages of short reads. Starting from only ∼1 ng of DNA, we produce barcoded short read libraries. The use of novel informatic approaches allows for the barcoded short reads to be associated with the long molecules of origin producing a novel datatype known as ‘Linked-Reads’. This approach allows for simultaneous detection of small and large variants from a single Linked-Read library. We have previously demonstrated the utility of whole genome Linked-Reads (lrWGS) for performing diploid, de novo assembly of individual genomes (Weisenfeld et al. 2017). In this manuscript, we show the advantages of Linked-Reads over standard short read approaches for reference based analysis. We demonstrate the ability of Linked-Reads to reconstruct megabase scale haplotypes and to recover parts of the genome that are typically inaccessible to short reads, including phenotypically important genes such as STRC, SMN1 and SMN2. We demonstrate the ability of both lrWGS and Linked-Read Whole Exome Sequencing (lrWES) to identify complex structural variations, including balanced events, single exon deletions, and single exon duplications. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.

Family Trio-Based Whole Genome Optical Mapping Identifies Candidate Structural Variations Predisposing Children to Acute Lymphoblastic Leukemia

Blood ◽

10.1182/blood-2019-130014 ◽

2019 ◽

Vol 134 (Supplement_1) ◽

pp. 5201-5201

Author(s):

Ute Fischer ◽

Layal Yasin ◽

Julia Täubner ◽

Triantafyllia Brozou ◽

Arndt Borkhardt

Keyword(s):

Acute Lymphoblastic Leukemia ◽

Family History ◽

De Novo ◽

Lymphoblastic Leukemia ◽

Germline Mutations ◽

Whole Genome ◽

Structural Variations ◽

Short Read ◽

Short Read Sequencing ◽

Family Trio

Germline mutations account for a substantial proportion of childhood cancer and may critically affect disease characteristics, therapy efficacy, severity of treatment side effects and patient outcome. To date, only 8-10% of childhood cancer cases can be explained by germline mutations identified in known cancer predisposing genes. This is in part due to the technical limitation of next generation short read sequencing, which detects single nucleotide variants, small deletions/insertions or simple copy number variations, but is not a reliable tool to identify larger structural variations (SVs, >500 bp) which are frequent in the human genome and may impact on disease predisposition. Using whole genome optical mapping (WGOM) we aimed at identification of de novo and inherited germline SVs in a cohort of patients with clinically suspected cancer predisposition but without informative findings in short read sequencing analyses. After informed consent we performed family trio based short read (2x 100 bp) whole exome sequencing (WES) on a HiSeq2500 (Illumina) and collected clinical and demographic data for a cohort of >100 families with children affected by cancer who were treated in our hospital. About 25% of the patients either (1) had a family history indicative of cancer susceptibility, or (2) had accompanying clinical findings (e.g. developmental delay, congenital anomalies) or (3) experienced excessive toxicity during chemotherapy. From this subgroup we selected four patients with acute lymphoblastic leukemia whose sequencing data and routine genetic workup were not informative of a known cancer predisposing syndrome and employed family trio-based next generation WGOM on a Saphyr instrument equipped with Access software (Bionano Genomics) to identify genomic SVs. To this end, we extracted and labeled high molecular weight DNA molecules at specific hexamer sequence motifs (average distance: 5 kb) using a DNA methyltransferase-based direct labeling reaction. 
Imaging was carried out at the single-molecule level and each sample genome was de novo assembled from molecule data. Consensus genome maps were clustered into two alleles and diploid assemblies created. Genomes of patients were compared to parental genomes and the GRCh38 reference genome. SVs were inferred from de novo assemblies and genome comparisons with respect to quality scores, overall molecule coverage, fraction of molecules displaying the SV event, and chimeric DNA fragment mapping. Specific SV calls were compared to a set of >160 human control samples (provided by Bionano Genomics) to filter against common SVs and potential artifacts. Filtered SVs were annotated using structural variant and gene databases. Employing WGOM we analyzed DNA molecules 300,000 bp long on average and achieved genomic coverage ranging from 90-132x corresponding to 330-480 Gbp. For instance, for one patient, we obtained 1751 insertions, 624 deletions, 77 inversions, 21 duplications, 1 intra- and 2 inter-chromosomal translocations before filtering. The majority of these events (78%) were inherited from both parents. 20% were inherited from either father or mother and 2% were generated de novo. As the family history of this patient was inconspicuous for tumor diseases, we removed all inherited events and filtered against common variants. This resulted in only two candidate de novo lesions: a heterozygous 129,495 bp deletion framed by inversions (chr9: 66,156,733-66,622,623) in a gene-less region and a heterozygous inverted 352,667 bp duplication (chr22: 15,522,454-15,875,120) that spanned the genes OR11H, POTEH, POTEH-AS1, LINC01297, DUXAP8, and BMS1P22. Of these genes DUXAP8 is an oncogenic non-coding RNA of the homeobox gene family that has been associated with increased tumor growth and poorer prognosis in a wide variety of somatic cancers. 
It functions as a regulator of transcription by binding to key components of the developmental regulator epigenetic polycomb repressive complex 2 and may thus account for additional presentations of the child (dwarfism, accelerated skeletal age, linguistic developmental delay, morphological traits). Our results indicate that WGOM is a useful technology to identify candidate SVs in children predisposed to cancer and developmental syndromes. Several candidates are currently being tested and the results will be presented. Disclosures No relevant conflicts of interest to declare.
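
The trio-based filtering described in this abstract (keep only events seen in the child but in neither parent) can be illustrated with a minimal Python sketch. It is not the Bionano Access workflow: the function names are hypothetical, and presence of an SV in each parent is reduced to a boolean, whereas a real comparison would match events by position and size within some tolerance. The span helper assumes inclusive coordinates, which reproduces the 352,667 bp size reported for the chr22 duplication.

```python
def classify_inheritance(in_father: bool, in_mother: bool) -> str:
    """Label a child's SV by which parents also carry it (hypothetical helper)."""
    if in_father and in_mother:
        return "inherited from both parents"
    if in_father:
        return "inherited from father"
    if in_mother:
        return "inherited from mother"
    return "de novo candidate"


def span_length(start: int, end: int) -> int:
    """Length of a genomic interval, assuming inclusive 1-based coordinates."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


# Toy calls: (SV identifier, seen in father, seen in mother). Values are illustrative only.
child_calls = [("sv1", True, True), ("sv2", False, True), ("sv3", False, False)]
de_novo = [name for name, father, mother in child_calls
           if classify_inheritance(father, mother) == "de novo candidate"]

# Coordinates of the chr22 duplication quoted in the abstract.
duplication_size = span_length(15_522_454, 15_875_120)
```

In this toy input only `sv3` survives the filter, mirroring how the abstract narrows thousands of calls down to the few that are absent from both parents.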

Detection and assembly of novel sequence insertions using Linked-Read technology

10.1101/551028 ◽

2019 ◽

Cited By ~ 3

Author(s):

Dmitry Meleshko ◽

Patrick Marks ◽

Stephen Williams ◽

Iman Hajirasouliha

Keyword(s):

Dna Sequences ◽

De Novo Assembly ◽

De Novo ◽

Supplementary Information ◽

Computational Techniques ◽

Whole Genome ◽

Structural Variations ◽

Short Read ◽

Link Type ◽

Long Read

Abstract Motivation: Emerging Linked-Read (aka read-cloud) technologies such as the 10x Genomics Chromium system have great potential for accurate detection and phasing of large-scale human genome structural variations (SVs). By leveraging the long-range information encoded in Linked-Read sequencing, computational techniques are able to detect and characterize complex structural variations that were previously undetectable by short-read methods. However, there is no available Linked-Read method for detection and assembly of novel sequence insertions, DNA sequences present in a given sequenced sample but missing in the reference genome, without requiring whole genome de novo assembly. In this paper, we propose a novel integrated alignment-based and local-assembly-based algorithm, Novel-X, that effectively uses the barcode information encoded in Linked-Read sequencing datasets to improve detection of such events without the need of whole genome de novo assembly. We evaluated our method on two haploid human genomes, CHM1 and CHM13, sequenced on the 10x Genomics Chromium system. These genomes have also recently been characterized with high coverage PacBio long reads. We also tested our method on NA12878, the well-known HapMap CEPH diploid genome, and the child genome in a Yoruba trio (NA19240), which was recently studied on multiple sequencing platforms. Detecting insertion events is very challenging using short reads and the only viable available solution is long-read sequencing (e.g. PacBio or ONT). Our experiments, however, show that Novel-X finds many insertions that cannot be found by state of the art tools using short-read sequencing data but are present in PacBio data. Since Linked-Read sequencing is significantly cheaper than long-read sequencing, our method using Linked-Reads enables routine large-scale screenings of sequenced genomes for novel sequence insertions. Availability: Software is freely available at https://github.com/1dayac/[email protected] Supplementary information: Supplementary data are available at https://github.com/1dayac/novel_insertions_supplementary

Large-scale structural variation detection in subterranean clover subtypes using optical mapping validated at nucleotide level

10.1101/232132 ◽

2017 ◽

Author(s):

Yuxuan Yuan ◽

Zbyněk Milec ◽

Philipp E. Bayer ◽

Jan Vrána ◽

Jaroslav Doležel ◽

...

Keyword(s):

Single Molecule ◽

Large Scale ◽

Optical Mapping ◽

Subterranean Clover ◽

Structural Variations ◽

Recognition Sequence ◽

Nucleotide Level ◽

Total Size ◽

Short Reads ◽

A Genome

Abstract Whole genome sequencing has been widely used to detect structural variations (SVs). However, the limited single molecule size makes it difficult to characterize large-scale SVs in a genome because single molecules cannot fully cover such vast and complex regions. Recently, optical mapping in nanochannels has provided novel resolution to detect large-scale SVs by comparing the physical location of the nickase recognition sequence in genomes. Other than in humans, SVs discovered in plants by optical mapping have not been validated. To assess the accuracy of SV calling in plants by optical mapping, we selected two genetically diverse subspecies of the Trifolium model species, subterranean clover cvs. Daliak and Yarloop. The SVs discovered by BioNano optical mapping (BOM) were validated using Illumina short reads. In the analysis, BOM identified 12 large-scale regions containing deletions and 19 containing insertions in Yarloop. The 12 large-scale regions contained 71 small deletions when validated by Illumina short reads. The results suggest that BOM could detect the total size of deletions and insertions, but it could not precisely report the location and actual quantity of SVs in the genome. Nucleotide-level validation is crucial to confirm and characterize SVs reported by optical mapping. The accuracy of SV detection by BOM is highly dependent on the quality of reference genomes and the density of selected nickases.

A study of transposable element-associated structural variations (TASVs) using a de novo-assembled Korean genome

Experimental & Molecular Medicine ◽

10.1038/s12276-021-00586-y ◽

2021 ◽

Author(s):

Seyoung Mun ◽

Songmi Kim ◽

Wooseok Lee ◽

Keunsoo Kang ◽

Thomas J. Meyer ◽

...

Keyword(s):

Genome Sequencing ◽

Genome Assembly ◽

De Novo ◽

Personal Genome ◽

Human Populations ◽

Whole Genome ◽

Structural Variations ◽

Insert Size ◽

Human Genomes ◽

Next Generation Sequencing Ngs

Abstract Advances in next-generation sequencing (NGS) technology have made personal genome sequencing possible, and indeed, many individual human genomes have now been sequenced. Comparisons of these individual genomes have revealed substantial genomic differences between human populations as well as between individuals from closely related ethnic groups. Transposable elements (TEs) are known to be one of the major sources of these variations and act through various mechanisms, including de novo insertion, insertion-mediated deletion, and TE–TE recombination-mediated deletion. In this study, we carried out de novo whole-genome sequencing of one Korean individual (KPGP9) via multiple insert-size libraries. The de novo whole-genome assembly resulted in 31,305 scaffolds with a scaffold N50 size of 13.23 Mb. Furthermore, through computational data analysis and experimental verification, we revealed that 182 TE-associated structural variation (TASV) insertions and 89 TASV deletions contributed 64,232 bp in sequence gain and 82,772 bp in sequence loss, respectively, in the KPGP9 genome relative to the hg19 reference genome. We also verified structural differences associated with TASVs by comparative analysis with TASVs in recent genomes (AK1 and TCGA genomes) and reported their details. Here, we constructed a new Korean de novo whole-genome assembly and provide the first study, to our knowledge, focused on the identification of TASVs in an individual Korean genome. Our findings again highlight the role of TEs as a major driver of structural variations in human individual genomes.
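
Several abstracts in this listing report a scaffold N50. As a reminder of what that statistic measures, the following Python sketch computes it under the usual definition: the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The function name and the toy lengths are illustrative assumptions, not data from any of the studies.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)   # longest scaffold first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:             # half of the assembly is now covered
            return length


# Toy scaffold lengths (arbitrary units); total is 290, so half is 145.
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_lengths)   # 80 + 70 = 150 reaches 145, so the N50 is 70
```

NG50, which appears in the ABySS 2.0 entry, follows the same idea but uses half of the estimated genome size instead of half of the assembly length.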

MUM&Co: accurate detection of all SV types through whole-genome alignment

Bioinformatics ◽

10.1093/bioinformatics/btaa115 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3242-3243 ◽

Cited By ~ 2

Author(s):

Samuel O’Donnell ◽

Gilles Fischer

Keyword(s):

De Novo ◽

Supplementary Information ◽

Genome Alignment ◽

Whole Genome ◽

Structural Variations ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Human Genomes ◽

Whole Genome Alignment ◽

Primary Output

Abstract Summary MUM&Co is a single bash script to detect structural variations (SVs) utilizing whole-genome alignment (WGA). Using MUMmer’s nucmer alignment, MUM&Co can detect insertions, deletions, tandem duplications, inversions and translocations greater than 50 bp. Its versatility depends upon the WGA and therefore benefits from contiguous de-novo assemblies generated by third generation sequencing technologies. Benchmarked against five WGA SV-calling tools, MUM&Co outperforms all tools on simulated SVs in yeast, plant and human genomes and performs similarly in two real human datasets. Additionally, MUM&Co is particularly unique in its ability to find inversions in both simulated and real datasets. Lastly, MUM&Co’s primary output is an intuitive tabulated file containing a list of SVs with only necessary genomic details. Availability and implementation https://github.com/SAMtoBAM/MUMandCo. Supplementary information Supplementary data are available at Bioinformatics online.

ABySS 2.0: Resource-Efficient Assembly of Large Genomes using a Bloom Filter

10.1101/068338 ◽

2016 ◽

Cited By ~ 4

Author(s):

Shaun D Jackman ◽

Benjamin P Vandervalk ◽

Hamid Mohamadi ◽

Justin Chu ◽

Sarah Yeo ◽

...

Keyword(s):

Human Genome ◽

Dna Sequences ◽

Message Passing ◽

Large Scale ◽

De Novo ◽

Bloom Filter ◽

Genomic Variation ◽

De Bruijn Graph ◽

Single Individual ◽

Probabilistic Data Structure

Abstract The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps towards elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species and between or within individuals, critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50 bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its re-design, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We present assembly benchmarks of human Genome in a Bottle 250 bp Illumina paired-end and 6 kbp mate-pair libraries from a single individual, yielding a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using less than 35 GB of RAM, a modest memory requirement by today’s standard that is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics’ Chromium data to further improve the scaffold contiguity of this assembly to 42 (15) Mbp.
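
The idea of representing a de Bruijn graph with a Bloom filter can be sketched in a few lines of Python. This is only an illustration of the general technique, not ABySS 2.0's implementation: the class, the SHA-256 based hashing, the filter size and the helper names are all assumptions made for the example. The filter stores k-mers as set bits, and graph edges are never stored explicitly; they are recovered by querying the four possible one-base extensions of a k-mer.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: membership queries may give false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def successors(kmer, bloom):
    """Outgoing neighbours in the implicit de Bruijn graph: extensions found in the filter."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


bloom = BloomFilter(num_bits=1 << 16, num_hashes=3)
for km in kmers("ACGTACGGA", 4):
    bloom.add(km)

next_kmers = successors("ACGT", bloom)   # contains "CGTA"; may rarely include false positives
```

Because only bits are stored rather than the k-mers themselves, memory use is fixed by `num_bits`, at the cost of occasional false-positive edges that an assembler has to detect and prune.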

SINE jumping contributes to large-scale polymorphisms in the pig genomes

10.21203/rs.3.rs-352249/v1 ◽

2021 ◽

Author(s):

Cai Chen ◽

Enrico D'Alessandro ◽

Eduard Murani ◽

Yao Zheng ◽

Domenico Giosa ◽

...

Keyword(s):

Genetic Analysis ◽

Population Genetic ◽

Large Scale ◽

Structural Variations ◽

Population Genetic Analysis ◽

Protein Coding ◽

Pig Breeds ◽

A Genome ◽

Trait Locus ◽

And Cluster Analysis

Abstract Background: Molecular markers based on retrotransposon insertion polymorphisms (RIPs) have been developed and are widely used in plants and animals. Short interspersed nuclear elements (SINEs) exert wide impacts on gene activity and even on phenotypes. However, SINE RIP profiles in livestock remain largely unknown and have not been revealed in pigs. Results: Our data revealed that SINEA1 displayed the most polymorphic insertions (22.5% intragenic and 26.5% intergenic), followed by SINEA2 (10.5% intragenic and 9% intergenic) and SINEA3 (12.5% intragenic and 5.0% intergenic). We developed a genome-wide SINE RIP mining protocol and obtained a large number of SINE RIPs (36,284), with over 80% accuracy and an even distribution across chromosomes (14.5/Mb), and with 74.34% of SINE RIPs generated by the SINEA1 element. Over 65% of pig SINE RIPs overlap with genes, with significant enrichment in the first and second introns of protein-coding and long non-coding RNA genes. Nearly half of the RIPs are common in these pig breeds. Sixteen SINE RIPs were applied for population genetic analysis in 23 pig breeds; the phylogenetic tree and cluster analysis were generally consistent with the geographical distributions of native pig breeds in China. Conclusions: Our analysis revealed that SINEA1–3 elements, particularly SINEA1, are highly polymorphic across different pig breeds and generate large-scale structural variations in the pig genomes. Over 35,000 SINE RIP markers were obtained. These data indicate that young SINE elements play important roles in creating new genetic variations and shaping the evolution of the pig genome, and also provide strong evidence for the great potential of SINE RIPs as genetic markers, which can be used for population genetic analysis and quantitative trait locus (QTL) mapping in pigs.

A Multireference-Based Whole Genome Assembly for the Obligate Ant-Following Antbird, Rhegmatorhina melanosticta (Thamnophilidae)

Diversity ◽

10.3390/d11090144 ◽

2019 ◽

Vol 11 (9) ◽

pp. 144 ◽

Cited By ~ 4

Author(s):

Laís Coelho ◽

Lukas Musher ◽

Joel Cracraft

Keyword(s):

Genome Assembly ◽

High Throughput Sequencing ◽

Population Genomics ◽

De Novo ◽

Structural Difference ◽

Whole Genome ◽

Sequencing Technology ◽

A Genome ◽

Avian Genomes ◽

Chromosome Level

Current generation high-throughput sequencing technology has facilitated the generation of more genomic-scale data than ever before, thus greatly improving our understanding of avian biology across a range of disciplines. Recent developments in linked-read sequencing (Chromium 10×) and reference-based whole-genome assembly offer an exciting prospect of more accessible chromosome-level genome sequencing in the near future. We sequenced and assembled a genome of the Hairy-crested Antbird (Rhegmatorhina melanosticta), which represents the first publicly available genome for any antbird (Thamnophilidae). Our objectives were to (1) assemble scaffolds to chromosome level based on multiple reference genomes, and report on differences relative to other genomes, (2) assess genome completeness and compare content to other related genomes, and (3) assess the suitability of linked-read sequencing technology for future studies in comparative phylogenomics and population genomics studies. Our R. melanosticta assembly was both highly contiguous (de novo scaffold N50 = 3.3 Mb, reference based N50 = 53.3 Mb) and relatively complete (contained close to 90% of evolutionarily conserved single-copy avian genes and known tetrapod ultraconserved elements). The high contiguity and completeness of this assembly enabled the genome to be successfully mapped to the chromosome level, which uncovered a consistent structural difference between R. melanosticta and other avian genomes. Our results are consistent with the observation that avian genomes are structurally conserved. Additionally, our results demonstrate the utility of linked-read sequencing for non-model genomics. Finally, we demonstrate the value of our R. melanosticta genome for future researchers by mapping reduced representation sequencing data, and by accurately reconstructing the phylogenetic relationships among a sample of thamnophilid species.

Towards the Human Cancer Genome Project: A Sequence-Ready Physical Map of a Follicular Lymphoma Genome.

Blood ◽

10.1182/blood.v106.11.605.605 ◽

2005 ◽

Vol 106 (11) ◽

pp. 605-605

Author(s):

Marco A. Marra ◽

Martin Krzywinski ◽

Readman Chiu ◽

Matthew Field ◽

Inanc Birol ◽

...

Keyword(s):

Follicular Lymphoma ◽

Human Genome ◽

Large Scale ◽

Reference Genome ◽

Reference Sequence ◽

Whole Genome ◽

Bac Clones ◽

Genome Maps ◽

Tumor Genome ◽

Reference Human Genome

Abstract With the aim of identifying and sequencing mutations in follicular lymphoma genomes, we have begun a project to generate at least 24 deeply redundant sequence-ready Bacterial Artificial Chromosome (BAC)-based whole genome maps, each from a different individual’s lymphoma. BAC-array CGH and Affymetrix whole-genome sampling assays (WGSA) will be used along with the mapping data to identify genomic amplifications and losses in the lymphomas. Results from the mapping and array studies will be used to prioritize BAC clones for sequence analysis. Because each map will span essentially the entire genome of the corresponding lymphoma, we anticipate that essentially all regions of each tumor genome will be represented in easily sequenced BAC clones. This approach facilitates targeted sequencing of genomic regions of interest, including those containing genes relevant to cancer or harboring amplifications or deletions. Our mapping strategy hinges on the successful creation of deeply redundant high quality BAC libraries from primary lymphomas and large scale high throughput restriction enzyme fingerprinting of individual BACs with a version of the technology we used to map the human, mouse, rat and other genomes. The effort is large-scale, and will result in the generation of at least 2.5 million fingerprinted BAC clones over the next three years. Using the fingerprints, we will align the BACs to the reference human genome to assess genome coverage and to identify candidate genome rearrangements. In parallel, we will assemble the fingerprints into genome maps, looking for larger-scale genome variations between the lymphoma maps and the reference genome sequence. To test the feasibility of our approach, we obtained two restriction digest fingerprints from each of 140,000 individual BAC clones. BACs were sampled from a 7-fold redundant BAC library that had been created from genomic DNA purified from a primary follicular lymphoma sample. 
The fingerprints are being assembled into a clone map with the intent of reconstructing the entire tumor genome. 90,377 fingerprinted clones with unambiguous single alignments to the reference sequence were automatically assembled into 15,538 contigs. Subsequent rounds of semi-automatic contig merging further reduced the number of contigs to 5,433. Only 1,241 clones remained unassembled. We anchored the tumor genome map to the reference human genome sequence by aligning the clone fingerprints to the restriction map computed from the reference sequence assembly. As a result of this, we identified a BAC that captured the canonical t(14;18) translocation characteristic of follicular lymphomas. We sequenced this BAC and confirmed that it contains the expected translocation. Almost 2.6 gigabases (~91%) of the reference genome are represented in the evolving map, with an additional 50,000 clone fingerprints awaiting incorporation into the map assembly. Among these are repeat-rich and other clones that may well harbor genome rearrangements. Additional prioritization of sequencing targets will be undertaken when map construction and analysis of genome copy number alterations are complete.

Applying Small-Scale DNA Signatures as an Aid in Assembling Soybean Chromosome Sequences

Advances in Bioinformatics ◽

10.1155/2010/976792 ◽

2010 ◽

Vol 2010 ◽

pp. 1-7

Author(s):

Myron Peto ◽

David M. Grant ◽

Randy C. Shoemaker ◽

Steven B. Cannon

Keyword(s):

Quality Control ◽

Binding Energy ◽

Dna Binding ◽

Large Scale ◽

Repetitive Sequences ◽

Soybean Genome ◽

Small Scale ◽

Whole Genome ◽

Soybean Chromosome ◽

A Genome

Previous work has established a genomic signature based on relative counts of the 16 possible dinucleotides. Until now, it has been generally accepted that the dinucleotide signature is characteristic of a genome and is relatively homogeneous across a genome. However, we found some local regions of the soybean genome with a signature differing widely from that of the rest of the genome. Those regions were mostly centromeric and pericentromeric, and enriched for repetitive sequences. We found that DNA binding energy also presented large-scale patterns across soybean chromosomes. These two patterns were helpful during assembly and quality control of soybean whole genome shotgun scaffold sequences into chromosome pseudomolecules.
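
A dinucleotide signature of the kind mentioned in this abstract can be sketched in Python as follows. The abstract only says the signature is based on relative counts of the 16 dinucleotides, so the normalization used here (each overlapping dinucleotide count divided by the total) and the simple distance measure are assumptions for illustration; the paper's exact definition may differ. Comparing the signature of a local window against that of a whole chromosome is one way regions with a deviating signature could be flagged.

```python
from collections import Counter
from itertools import product

DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]  # the 16 possible pairs


def dinucleotide_signature(seq):
    """Relative frequency of each overlapping dinucleotide in a DNA sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # Pairs containing ambiguous bases such as N are ignored in the total.
    total = sum(counts[pair] for pair in DINUCLEOTIDES)
    if total == 0:
        raise ValueError("sequence contains no unambiguous dinucleotides")
    return {pair: counts[pair] / total for pair in DINUCLEOTIDES}


def signature_distance(sig_a, sig_b):
    """Mean absolute difference between two signatures (an assumed, simple measure)."""
    return sum(abs(sig_a[pair] - sig_b[pair]) for pair in DINUCLEOTIDES) / len(DINUCLEOTIDES)


# Toy sequences only; a real analysis would slide a window along a chromosome.
genome_like = dinucleotide_signature("ACGTACGTACGTTGCAACGT")
repeat_like = dinucleotide_signature("ATATATATATATATATATAT")
deviation = signature_distance(genome_like, repeat_like)
```

A window whose `deviation` from the genome-wide signature is unusually large would be a candidate for the repeat-rich centromeric and pericentromeric regions the abstract describes.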
