Detection and assembly of novel sequence insertions using Linked-Read technology

2019 ◽  
Author(s):  
Dmitry Meleshko ◽  
Patrick Marks ◽  
Stephen Williams ◽  
Iman Hajirasouliha

Abstract Motivation Emerging Linked-Read (aka read-cloud) technologies such as the 10x Genomics Chromium system have great potential for accurate detection and phasing of large-scale human genome structural variations (SVs). By leveraging the long-range information encoded in Linked-Read sequencing, computational techniques are able to detect and characterize complex structural variations that were previously undetectable by short-read methods. However, there is no available Linked-Read method for detection and assembly of novel sequence insertions, DNA sequences present in a given sequenced sample but missing in the reference genome, without requiring whole genome de novo assembly. In this paper, we propose a novel integrated alignment-based and local-assembly-based algorithm, Novel-X, that effectively uses the barcode information encoded in Linked-Read sequencing datasets to improve detection of such events without the need for whole genome de novo assembly. We evaluated our method on two haploid human genomes, CHM1 and CHM13, sequenced on the 10x Genomics Chromium system. These genomes have also recently been characterized with high-coverage PacBio long reads. We also tested our method on NA12878, the well-known HapMap CEPH diploid genome, and the child genome in a Yoruba trio (NA19240), which was recently studied on multiple sequencing platforms. Detecting insertion events is very challenging using short reads, and the only viable available solution is long-read sequencing (e.g. PacBio or ONT). Our experiments, however, show that Novel-X finds many insertions that cannot be found by state-of-the-art tools using short-read sequencing data but are present in PacBio data. 
Since Linked-Read sequencing is significantly cheaper than long-read sequencing, our method using Linked-Reads enables routine large-scale screenings of sequenced genomes for novel sequence insertions. Availability Software is freely available at https://github.com/1dayac/[email protected] Supplementary information Supplementary data are available at https://github.com/1dayac/novel_insertions_supplementary
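One way barcode information could support local assembly around a candidate insertion site is sketched below. This is an illustration only and is not taken from Novel-X: the tuple layouts, the function names `barcodes_near` and `recruit_reads`, the 10 kb window, and the `min_support` threshold are all assumptions made for the example.

```python
from collections import Counter

def barcodes_near(aligned_reads, chrom, pos, window=10000):
    """Count barcodes of reads aligned within `window` bp of a candidate site.

    aligned_reads: iterable of (chrom, position, barcode) tuples (assumed layout).
    """
    return Counter(
        bc for c, p, bc in aligned_reads
        if c == chrom and abs(p - pos) <= window
    )

def recruit_reads(unmapped_reads, barcode_counts, min_support=2):
    """Select unmapped reads whose barcode is well supported near the site.

    unmapped_reads: iterable of (read_name, barcode) tuples (assumed layout).
    """
    keep = {bc for bc, n in barcode_counts.items() if n >= min_support}
    return [name for name, bc in unmapped_reads if bc in keep]

aligned = [
    ("chr1", 1000, "AAA"), ("chr1", 1500, "AAA"), ("chr1", 2000, "CCC"),
    ("chr2", 1000, "GGG"), ("chr1", 500000, "TTT"),
]
counts = barcodes_near(aligned, "chr1", 1200)
recruited = recruit_reads([("r1", "AAA"), ("r2", "CCC"), ("r3", "TTT")], counts)
```

Reads recruited this way would then be passed to a local assembler, so that only a small barcode-selected subset of the data is assembled rather than the whole genome.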

2020 ◽  
Vol 36 (10) ◽  
pp. 3242-3243 ◽  
Author(s):  
Samuel O’Donnell ◽  
Gilles Fischer

Abstract Summary MUM&Co is a single bash script to detect structural variations (SVs) utilizing whole-genome alignment (WGA). Using MUMmer’s nucmer alignment, MUM&Co can detect insertions, deletions, tandem duplications, inversions and translocations greater than 50 bp. Its versatility depends upon the WGA and therefore benefits from contiguous de novo assemblies generated by third-generation sequencing technologies. Benchmarked against five WGA SV-calling tools, MUM&Co outperforms all tools on simulated SVs in yeast, plant and human genomes and performs similarly in two real human datasets. Additionally, MUM&Co is unique in its ability to find inversions in both simulated and real datasets. Lastly, MUM&Co’s primary output is an intuitive tabulated file containing a list of SVs with only necessary genomic details. Availability and implementation https://github.com/SAMtoBAM/MUMandCo. Supplementary information Supplementary data are available at Bioinformatics online.
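The general idea of calling insertions and deletions from a whole-genome alignment can be sketched as follows. This is a simplified illustration, not MUM&Co's actual logic: the block tuple layout is an assumption (it is not nucmer's output format), only colinear same-strand blocks are handled, and `call_indels` is a hypothetical name. The 50 bp threshold is the one stated in the abstract.

```python
MIN_SV = 50  # size threshold from the abstract: SVs greater than 50 bp

def call_indels(blocks, min_size=MIN_SV):
    """Call insertions/deletions from gaps between consecutive colinear blocks.

    blocks: list of (ref_start, ref_end, qry_start, qry_end) tuples on the
    same strand, sorted by reference position, half-open coordinates.
    """
    calls = []
    for (_, r_end, _, q_end), (r_start, _, q_start, _) in zip(blocks, blocks[1:]):
        ref_gap = r_start - r_end   # unaligned reference bases between blocks
        qry_gap = q_start - q_end   # unaligned query bases between blocks
        diff = qry_gap - ref_gap
        if diff > min_size:
            calls.append(("insertion", r_end, diff))   # extra sequence in query
        elif -diff > min_size:
            calls.append(("deletion", r_end, -diff))   # sequence missing in query
    return calls

blocks = [
    (0, 1000, 0, 1000),
    (1000, 2000, 1200, 2200),   # 200 extra query bases before this block
    (2300, 3000, 2200, 2900),   # 300 reference bases skipped before this block
]
calls = call_indels(blocks)
```

Because the calls depend entirely on how contiguous the aligned blocks are, a fragmented assembly produces many spurious gaps, which is consistent with the abstract's remark that the approach benefits from contiguous assemblies.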


Blood ◽  
2019 ◽  
Vol 134 (Supplement_1) ◽  
pp. 5201-5201
Author(s):  
Ute Fischer ◽  
Layal Yasin ◽  
Julia Täubner ◽  
Triantafyllia Brozou ◽  
Arndt Borkhardt

Germline mutations account for a substantial proportion of childhood cancer and may critically affect disease characteristics, therapy efficacy, severity of treatment side effects and patient outcome. To date, only 8-10% of childhood cancer cases can be explained by germline mutations identified in known cancer predisposing genes. This is in part due to the technical limitation of next generation short read sequencing, which detects single nucleotide variants, small deletions/insertions or simple copy number variations, but is not a reliable tool to identify larger structural variations (SVs, >500 bp) which are frequent in the human genome and may impact on disease predisposition. Using whole genome optical mapping (WGOM) we aimed at identification of de novo and inherited germline SVs in a cohort of patients with clinically suspected cancer predisposition but without informative findings in short read sequencing analyses. After informed consent we performed family trio based short read (2x 100 bp) whole exome sequencing (WES) on a HiSeq2500 (Illumina) and collected clinical and demographic data for a cohort of >100 families with children affected by cancer who were treated in our hospital. About 25% of the patients either (1) had a family history indicative of cancer susceptibility, or (2) had accompanying clinical findings (e.g. developmental delay, congenital anomalies) or (3) experienced excessive toxicity during chemotherapy. From this subgroup we selected four patients with acute lymphoblastic leukemia whose sequencing data and routine genetic workup were not informative of a known cancer predisposing syndrome and employed family trio-based next generation WGOM on a Saphyr instrument equipped with Access software (Bionano Genomics) to identify genomic SVs. To this end, we extracted and labeled high molecular weight DNA molecules at specific hexamer sequence motifs (average distance: 5 kb) using a DNA methyltransferase-based direct labeling reaction. 
Imaging was carried out at the single-molecule level and each sample genome was de novo assembled from molecule data. Consensus genome maps were clustered into two alleles and diploid assemblies created. Genomes of patients were compared to parental genomes and the GRCh38 reference genome. SVs were inferred from de novo assemblies and genome comparisons with respect to quality scores, overall molecule coverage, fraction of molecules displaying the SV event, and chimeric DNA fragment mapping. Specific SV calls were compared to a set of > 160 human control samples (provided by Bionano Genomics) to filter against common SVs and potential artifacts. Filtered SVs were annotated using structural variant and gene databases. Employing WGOM we analyzed DNA molecules 300,000 bp long on average and achieved genomic coverage ranging from 90-132x corresponding to 330-480 Gbp. For instance, for one patient, we obtained 1751 insertions, 624 deletions, 77 inversions, 21 duplications, 1 intra- and 2 inter-chromosomal translocations before filtering. The majority of these events (78%) were inherited from both parents. 20% were inherited from either father or mother and 2% were generated de novo. As the family history of this patient was inconspicuous for tumor diseases, we removed all inherited events and filtered against common variants. This resulted in only two candidate de novo lesions: a heterozygous 129,495 bp deletion framed by inversions (chr9: 66,156,733-66,622,623) in a gene-less region and a heterozygous inverted 352,667 bp duplication (chr22: 15,522,454-15,875,120) that spanned the genes OR11H, POTEH, POTEH-AS1, LINC01297, DUXAP8, and BMS1P22. Of these genes DUXAP8 is an oncogenic non-coding RNA of the homeobox gene family that has been associated with increased tumor growth and poorer prognosis in a wide variety of somatic cancers. 
It functions as a regulator of transcription by binding to key components of the developmental regulator epigenetic polycomb repressive complex 2 and may thus account for additional presentations of the child (dwarfism, accelerated skeletal age, linguistic developmental delay, morphological traits). Our results indicate that WGOM is a useful technology to identify candidate SVs in children predisposed to cancer and developmental syndromes. Several candidates are currently being tested and the results will be presented. Disclosures No relevant conflicts of interest to declare.
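The trio-based filtering step described above (separating inherited events from de novo candidates) can be sketched in a few lines. This is a minimal illustration under strong assumptions: SVs are matched by exact key, whereas real optical-mapping calls need tolerant breakpoint matching, and `classify_inheritance` and the key layout are hypothetical.

```python
def classify_inheritance(child, father, mother):
    """Label each child SV by which parents carry the same call.

    Each argument is a set of hashable SV keys, e.g. (type, chrom, start, end).
    Exact key matching is a simplification made for this sketch.
    """
    labels = {}
    for sv in child:
        in_father, in_mother = sv in father, sv in mother
        if in_father and in_mother:
            labels[sv] = "both parents"
        elif in_father or in_mother:
            labels[sv] = "one parent"
        else:
            labels[sv] = "de novo"
    return labels

child = {("DEL", "chr9", 100, 200), ("INS", "chr1", 5, 5), ("DUP", "chr22", 10, 50)}
father = {("DEL", "chr9", 100, 200), ("INS", "chr1", 5, 5)}
mother = {("DEL", "chr9", 100, 200)}
labels = classify_inheritance(child, father, mother)
de_novo = [sv for sv, label in labels.items() if label == "de novo"]
```

In the workflow described in the abstract, the de novo candidates would additionally be filtered against the control-sample set before annotation.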


2017 ◽  
Author(s):  
Patrick Marks ◽  
Sarah Garcia ◽  
Alvaro Martinez Barrio ◽  
Kamila Belhocine ◽  
Jorge Bernate ◽  
...  

Abstract Large-scale population based analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short read whole genome sequencing. However, standard short-read approaches, used primarily due to accuracy, throughput and costs, fail to give a complete picture of a genome. They struggle to identify large, balanced structural events, cannot access repetitive regions of the genome and fail to resolve the human genome into its two haplotypes. Here we describe an approach that retains long range information while harnessing the advantages of short reads. Starting from only ∼1 ng of DNA, we produce barcoded short read libraries. The use of novel informatic approaches allows for the barcoded short reads to be associated with the long molecules of origin producing a novel datatype known as ‘Linked-Reads’. This approach allows for simultaneous detection of small and large variants from a single Linked-Read library. We have previously demonstrated the utility of whole genome Linked-Reads (lrWGS) for performing diploid, de novo assembly of individual genomes (Weisenfeld et al. 2017). In this manuscript, we show the advantages of Linked-Reads over standard short read approaches for reference based analysis. We demonstrate the ability of Linked-Reads to reconstruct megabase scale haplotypes and to recover parts of the genome that are typically inaccessible to short reads, including phenotypically important genes such as STRC, SMN1 and SMN2. We demonstrate the ability of both lrWGS and Linked-Read Whole Exome Sequencing (lrWES) to identify complex structural variations, including balanced events, single exon deletions, and single exon duplications. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Sai Chen ◽  
Peter Krusche ◽  
Egor Dolzhenko ◽  
Rachel M. Sherman ◽  
Roman Petrovski ◽  
...  

Abstract Accurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations. We demonstrate the accuracy of Paragraph on whole-genome sequence data from three samples using long-read SV calls as the truth set, and then apply Paragraph at scale to a cohort of 100 short-read sequenced samples of diverse ancestry. Our analysis shows that Paragraph has better accuracy than other existing genotypers and can be applied to population-scale studies.


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Akihiro Fujimoto ◽  
Jing Hao Wong ◽  
Yukiko Yoshii ◽  
Shintaro Akiyama ◽  
Azusa Tanaka ◽  
...  

Abstract Background Identification of germline variation and somatic mutations is a major issue in human genetics. However, due to the limitations of DNA sequencing technologies and computational algorithms, our understanding of genetic variation and somatic mutations is far from complete. Methods In the present study, we performed whole-genome sequencing using long-read sequencing technology (Oxford Nanopore) for 11 Japanese liver cancers and matched normal samples which were previously sequenced for the International Cancer Genome Consortium (ICGC). We constructed an analysis pipeline for the long-read data and identified germline and somatic structural variations (SVs). Results In polymorphic germline SVs, our analysis identified 8004 insertions, 6389 deletions, 27 inversions, and 32 intra-chromosomal translocations. By comparing to the chimpanzee genome, we correctly inferred events that caused insertions and deletions and found that most insertions were caused by transposons, with Alu the predominant source, while other types of insertions, such as tandem duplications and processed pseudogenes, are rare. We inferred mechanisms of deletion generation and found that most non-allelic homologous recombination (NAHR) events were caused by recombination errors in SINEs. Analysis of somatic mutations in liver cancers showed that long reads could detect larger numbers of SVs than a previous short-read study and that mechanisms of cancer SV generation were different from those of germline deletions. Conclusions Our analysis provides a comprehensive catalog of polymorphic and somatic SVs, as well as their possible causes. Our software is available at https://github.com/afujimoto/CAMPHOR and https://github.com/afujimoto/CAMPHORsomatic.


2021 ◽  
Vol 17 (6) ◽  
pp. e1009078
Author(s):  
Jingwen Ren ◽  
Mark J. P. Chaisson

It is computationally challenging to detect variation by aligning single-molecule sequencing (SMS) reads, or contigs from SMS assemblies. One approach to efficiently align SMS reads is sparse dynamic programming (SDP), where optimal chains of exact matches are found between the sequence and the genome. While straightforward implementations of SDP penalize gaps with a cost that is a linear function of gap length, biological variation is more accurately represented when gap cost is a concave function of gap length. We have developed a method, lra, that uses SDP with a concave-cost gap penalty, and used lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. This alignment approach increases sensitivity and specificity for SV discovery, particularly for variants above 1 kb and when discovering variation from ONT reads, with runtimes that are comparable (1.05-3.76×) to current methods. When applied to calling variation from de novo assembly contigs, there is a 3.2% increase in Truvari F1 score compared to minimap2+htsbox. lra is available in bioconda (https://anaconda.org/bioconda/lra) and github (https://github.com/ChaissonLab/LRA).
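The difference between linear and concave gap costs can be seen in a small chaining sketch. This is not lra's implementation: it uses a naive quadratic dynamic program instead of sparse dynamic programming, and the anchor tuple layout, the logarithmic cost, and all function names are assumptions chosen for illustration.

```python
import math

def linear_gap(d, scale=1.0):
    """Gap cost proportional to gap length."""
    return scale * d

def concave_gap(d, scale=1.0):
    """Gap cost that grows ever more slowly with gap length (assumed log form)."""
    return scale * math.log1p(d)

def chain_score(anchors, gap_cost):
    """Best chain score over exact-match anchors (read_pos, ref_pos, length).

    Naive O(n^2) dynamic programming. Anchors in a chain must be increasing
    and non-overlapping on both the read and the reference.
    """
    anchors = sorted(anchors)
    best = []
    for i, (qi, ri, li) in enumerate(anchors):
        score = li  # a chain made of this anchor alone
        for j in range(i):
            qj, rj, lj = anchors[j]
            if qj + lj <= qi and rj + lj <= ri:
                # Gap length: difference between read and reference distances.
                gap = abs((qi - (qj + lj)) - (ri - (rj + lj)))
                score = max(score, best[j] + li - gap_cost(gap))
        best.append(score)
    return max(best) if best else 0.0

# Two 20 bp anchors flanking a 1000 bp deletion relative to the reference.
anchors = [(0, 0, 20), (20, 1020, 20)]
linear = chain_score(anchors, linear_gap)
concave = chain_score(anchors, concave_gap)
```

With the linear cost the 1000 bp gap outweighs the second anchor, so the best chain is a single anchor; with the concave cost both anchors are chained across the deletion, which is the behavior that makes large variants easier to recover.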


2017 ◽  
Author(s):  
Andrew J. Page ◽  
Alexander Wailan ◽  
Yan Shao ◽  
Kim Judge ◽  
Gordon Dougan ◽  
...  

Abstract When defining bacterial populations through whole genome sequencing (WGS) the samples often have detailed associated metadata that relate to disease severity, antimicrobial resistance, or even rare biochemical traits. When comparing these bacterial populations, it is apparent that some of these phenotypes do not follow the phylogeny of the host, i.e. they are genetically unlinked to the evolutionary history of the host bacterium. One possible explanation for this phenomenon is that the genes are moving independently between hosts and are likely associated with mobile genetic elements (MGEs). However, identifying the element that is associated with these traits can be complex if the starting point is short read WGS data. With the increased use of next generation WGS in routine diagnostics, surveillance and epidemiology a vast amount of short read data is available and these types of associations are relatively unexplored. One way to address this would be to perform assembly de novo of the whole genome read data, including its MGEs. However, MGEs are often full of repeats and can lead to fragmented consensus sequences. Deciding which sequence is part of the chromosome, and which is part of an MGE can be ambiguous. We present PlasmidTron, which utilises the phenotypic data normally available in bacterial population studies, such as antibiograms, virulence factors, or geographic information, to identify sequences that are likely to represent MGEs linked to the phenotype. Given a set of reads, categorised into cases (showing the phenotype) and controls (phylogenetically related but phenotypically negative), PlasmidTron can be used to assemble de novo reads from each sample linked by a phenotype. A k-mer based analysis is performed to identify reads associated with a phylogenetically unlinked phenotype. These reads are then assembled de novo to produce contigs. 
By utilising k-mers and only assembling a fraction of the raw reads, the method is fast and scalable to large datasets. This approach has been tested on plasmids, because of their contribution to important pathogen associated traits, such as AMR, hence the name, but there is no reason why this approach cannot be utilized for any MGE that can move independently through a bacterial population. PlasmidTron is written in Python 3 and available under the open source licence GNU GPL3 from https://github.com/sanger-pathogens/plasmidtron. DATA SUMMARY Source code for PlasmidTron is available from Github under the open source licence GNU GPL 3; (url - https://goo.gl/ot6rT5) Simulated raw reads files have been deposited in Figshare; (url - https://doi.org/10.6084/m9.figshare.5406355.vl) Salmonella enterica serovar Weltevreden strain VNS10259 is available from GenBank; accession number GCA_001409135. Salmonella enterica serovar Typhi strain BL60006 is available from GenBank; accession number GCA_900185485. Accession numbers for all of the Illumina datasets used in this paper are listed in the supplementary tables. I/We confirm all supporting data, code and protocols have been provided within the article or through supplementary data files. ⊠ IMPACT STATEMENT PlasmidTron utilises the phenotypic data normally available in bacterial population studies, such as antibiograms, virulence factors, or geographic information, to identify sequences that are likely to represent MGEs linked to the phenotype.
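The case/control k-mer filtering described in the abstract can be sketched as below. This is a toy illustration and not PlasmidTron's code: the function names, the tiny `k=5`, and the `min_fraction` threshold are assumptions, and a real pipeline would use a dedicated k-mer counter rather than Python sets.

```python
def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def case_specific_kmers(case_reads, control_reads, k):
    """k-mers seen in the case reads but never in the control reads."""
    case = set().union(*(kmers(r, k) for r in case_reads))
    control = set().union(*(kmers(r, k) for r in control_reads))
    return case - control

def select_reads(case_reads, control_reads, k=5, min_fraction=0.5):
    """Keep case reads in which at least min_fraction of k-mers are case-specific."""
    specific = case_specific_kmers(case_reads, control_reads, k)
    kept = []
    for read in case_reads:
        read_kmers = kmers(read, k)
        if read_kmers and len(read_kmers & specific) / len(read_kmers) >= min_fraction:
            kept.append(read)
    return kept

case_reads = ["ACGTACGTAC", "TTTTGGGGCC"]   # second read is absent from the control
control_reads = ["ACGTACGTAC"]
selected = select_reads(case_reads, control_reads)
```

Only the selected reads would then be passed to a de novo assembler, which is why assembling a fraction of the raw reads keeps the approach fast.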


2020 ◽  
Vol 36 (17) ◽  
pp. 4568-4575
Author(s):  
Lolita Lecompte ◽  
Pierre Peterlongo ◽  
Dominique Lavenier ◽  
Claire Lemaitre

Abstract Motivation Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third generation sequencing technologies, the number of discovered SVs is increasing, especially in the human genome. At the same time, for several applications such as clinical diagnoses, it is important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas several SV genotypers have been developed for short read data, there is a lack of dedicated tools to assess whether known SVs are present or not in a new long read sequenced sample, such as those produced by Pacific Biosciences or Oxford Nanopore Technologies. Results We present a novel method to genotype known SVs from long read sequencing data. The method is based on the generation of a set of representative allele sequences that represent the two alleles of each structural variant. Long reads are aligned to these allele sequences. Alignments are then analyzed and filtered to keep only informative ones, to quantify and estimate the presence of each SV allele and the allele frequencies. We provide an implementation of the method, SVJedi, to genotype SVs with long reads. The tool has been applied to both simulated and real human datasets and achieves high genotyping accuracy. We show that SVJedi obtains better performances than other existing long read genotyping tools and we also demonstrate that SV genotyping is considerably improved with SVJedi compared to other approaches, namely SV discovery and short read SV genotyping approaches. Availability and implementation https://github.com/llecompte/SVJedi.git Supplementary information Supplementary data are available at Bioinformatics online.
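Once informative alignments have been counted for each allele sequence, a genotype can be estimated from the two counts. The sketch below shows one common way to do this with a simple binomial likelihood; it is an assumed model for illustration, not SVJedi's published procedure, and the `genotype` function and the `error` rate of 0.05 are hypothetical.

```python
import math

def genotype(ref_reads, alt_reads, error=0.05):
    """Pick the most likely genotype (0/0, 0/1, 1/1) from allele read counts.

    Assumed model: each informative read supports the alternative allele with
    probability `error`, 0.5, or 1 - `error` under 0/0, 0/1, and 1/1. The
    binomial coefficient is the same for all genotypes and is omitted.
    """
    if ref_reads + alt_reads == 0:
        return "./."  # no informative reads, genotype left uncalled
    p_alt = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}

    def log_likelihood(p):
        return alt_reads * math.log(p) + ref_reads * math.log(1.0 - p)

    return max(p_alt, key=lambda g: log_likelihood(p_alt[g]))

calls = [genotype(20, 0), genotype(10, 10), genotype(1, 19), genotype(0, 0)]
```

The allele frequency mentioned in the abstract corresponds here to `alt_reads / (ref_reads + alt_reads)`, which the likelihood compares against the expected value for each genotype.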


2018 ◽  
Author(s):  
Alba Sanchis-Juan ◽  
Jonathan Stephens ◽  
Courtney E French ◽  
Nicholas Gleadall ◽  
Karyn Mégy ◽  
...  

Abstract Complex structural variants (cxSVs) are genomic rearrangements comprising multiple structural variants, typically involving three or more breakpoint junctions. They contribute to human genomic variation and can cause Mendelian disease; however, they are not typically considered during genetic testing. Here, we investigate the role of cxSVs in Mendelian disease using short-read whole genome sequencing (WGS) data from 1,324 individuals with neurodevelopmental or retinal disorders from the NIHR BioResource project. We present four cases of individuals with a cxSV affecting Mendelian disease-associated genes. Three of the cxSVs are pathogenic: a de novo duplication-inversion-inversion-deletion affecting ARID1B in an individual with Coffin-Siris syndrome, a deletion-inversion-duplication affecting HNRNPU in an individual with intellectual disability and seizures, and a homozygous deletion-inversion-deletion affecting CEP78 in an individual with cone-rod dystrophy. Additionally, we identified a de novo duplication-inversion-duplication overlapping CDKL5 in an individual with neonatal hypoxic-ischaemic encephalopathy. Long-read sequencing technology used to resolve the breakpoints demonstrated the presence of both a disrupted and an intact copy of CDKL5 on the same allele; therefore, it was classified as a variant of uncertain significance. Analysis of sequence flanking all breakpoint junctions in all the cxSVs revealed both microhomology and longer repetitive sequences, suggesting both replication and homology based processes. Accurate resolution of cxSVs is essential for clinical interpretation, and here we demonstrate that long-read WGS is a powerful technology by which to achieve this. Our results show cxSVs are an important although rare cause of Mendelian disease, and we therefore recommend their consideration during research and clinical investigations.


F1000Research ◽  
2018 ◽  
Vol 6 ◽  
pp. 618 ◽  
Author(s):  
Michael Liem ◽  
Hans J. Jansen ◽  
Ron P. Dirks ◽  
Christiaan V. Henkel ◽  
G. Paul H. van Heusden ◽  
...  

Background: The introduction of the MinION sequencing device by Oxford Nanopore Technologies may greatly accelerate whole genome sequencing. Nanopore sequence data offers great potential for de novo assembly of complex genomes without using other technologies. Furthermore, Nanopore data combined with other sequencing technologies is highly useful for accurate annotation of all genes in the genome. In this manuscript we used nanopore sequencing as a tool to classify yeast strains. Methods: We compared various technical and software developments for the nanopore sequencing protocol, showing that the R9 chemistry is, as predicted, higher in quality than R7.3 chemistry. The R9 chemistry is an essential improvement for assembly of the extremely AT-rich mitochondrial genome. We double corrected assemblies from four different assemblers with PILON and assessed sequence correctness before and after PILON correction with a set of 290 Fungi genes using BUSCO. Results: In this study, we used this new technology to sequence and de novo assemble the genome of a recently isolated ethanologenic yeast strain, and compared the results with those obtained by classical Illumina short read sequencing. This strain was originally named Candida vartiovaarae (Torulopsis vartiovaarae) based on ribosomal RNA sequencing. We show that the assembly using nanopore data is much more contiguous than the assembly using short read data. We also compared various technical and software developments for the nanopore sequencing protocol, showing that nanopore-derived assemblies provide the highest contiguity. Conclusions: The mitochondrial and chromosomal genome sequences showed that our strain is clearly distinct from other yeast taxa and most closely related to published Cyberlindnera species. In conclusion, MinION-mediated long read sequencing can be used for high quality de novo assembly of new eukaryotic microbial genomes.

