Reconstructing the Gigabase Plant Genome of Solanum pennellii using Nanopore Sequencing

Mapping Intimacies ◽

10.1101/129148 ◽

2017 ◽

Cited By ~ 2

Author(s):

Maximilian H.-W. Schmidt ◽

Alexander Vogel ◽

Alisandra K. Denton ◽

Benjamin Istace ◽

Alexandra Wormit ◽

...

Keyword(s):

Error Rate ◽

De Novo ◽

Sequence Data ◽

Fragment Size ◽

Plant Genome ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Sequencing Technology ◽

Solanum Pennellii ◽

Wild Tomato Species

Recent updates in sequencing technology have made it possible to obtain gigabases of sequence data from a single flowcell. Prior to this update, nanopore sequencing technology was mainly used to analyze and assemble microbial samples [1-3]. Here, we describe the generation of a comprehensive nanopore sequencing dataset with a median fragment size of 11,979 bp for the wild tomato species Solanum pennellii, which has an estimated genome size of ca. 1.0 to 1.1 Gbases. We describe its genome assembly to a contig N50 of 2.5 MB using a pipeline comprising Canu [4] pre-processing and a subsequent assembly using SMARTdenovo. We show that the obtained nanopore-based de novo genome reconstruction is structurally highly similar to that of the reference S. pennellii LA716 genome [5] but has a high error rate caused mostly by deletions in homopolymers. After polishing the assembly with Illumina short-read data, we obtained an error rate of <0.02% when assessed against the same Illumina data. More importantly, however, we obtained a gene completeness of 96.53%, which even slightly surpasses that of the reference S. pennellii genome [5]. Taken together, our data indicate that such long-read sequencing data can be used to affordably sequence and assemble Gbase-sized diploid plant genomes. Raw data is available at http://www.plabipd.de/portal/solanum-pennellii and has been deposited as PRJEB19787.
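The contig N50 quoted in this abstract is the contig length at which contigs of that length or longer contain at least half of all assembled bases. A minimal sketch of that calculation, assuming the contig lengths are already available as a list of integers (the function name `n50` and the toy lengths are illustrative and not taken from the paper):

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, not data from the study).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstract reports error rate and gene completeness alongside it.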


Comparative analysis of alignment tools for application on Nanopore sequencing data

Current Directions in Biomedical Engineering ◽

10.1515/cdbme-2021-2212 ◽

2021 ◽

Vol 7 (2) ◽

pp. 831-834

Author(s):

Chiara Becht ◽

Jonas Schmidt ◽

Frithjof Blessing ◽

Folker Wenzel

Keyword(s):

Error Rate ◽

De Novo ◽

Performance Criteria ◽

Computational Time ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Accurate Analysis ◽

Match Rate ◽

Long Read ◽

And Performance

Abstract INTRODUCTION: Long-read sequencing techniques, such as Oxford Nanopore sequencing, represent a promising novel approach in molecular-biological methodology, potentially facilitating mapping and de novo assembly. In comparison to conventional sequencing methods, novel alignment tools are required to compensate for differing data characteristics (especially the high error rate) and to achieve acceptably accurate analysis results. METHODS: In this study, the long-read aligners BLASR, GraphMap, LAST, minimap2 and NGMLR and the short-read aligner BWA MEM were benchmarked on three experimental datasets. The alignment results were compared on various quality and performance criteria, such as match rate, mismatch rate, error rate, working memory usage and computational time. RESULTS: The comparison revealed differences in alignment quality and performance among the tools under test. LAST showed the largest differences among all tools. Minimap2 achieved constant quality with good performance. BLASR, GraphMap, BWA MEM and NGMLR showed only slight differences. CONCLUSION: Differences among the tools can be explained by dataset characteristics and the algorithmic approaches of the individual tools. All tools except BLASR seem applicable to Nanopore sequencing data. Therefore, the tool should be selected with the experimental design and the further downstream analysis in mind.
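The abstract does not define its quality criteria, so the following is only one plausible reading: a sketch that derives match, mismatch and error rates from per-alignment column counts. The `AlignmentCounts` class, its field names, and the choice of total alignment columns as the denominator are assumptions made for illustration, not the study's definitions.

```python
from dataclasses import dataclass


@dataclass
class AlignmentCounts:
    """Column counts for one alignment (hypothetical container)."""
    matches: int
    mismatches: int
    insertions: int
    deletions: int

    @property
    def columns(self) -> int:
        # Assumption: every match, mismatch, inserted and deleted base
        # counts as one alignment column.
        return self.matches + self.mismatches + self.insertions + self.deletions


def rates(counts: AlignmentCounts) -> dict:
    """Return match, mismatch and overall error rate per alignment column."""
    if counts.columns == 0:
        raise ValueError("alignment has no columns")
    errors = counts.mismatches + counts.insertions + counts.deletions
    return {
        "match_rate": counts.matches / counts.columns,
        "mismatch_rate": counts.mismatches / counts.columns,
        "error_rate": errors / counts.columns,
    }


# Toy counts (not data from the study).
example = AlignmentCounts(matches=850, mismatches=50, insertions=40, deletions=60)
print(rates(example))
```

Under this definition the match rate and error rate sum to one, so tools can be ranked on either; other conventions (for example, normalising by read length) would give different numbers.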


Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Briefings in Bioinformatics ◽

10.1093/bib/bby017 ◽

2018 ◽

Vol 20 (4) ◽

pp. 1542-1559 ◽

Cited By ~ 44

Author(s):

Damla Senol Cali ◽

Jeremie S Kim ◽

Saugata Ghose ◽

Can Alkan ◽

Onur Mutlu

Keyword(s):

Sequence Analysis ◽

Genome Assembly ◽

Sequence Data ◽

Error Rates ◽

Nanopore Sequencing ◽

Memory Usage ◽

Sequencing Technology ◽

Assembly Pipeline ◽

And Performance ◽

Polishing Tool

Abstract Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. 
We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.


NanoGalaxy: Nanopore long-read sequencing data analysis in Galaxy

GigaScience ◽

10.1093/gigascience/giaa105 ◽

2020 ◽

Vol 9 (10) ◽

Cited By ~ 1

Author(s):

Willem de Koning ◽

Milad Miladi ◽

Saskia Hiltemann ◽

Astrid Heikema ◽

John P Hays ◽

...

Keyword(s):

Genome Assembly ◽

Bioinformatics Analysis ◽

De Novo ◽

Sequence Data ◽

Ease Of Use ◽

Easy Access ◽

Complex Data ◽

Sequencing Data ◽

Long Read ◽

Sequencing Platforms

Abstract Background: Long-read sequencing can be applied to generate very long contigs and even completely assembled genomes at relatively low cost and with minimal sample preparation. As a result, long-read sequencing platforms are becoming more popular. In this respect, the Oxford Nanopore Technologies–based long-read sequencing "nanopore" platform is becoming a widely used tool with a broad range of applications and end-users. However, the need to explore and manipulate the complex data generated by long-read sequencing platforms necessitates accompanying specialized bioinformatics platforms and tools to process the long-read data correctly. Importantly, such tools should additionally help democratize bioinformatics analysis by enabling easy access and ease-of-use solutions for researchers. Results: The Galaxy platform provides a user-friendly interface to computational command line–based tools, handles the software dependencies, and provides refined workflows. The users do not have to possess programming experience or extended computer skills. The interface enables researchers to perform powerful bioinformatics analysis, including the assembly and analysis of short- or long-read sequence data. The newly developed "NanoGalaxy" is a Galaxy-based toolkit for analysing long-read sequencing data, which is suitable for diverse applications, including de novo genome assembly from genomic, metagenomic, and plasmid sequence reads. Conclusions: A range of best-practice tools and workflows for long-read sequence genome assembly has been integrated into a NanoGalaxy platform to facilitate easy access and use of bioinformatics tools for researchers. NanoGalaxy is freely available at the European Galaxy server https://nanopore.usegalaxy.eu with supporting self-learning training material available at https://training.galaxyproject.org.


Draft Genome Sequence and intraspecific diversification of the wild crop relative Brassica cretica Lam. using demographic model selection

10.1101/521138 ◽

2019 ◽

Author(s):

Antonis Kioukis ◽

Vassiliki A. Michalopoulou ◽

Laura Briers ◽

Stergios Pirintsos ◽

David J. Studholme ◽

...

Keyword(s):

Genome Sequence ◽

De Novo ◽

Sequence Data ◽

Plant Genetic Resources ◽

Crop Improvement ◽

Draft Genome ◽

Illumina Miseq ◽

Crop Wild Relatives ◽

Demographic Model ◽

Sequencing Data

Abstract Crop wild relatives contain great levels of genetic diversity, representing an invaluable resource for crop improvement. Many of their traits have the potential to help crops become more resistant and resilient, and adapt to the new conditions that they will experience due to climate change. An impressive global effort is under way to conserve various wild crop relatives and facilitate their use in crop breeding for food security. The genus Brassica is listed in Annex I of the International Treaty on Plant Genetic Resources for Food and Agriculture. Brassica oleracea (or wild cabbage) is a species native to coastal southern and western Europe that has become established as an important human food crop plant because of its large reserves stored over the winter in its leaves. Brassica cretica Lam. is a crop wild relative in the brassica group, and B. cretica subsp. nivea has been suggested as a separate subspecies. The species B. cretica has been proposed as a potential gene donor to a number of crops in the brassica group, including broccoli, Brussels sprout, cabbage, cauliflower, kale, swede, turnip and oilseed rape. Here, we present the draft de novo genome assemblies of four B. cretica individuals, including two B. cretica subsp. nivea and two B. cretica. De novo assembly of Illumina MiSeq genomic shotgun sequencing data yielded 243,461 contigs totalling 412.5 Mb in length, corresponding to 122 % of the estimated genome size of B. cretica (339 Mb). According to synteny mapping and phylogenetic analysis of conserved genes, the B. cretica genome based on our sequence data encodes approximately 30,360 proteins. Furthermore, our demographic analysis based on whole-genome data suggests that distinct populations of B. cretica are not isolated. Our findings suggest that the classification of B. cretica into distinct subspecies is not supported by the genome sequence data we analyzed.
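The 122 % figure follows directly from the two sizes given in this abstract. A small check of the arithmetic, assuming both values are in megabases (the function name `percent_of_estimate` is illustrative):

```python
def percent_of_estimate(assembled_mb: float, estimated_mb: float) -> float:
    """Assembly length as a percentage of the estimated genome size."""
    if estimated_mb <= 0:
        raise ValueError("estimated genome size must be positive")
    return 100.0 * assembled_mb / estimated_mb


# Values quoted in the abstract: 412.5 Mb assembled, 339 Mb estimated.
ratio = percent_of_estimate(412.5, 339.0)
print(f"{ratio:.1f} %")  # prints "121.7 %", which rounds to the reported 122 %
```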


LeafGo: Leaf to Genome, a quick workflow to produce high-quality de novo plant genomes using long-read sequencing technology

Genome Biology ◽

10.1186/s13059-021-02475-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Patrick Driguez ◽

Salim Bougouffa ◽

Karen Carty ◽

Alexander Putra ◽

Kamel Jabbari ◽

...

Keyword(s):

De Novo ◽

Plant Genome ◽

Time Cost ◽

Sequencing Technology ◽

High Quality ◽

Plant Genomes ◽

Long Read ◽

Generate Plant ◽

Sequencing Platforms ◽

Chromosome Level

Abstract Currently, different sequencing platforms are used to generate plant genomes and no workflow has been properly developed to optimize time, cost, and assembly quality. We present LeafGo, a complete de novo plant genome workflow, that starts from tissue and produces genomes with modest laboratory and bioinformatic resources in approximately 7 days and using one long-read sequencing technology. LeafGo is optimized with ten different plant species, three of which are used to generate high-quality chromosome-level assemblies without any scaffolding technologies. Finally, we report the diploid genomes of Eucalyptus rudis and E. camaldulensis and the allotetraploid genome of Arachis hypogaea.


First de novo draft genome sequence of Oryza coarctata, the only halophytic species in the genus Oryza

F1000Research ◽

10.12688/f1000research.12414.1 ◽

2017 ◽

Vol 6 ◽

pp. 1750 ◽

Cited By ~ 8

Author(s):

Tapan Kumar Mondal ◽

Hukam Chand Rawal ◽

Kishor Gaikwad ◽

Tilak Raj Sharma ◽

Nagendra Kumar Singh

Keyword(s):

West Bengal ◽

De Novo ◽

Draft Genome ◽

Nanopore Sequencing ◽

Sequencing Technology ◽

Genome Sequences ◽

Oxford Nanopore ◽

Hybrid Genome ◽

Genus Oryza ◽

First Time

Oryza coarctata plants, collected from the Sundarban delta of West Bengal, India, have been used in the present study to generate draft genome sequences, employing a hybrid genome assembly with Illumina reads and third-generation Oxford Nanopore sequencing technology. We report, for the first time, a genome coverage of more than 85.71 %; the data have been deposited in NCBI SRA under BioProject ID PRJNA396417.


A de novo assembly of the sweet cherry (Prunus avium cv. Tieton) genome using linked-read sequencing technology

PeerJ ◽

10.7717/peerj.9114 ◽

2020 ◽

Vol 8 ◽

pp. e9114 ◽

Cited By ~ 1

Author(s):

Jiawei Wang ◽

Weizhen Liu ◽

Dongzi Zhu ◽

Xiang Zhou ◽

Po Hong ◽

...

Keyword(s):

Sweet Cherry ◽

Prunus Avium ◽

Reference Genome ◽

De Novo ◽

Draft Genome ◽

Single Copy ◽

Sequencing Data ◽

Sequencing Technology ◽

High Quality ◽

Eukaryotic Genes

The sweet cherry (Prunus avium) is one of the most economically important fruit species in the world. However, there is a limited amount of genetic information available for this species, which hinders breeding efforts at a molecular level. We were able to describe a high-quality reference genome assembly and annotation of the diploid sweet cherry (2n = 2x = 16) cv. Tieton using linked-read sequencing technology. We generated over 750 million clean reads, representing 112.63 GB of raw sequencing data. The Supernova assembler produced a more highly-ordered and continuous genome sequence than the current P. avium draft genome, with a contig N50 of 63.65 KB and a scaffold N50 of 2.48 MB. The final scaffold assembly was 280.33 MB in length, representing 82.12% of the estimated Tieton genome. Eight chromosome-scale pseudomolecules were constructed, completing a 214 MB sequence of the final scaffold assembly. De novo, homology-based, and RNA-seq methods were used together to predict 30,975 protein-coding loci. 98.39% of core eukaryotic genes and 97.43% of single copy orthologues were identified in the embryo plant, indicating the completeness of the assembly. Linked-read sequencing technology was effective in constructing a high-quality reference genome of the sweet cherry, which will benefit the molecular breeding and cultivar identification in this species.


A benchmarking of human mitochondrial DNA haplogroup classifiers from whole-genome and whole-exome sequence data

10.1101/2021.02.11.430775 ◽

2021 ◽

Author(s):

Víctor García-Olivares ◽

Adrián Muñoz-Barrera ◽

José Miguel Lorenzo-Salazar ◽

Carlos Zaragoza-Trello ◽

Luis A. Rubio-Rodríguez ◽

...

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Sequence Data ◽

Qualitative Assessment ◽

Whole Genome ◽

Third Generation ◽

Sequencing Data ◽

Short Read ◽

Bioinformatic Tools ◽

Whole Exome

Abstract The mitochondrial genome (mtDNA) is of interest for a range of fields including evolutionary, forensic, and medical genetics. Human mitogenomes can be classified into evolutionarily related haplogroups that provide ancestral information and pedigree relationships. Because of this and the advent of high-throughput sequencing (HTS) technology, there is a diversity of bioinformatic tools for haplogroup classification. We present a benchmarking of the 11 most salient tools for human mtDNA classification using empirical whole-genome (WGS) and whole-exome (WES) short-read sequencing data from 36 unrelated donors. In addition, because of its relevance, we also assess the best-performing tool on third-generation long noisy read WGS data obtained with nanopore technology for a subset of the donors. We found that, for short-read WGS, most of the tools exhibit high accuracy for haplogroup classification irrespective of the input file used for the analysis. However, for short-read WES, Haplocheck and MixEmt were the most accurate tools. Based on the performance shown for WGS and WES, and the accompanying qualitative assessment, Haplocheck stands out as the most complete tool. For third-generation HTS data, we also showed that Haplocheck was able to accurately retrieve mtDNA haplogroups for all samples assessed, although only after following assembly-based approaches (based on either a reference-based assembly or a hybrid de novo assembly). Taken together, our results provide guidance for researchers to select the most suitable tool to conduct mtDNA analyses from HTS data.


Evaluating nanopore sequencing data processing pipelines for structural variation identification

Genome Biology ◽

10.1186/s13059-019-1858-1 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 9

Author(s):

Anbo Zhou ◽

Timothy Lin ◽

Jinchuan Xing

Keyword(s):

Detection Accuracy ◽

Nanopore Sequencing ◽

Sequencing Data ◽

Sequencing Technology ◽

Structural Variations ◽

Human Genomes ◽

Data Assessment ◽

Machine Learning Approach ◽

Long Read ◽

The Impact

Abstract Background: Structural variations (SVs) account for about 1% of the differences among human genomes and play a significant role in phenotypic variation and disease susceptibility. The emerging nanopore sequencing technology can generate long sequence reads and can potentially provide accurate SV identification. However, the tools for aligning long-read data and detecting SVs have not been thoroughly evaluated. Results: Using four nanopore datasets, including both empirical and simulated reads, we evaluate four alignment tools and three SV detection tools. We also evaluate the impact of sequencing depth on SV detection. Finally, we develop a machine learning approach to integrate call sets from multiple pipelines. Overall, SV callers' performance varies depending on the SV types. For an initial data assessment, we recommend using aligner minimap2 in combination with SV caller Sniffles because of their speed and relatively balanced performance. For detailed analysis, we recommend incorporating information from multiple call sets to improve the SV call performance. Conclusions: We present a workflow for evaluating aligners and SV callers for nanopore sequencing data and approaches for integrating multiple call sets. Our results indicate that additional optimizations are needed to improve SV detection accuracy and sensitivity, and an integrated call set can provide enhanced performance. The nanopore technology is improving, and the sequencing community is likely to grow accordingly. In turn, better benchmark call sets will be available to more accurately assess the performance of available tools and facilitate further tool development.


Exome-Wide Analysis of the DiscovEHR Cohort Reveals Novel Candidate Pharmacogenomic Variants for Clinical Pharmacogenomics

Genes ◽

10.3390/genes11050561 ◽

2020 ◽

Vol 11 (5) ◽

pp. 561

Author(s):

Maria-Theodora Pandi ◽

Marc S. Williams ◽

Peter van der Spek ◽

Maria Koromina ◽

George P. Patrinos

Keyword(s):

Genetic Variation ◽

Sequence Data ◽

Sequencing Data ◽

Sequencing Technology ◽

Next Generation Sequencing Technology ◽

Exome Sequencing Data ◽

Whole Exome ◽

Whole Exome Sequencing Data ◽

Generation Sequencing

Recent advances in next-generation sequencing technology have led to the production of an unprecedented volume of genomic data, thus further advancing our understanding of the role of genetic variation in clinical pharmacogenomics. In the present study, we used whole exome sequencing data from 50,726 participants, as derived from the DiscovEHR cohort, to identify pharmacogenomic variants of potential clinical relevance, according to their occurrence within the PharmGKB database. We further assessed the distribution of the identified rare and common pharmacogenomics variants amongst different GnomAD subpopulations. Overall, our findings show that the use of publicly available sequence data, such as the DiscovEHR dataset and GnomAD, provides an opportunity for a deeper understanding of genetic variation in pharmacogenes with direct implications in clinical pharmacogenomics.
