PaKman: Scalable Assembly of Large Genomes on Distributed Memory Machines

Mapping Intimacies ◽

10.1101/523068 ◽

2019 ◽

Author(s):

Priyanka Ghosh ◽

Sriram Krishnamoorthy ◽

Ananth Kalyanaraman

Keyword(s):

Genome Assembly ◽

Large Scale ◽

Distributed Memory ◽

High Throughput Sequencing ◽

De Novo ◽

State Of The Art ◽

Fundamental Problem ◽

Parallel Computer ◽

Assembly Process ◽

Data Movement

Abstract: De novo genome assembly is a fundamental problem in bioinformatics that aims to assemble the DNA sequence of an unknown genome from numerous short DNA fragments (aka reads) obtained from it. With the advent of high-throughput sequencing technologies, billions of reads can be generated in a matter of hours, necessitating efficient parallelization of the assembly process. While multiple parallel solutions have been proposed in the past, conducting a large-scale assembly remains a challenging problem because of the inherent complexities associated with data movement and the irregular access footprints of memory and I/O operations. In this paper, we present a novel algorithm, called PaKman, to address the problem of performing large-scale genome assemblies on a distributed memory parallel computer. Our approach focuses on improving performance through a combination of novel data structures and algorithmic strategies for reducing the communication and I/O footprint during the assembly process. PaKman presents a solution for the two most time-consuming phases in the full genome assembly pipeline, namely, k-mer counting and contig generation. A key aspect of our algorithm is its graph data structure, which comprises fat nodes (or what we call “macro-nodes”) that reduce the communication burden during contig generation. We present an extensive performance and qualitative evaluation of our algorithm, including comparisons to other state-of-the-art parallel assemblers. Our results demonstrate the ability to achieve near-linear speedups on up to 8K cores (tested); outperform state-of-the-art distributed memory and shared memory tools in performance while delivering comparable (if not better) quality; and reduce time to solution significantly. For instance, PaKman is able to generate a high-quality set of assembled contigs for complex genomes such as the human and wheat genomes in a matter of minutes on 8K cores.
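For readers unfamiliar with the k-mer counting phase named in this abstract, the following is a minimal serial sketch of the general idea only. It is not PaKman's distributed implementation; the function name `count_kmers` and the toy reads are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads.

    Hypothetical helper for illustration; a real assembler would also
    handle reverse complements, sequencing errors, and data distribution.
    """
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy input: two overlapping reads (made-up data).
reads = ["ACGTAC", "CGTACG"]
kmer_counts = count_kmers(reads, 3)
```

In a distributed setting, the counting table is typically partitioned across processes, which is where the communication cost discussed in the abstract arises.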


Accurate long-read de novo assembly evaluation with Inspector

Genome Biology ◽

10.1186/s13059-021-02527-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yu Chen ◽

Yixin Zhang ◽

Amy Y. Wang ◽

Min Gao ◽

Zechen Chong

Keyword(s):

Genome Assembly ◽

De Novo Assembly ◽

In Silico ◽

Large Scale ◽

De Novo ◽

Small Scale ◽

De Novo Genome Assembly ◽

Consensus Sequences ◽

Assembly Evaluation ◽

Long Read

Abstract: Long-read de novo genome assembly continues to advance rapidly. However, there is a lack of effective tools to accurately evaluate the assembly results, especially for structural errors. We present Inspector, a reference-free long-read de novo assembly evaluator which faithfully reports types of errors and their precise locations. Notably, Inspector can correct the assembly errors based on consensus sequences derived from raw reads covering erroneous regions. Based on in silico and long-read assembly results from multiple long-read data and assemblers, we demonstrate that in addition to providing generic metrics, Inspector can accurately identify both large-scale and small-scale assembly errors.


Gene Annotation and Transcriptome Delineation on a De Novo Genome Assembly for the Reference Leishmania major Friedlin Strain

Genes ◽

10.3390/genes12091359 ◽

2021 ◽

Vol 12 (9) ◽

pp. 1359

Author(s):

Esther Camacho ◽

Sandra González-de la Fuente ◽

Jose C. Solana ◽

Alberto Rastrojo ◽

Fernando Carrasco-Ramiro ◽

...

Keyword(s):

Genome Sequence ◽

Genome Assembly ◽

Molecular Mechanisms ◽

High Throughput Sequencing ◽

Leishmania Major ◽

De Novo ◽

Gene Annotation ◽

Leishmania Species ◽

De Novo Genome Assembly ◽

Sequencing Platforms

Leishmania major is the main causative agent of cutaneous leishmaniasis in humans. The Friedlin strain of this species (LmjF) was chosen when a multi-laboratory consortium undertook the objective of deciphering the first genome sequence for a parasite of the genus Leishmania. The objective was successfully attained in 2005, and this represented a milestone for Leishmania molecular biology studies around the world. Although the LmjF genome sequence was obtained following a shotgun strategy and using classical Sanger sequencing, the results were excellent, and this genome assembly served as the reference for subsequent genome assemblies in other Leishmania species. Here, we present a new assembly for the genome of this strain (named LMJFC for clarity), generated by the combination of two high-throughput sequencing platforms: Illumina short-read sequencing and PacBio Single Molecule Real-Time (SMRT) sequencing, which provides long-read sequences. Apart from resolving uncertain nucleotide positions, several genomic regions were reorganized and a more precise composition of tandemly repeated gene loci was attained. Additionally, the genome annotation was improved by adding 542 genes and defining more accurate coding sequences for around two hundred genes, based on the transcriptome delimitation also carried out in this work. As a result, we are providing gene models (including untranslated regions and introns) for 11,238 genes. Genomic information ultimately determines the biology of every organism; therefore, our understanding of molecular mechanisms will depend on the availability of precise genome sequences and accurate gene annotations. In this regard, this work provides an improved genome sequence and updated transcriptome annotations for the reference L. major Friedlin strain.


A Multireference-Based Whole Genome Assembly for the Obligate Ant-Following Antbird, Rhegmatorhina melanosticta (Thamnophilidae)

Diversity ◽

10.3390/d11090144 ◽

2019 ◽

Vol 11 (9) ◽

pp. 144 ◽

Cited By ~ 4

Author(s):

Laís Coelho ◽

Lukas Musher ◽

Joel Cracraft

Keyword(s):

Genome Assembly ◽

High Throughput Sequencing ◽

Population Genomics ◽

De Novo ◽

Structural Difference ◽

Whole Genome ◽

Sequencing Technology ◽

A Genome ◽

Avian Genomes ◽

Chromosome Level

Current-generation high-throughput sequencing technology has facilitated the generation of more genomic-scale data than ever before, thus greatly improving our understanding of avian biology across a range of disciplines. Recent developments in linked-read sequencing (Chromium 10×) and reference-based whole-genome assembly offer an exciting prospect of more accessible chromosome-level genome sequencing in the near future. We sequenced and assembled a genome of the Hairy-crested Antbird (Rhegmatorhina melanosticta), which represents the first publicly available genome for any antbird (Thamnophilidae). Our objectives were to (1) assemble scaffolds to chromosome level based on multiple reference genomes, and report on differences relative to other genomes, (2) assess genome completeness and compare content to other related genomes, and (3) assess the suitability of linked-read sequencing technology for future studies in comparative phylogenomics and population genomics. Our R. melanosticta assembly was both highly contiguous (de novo scaffold N50 = 3.3 Mb, reference-based N50 = 53.3 Mb) and relatively complete (contained close to 90% of evolutionarily conserved single-copy avian genes and known tetrapod ultraconserved elements). The high contiguity and completeness of this assembly enabled the genome to be successfully mapped to the chromosome level, which uncovered a consistent structural difference between R. melanosticta and other avian genomes. Our results are consistent with the observation that avian genomes are structurally conserved. Additionally, our results demonstrate the utility of linked-read sequencing for non-model genomics. Finally, we demonstrate the value of our R. melanosticta genome for future researchers by mapping reduced representation sequencing data, and by accurately reconstructing the phylogenetic relationships among a sample of thamnophilid species.
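The contiguity figures in this abstract are N50 values. As a hedged illustration of the standard definition (the length of the scaffold at which the running total of lengths, taken from longest to shortest, first reaches half the assembly size), here is a small sketch. The function name `n50` and the example lengths are assumptions, not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the running total
        # to at least half of the total assembly length.
        if running * 2 >= total:
            return length


# Made-up scaffold lengths (arbitrary units), total = 290.
example_lengths = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_lengths)  # 80 + 70 = 150 >= 145, so N50 is 70
```

A higher N50 means that half of the assembly sits in longer pieces, which is why the jump from 3.3 Mb to 53.3 Mb after reference-based scaffolding is reported as a contiguity gain.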


Overlap graph-based generation of haplotigs for diploids and polyploids

10.1101/378356 ◽

2018 ◽

Author(s):

Jasmijn A. Baaijens ◽

Alexander Schönhuth

Keyword(s):

Recent Work ◽

Genome Assembly ◽

De Novo ◽

Iterative Scheme ◽

State Of The Art ◽

Simulated Data ◽

Specific Sequence ◽

New Approach ◽

Link Type ◽

Polyploid Genome

Abstract: Haplotype-aware genome assembly plays an important role in genetics, medicine, and various other disciplines, yet generation of haplotype-resolved de novo assemblies remains a major challenge. Beyond distinguishing between errors and true sequential variants, one needs to assign the true variants to the different genome copies. Recent work has pointed out that the enormous quantities of traditional NGS read data have been greatly underexploited in terms of haplotig computation so far, which reflects that methodology for reference-independent haplotig computation has not yet reached maturity. We present POLYTE (POLYploid genome fitTEr) as a new approach to de novo generation of haplotigs for diploid and polyploid genomes. Our method follows an iterative scheme where in each iteration reads or contigs are joined, based on their interplay in terms of an underlying haplotype-aware overlap graph. Along the iterations, contigs grow while preserving their haplotype identity. Benchmarking experiments on both real and simulated data demonstrate that POLYTE establishes new standards in terms of error-free reconstruction of haplotype-specific sequence. As a consequence, POLYTE outperforms state-of-the-art approaches in various relevant aspects, where advantages become particularly distinct in polyploid settings. POLYTE is freely available as part of the HaploConduct package at https://github.com/HaploConduct/HaploConduct, implemented in Python and C++.


OMGS: Optical Map-based Genome Scaffolding

10.1101/585794 ◽

2019 ◽

Author(s):

Weihua Pan ◽

Tao Jiang ◽

Stefano Lonardi

Keyword(s):

Genome Assembly ◽

Large Scale ◽

De Novo ◽

Optimization Problems ◽

Sequence Assembly ◽

Genome Scaffolding ◽

Contig Sequence ◽

Optical Map ◽

Linkage Information ◽

Optical Maps

Abstract: Due to the current limitations of sequencing technologies, de novo genome assembly is typically carried out in two stages, namely contig (sequence) assembly and scaffolding. While scaffolding is computationally easier than sequence assembly, the scaffolding problem can be challenging due to the high repetitive content of eukaryotic genomes, possible mis-joins in assembled contigs and inaccuracies in the linkage information. Genome scaffolding tools either use paired-end/mate-pair/linked/Hi-C reads or genome-wide maps (optical, physical or genetic) as linkage information. Optical maps (in particular Bionano Genomics maps) have been extensively used in many recent large-scale genome assembly projects (e.g., goat, apple, barley, maize, quinoa, sea bass, among others). However, the most commonly used scaffolding tools have a serious limitation: they can only deal with one optical map at a time, forcing users to alternate or iterate over multiple maps. In this paper, we introduce a novel scaffolding algorithm called OMGS that for the first time can take advantage of multiple optical maps. OMGS solves several optimization problems to generate scaffolds with optimal contiguity and correctness. Extensive experimental results demonstrate that our tool outperforms existing methods when multiple optical maps are available, and produces comparable scaffolds using a single optical map. OMGS can be obtained from https://github.com/ucrbioinfo/OMGS


Correction: Evaluation of Methods for De Novo Genome Assembly from High-Throughput Sequencing Reads Reveals Dependencies That Affect the Quality of the Results

PLoS ONE ◽

10.1371/annotation/bb125f93-80d3-4dd1-adfe-03d9fb740f3b ◽

2011 ◽

Vol 6 (10) ◽

Author(s):

Niina Haiminen ◽

David N. Kuhn ◽

Laxmi Parida ◽

Isidore Rigoutsos

Keyword(s):

High Throughput ◽

Genome Assembly ◽

High Throughput Sequencing ◽

De Novo ◽

De Novo Genome Assembly ◽

Evaluation Of Methods


Correction: Evaluation of Methods for De Novo Genome Assembly from High-Throughput Sequencing Reads Reveals Dependencies That Affect the Quality of the Results

PLoS ONE ◽

10.1371/annotation/176d83be-ed67-4205-9265-7208792d3dcf ◽

2011 ◽

Vol 6 (10) ◽

Author(s):

Niina Haiminen ◽

David N. Kuhn ◽

Laxmi Parida ◽

Isidore Rigoutsos

Keyword(s):

High Throughput ◽

Genome Assembly ◽

High Throughput Sequencing ◽

De Novo ◽

De Novo Genome Assembly ◽

Evaluation Of Methods


A chromosome-level assembly of the Atlantic herring – detection of a supergene and other signals of selection

10.1101/668384 ◽

2019 ◽

Cited By ~ 3

Author(s):

Mats E. Pettersson ◽

Christina M. Rochus ◽

Fan Han ◽

Junfeng Chen ◽

Jason Hill ◽

...

Keyword(s):

Genetic Differentiation ◽

Genome Assembly ◽

Large Scale ◽

De Novo ◽

Hybrid Approach ◽

Ecological Adaptation ◽

Atlantic Herring ◽

Total Size ◽

Bony Fishes ◽

Chromosome Level

Abstract: The Atlantic herring is a model species for exploring the genetic basis for ecological adaptation, due to its huge population size and extremely low genetic differentiation at selectively neutral loci. However, such studies have so far been hampered because of a highly fragmented genome assembly. Here, we deliver a chromosome-level genome assembly based on a hybrid approach combining a de novo PacBio assembly with Hi-C-supported scaffolding. The assembly comprises 26 autosomes with sizes ranging from 12.4 to 33.1 Mb and a total size, in chromosomes, of 726 Mb. The development of a high-resolution linkage map confirmed the global chromosome organization and the linear order of genomic segments along the chromosomes. A comparison of the herring genome assembly with other high-quality assemblies from bony fishes revealed few interchromosomal but frequent intrachromosomal rearrangements. The improved assembly makes the analysis of previously intractable large-scale structural variation more feasible, allowing, for example, the detection of a 7.8 Mb inversion on chromosome 12 underlying ecological adaptation. This supergene shows strong genetic differentiation between populations from the northern and southern parts of the species distribution. The chromosome-based assembly also markedly improves the interpretation of previously detected signals of selection, allowing us to reveal hundreds of independent loci associated with ecological adaptation in the Atlantic herring.


SLAF-seq: An Efficient Method of Large-Scale De Novo SNP Discovery and Genotyping Using High-Throughput Sequencing

PLoS ONE ◽

10.1371/journal.pone.0058700 ◽

2013 ◽

Vol 8 (3) ◽

pp. e58700 ◽

Cited By ~ 381

Author(s):

Xiaowen Sun ◽

Dongyuan Liu ◽

Xiaofeng Zhang ◽

Wenbin Li ◽

Hui Liu ◽

...

Keyword(s):

High Throughput ◽

Efficient Method ◽

Large Scale ◽

High Throughput Sequencing ◽

De Novo ◽

Snp Discovery


An SNP-Based High-Density Genetic Linkage Map for Tetraploid Potato Using Specific Length Amplified Fragment Sequencing (SLAF-Seq) Technology

Agronomy ◽

10.3390/agronomy10010114 ◽

2020 ◽

Vol 10 (1) ◽

pp. 114 ◽

Cited By ~ 2

Author(s):

Xiaoxia Yu ◽

Mingfei Zhang ◽

Zhuo Yu ◽

Dongsheng Yang ◽

Jingwei Li ◽

...

Keyword(s):

Linkage Map ◽

Genetic Linkage ◽

Large Scale ◽

High Throughput Sequencing ◽

Genetic Linkage Map ◽

De Novo ◽

Snp Markers ◽

High Density ◽

Tetraploid Potato ◽

Specific Length

Specific length amplified fragment sequencing (SLAF-seq) is a recently developed high-resolution strategy for large-scale de novo discovery and genotyping of single nucleotide polymorphism (SNP) markers. In the present research, in order to facilitate genome-guided breeding in potato, this strategy was used to develop a large number of SNP markers and construct a high-density genetic linkage map for tetraploid potato. The genomic DNA extracted from 106 F1 individuals derived from a cross between two tetraploid potato varieties (YSP-4 × MIN-021) and from their parents was used for high-throughput sequencing and SLAF library construction. A total of 556.71 Gb of data, which contained 2269.98 million paired-end reads, were obtained after preprocessing. According to bioinformatics analysis, a total of 838,604 SLAF labels were developed, with an average sequencing depth per SLAF of 26.14-fold for parents and 15.36-fold for offspring. In total, 113,473 polymorphic SLAFs were obtained, from which 7638 SLAFs were successfully classified into four segregation patterns. After filtering, a total of 7329 SNP markers were detected for genetic map construction. The final integrated linkage map of tetraploid potato included 3001 SNP markers on 12 linkage groups, and covered 1415.88 cM, with an average distance of 0.47 cM between adjacent markers. To our knowledge, the integrated map described herein has the best coverage of the potato genome and the highest marker density for tetraploid potato. This work provides a foundation for further quantitative trait loci (QTL) location, map-based gene cloning of important traits and marker-assisted selection (MAS) of potato.
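The reported average spacing of 0.47 cM can be checked from the other figures in this abstract. The short sketch below does the arithmetic two ways, since the abstract does not say which denominator was used; the variable names are illustrative assumptions.

```python
# Figures taken from the abstract above.
map_length_cm = 1415.88   # total map length in centimorgans
num_markers = 3001        # SNP markers on the integrated map
num_groups = 12           # linkage groups

# Convention 1: total length divided by the number of markers.
spacing_per_marker = map_length_cm / num_markers

# Convention 2: total length divided by the number of intervals,
# where each linkage group with m markers has m - 1 intervals.
num_intervals = num_markers - num_groups
spacing_per_interval = map_length_cm / num_intervals

print(f"per marker:   {spacing_per_marker:.2f} cM")
print(f"per interval: {spacing_per_interval:.2f} cM")
```

Both conventions round to 0.47 cM, so the stated average is consistent with the map length and marker count under either reading.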
