Reference Quality Assembly of the 3.5 Gb genome of Capsicum annuum from a Single Linked-Read Library

2017 ◽  
Author(s):  
Amanda M. Hulse-Kemp ◽  
Shamoni Maheshwari ◽  
Kevin Stoffel ◽  
Theresa A. Hill ◽  
David Jaffe ◽  
...  

Abstract Background Linked-Read sequencing technology has recently been employed successfully for de novo assembly of multiple human genomes; however, the utility of this technology for complex plant genomes is unproven. We evaluated the technology for this purpose by sequencing the 3.5 gigabase (Gb) diploid pepper (Capsicum annuum) genome with a single Linked-Read library. Plant genomes, including pepper, are characterized by long, highly similar repetitive sequences. Accordingly, significant effort is used to ensure the sequenced plant is highly homozygous and the resulting assembly is a haploid consensus. With a phased assembly approach, we targeted a heterozygous F1 derived from a wide cross to assess the ability to derive both haplotypes for a pungency gene characterized by a large insertion/deletion. Results The Supernova software generated a highly ordered, more contiguous sequence assembly than all currently available C. annuum reference genomes. Eighty-four percent of the final assembly was anchored and oriented using four de novo linkage maps. A comparison of the annotation of conserved eukaryotic genes indicated the completeness of assembly. The validity of the phased assembly is further demonstrated with the complete recovery of both 2.5 kb insertion/deletion haplotypes of the PUN1 locus in the F1 sample that represents pungent and non-pungent peppers. Conclusions The most contiguous pepper genome assembly to date has been generated through this work, which demonstrates that Linked-Read library technology provides a rapid tool to assemble de novo complex, highly repetitive, heterozygous plant genomes. This technology can provide an opportunity to cost-effectively develop high-quality reference genome assemblies for other complex plants and compare structural and gene differences through accurate haplotype reconstruction.

2018 ◽  
Author(s):  
Michael J Roach ◽  
Simon Schmidt ◽  
Anthony R Borneman

Abstract Recent developments in third-gen long read sequencing and diploid-aware assemblers have resulted in the rapid release of numerous reference-quality assemblies for diploid genomes. However, assembling highly heterozygous genomes still faces a major problem when the two haplotypes for a region are highly polymorphic and the synteny is not recognised during assembly. This causes issues with downstream analysis, for example variant discovery using the haploid assembly, or haplotype reconstruction using the diploid assembly. A new pipeline, Purge Haplotigs, was developed specifically for third-gen assemblies to identify and reassign the duplicate contigs. The pipeline takes a draft haplotype-fused assembly or a diploid assembly, together with read alignments, to produce an improved assembly. The pipeline was tested on a simulated dataset and on four recent diploid (phased) de novo assemblies from third-generation long-read sequencing. All assemblies after processing with Purge Haplotigs were less duplicated with minimal impact on genome completeness. The software is available at https://bitbucket.org/mroachawri/purge_haplotigs under a permissive MIT licence.


2020 ◽  
Vol 10 (10) ◽  
pp. 3811-3819 ◽  
Author(s):  
Austin Compton ◽  
Jiangtao Liang ◽  
Chujia Chen ◽  
Varvara Lukyanchikova ◽  
Yumin Qi ◽  
...  

Chromosome-level assemblies are accumulating in various taxonomic groups including mosquitoes. However, even in the few reference-quality mosquito assemblies, a significant portion of the heterochromatic regions including telomeres remains unresolved. Here we produce a de novo assembly of the New World malaria mosquito, Anopheles albimanus, by integrating Oxford Nanopore sequencing, Illumina, Hi-C and optical mapping. This 172.6 Mbp female assembly, which we call AalbS3, is obtained by scaffolding polished large contigs (contig N50 = 13.7 Mbp) into three chromosomes. All chromosome arms end with telomeric repeats, which is a first for mosquito assemblies and represents a significant step toward the completion of a genome assembly. These telomeres consist of tandem repeats of a novel 30-32 bp Telomeric Repeat Unit (TRU) and are confirmed by analyzing the termini of long reads and through both chromosomal in situ hybridization and a Bal31 sensitivity assay. The AalbS3 assembly included previously uncharacterized centromeric and rDNA clusters and more than doubled the content of transposable elements and other repetitive sequences. This telomere-to-telomere assembly, although still containing gaps, represents a significant step toward resolving biologically important but previously hidden genomic components. The comparison of different scaffolding methods will also inform future efforts to obtain reference-quality genomes for other mosquito species.


Author(s):  
Richard Finkers ◽  
Martijn van Kaauwen ◽  
Kai Ament ◽  
Karin Burger-Meijer ◽  
Raymond Egging ◽  
...  

Abstract Onion is an important vegetable crop with an estimated genome size of 16 Gb. We describe the de novo assembly and ab initio annotation of the genome of a doubled haploid onion line DHCU066619, which resulted in a final assembly of 14.9 Gb with an N50 of 464 Kb. Of this, 2.4 Gb was ordered into 8 pseudomolecules using four genetic linkage maps. The remainder of the genome is available in 89.6 K scaffolds. Only 72.4% of the genome could be identified as repetitive sequences, which consist, to a large extent, of (retro)transposons. In addition, an estimated 20% of the putative (retro)transposons had accumulated a large number of mutations, hampering their identification but facilitating their assembly. These elements are probably already quite old. The ab initio gene prediction indicated 540,925 putative gene models, which is far more than expected, possibly due to the presence of pseudogenes. Of these models, 47,066 showed RNA-Seq support. No gene-rich regions were found; genes are uniformly distributed over the genome. Analysis of synteny with A. sativum (garlic) showed collinearity but also major rearrangements between both species. This assembly is the first high-quality genome sequence available for the study of onion and will be a valuable resource for further research.
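
Several of the abstracts on this page report contiguity as an N50 value (for example, the 464 Kb figure above). As a reading aid, the following is a minimal sketch of how N50 is conventionally computed from a list of contig or scaffold lengths: sort the lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the example lengths are illustrative assumptions, not data or code from any of the listed papers.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all bases in the assembly.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly: seven contigs totalling 300 bases.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, it says nothing about correctness; a misjoined assembly can have a high N50, which is why the abstracts pair it with linkage maps, gene-completeness checks, or k-mer based validation.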


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Myung-Shin Kim ◽  
Geun Young Chae ◽  
Soohyun Oh ◽  
Jihyun Kim ◽  
Hyunggon Mang ◽  
...  

Abstract Background Peppers (Capsicum annuum L.) containing distinct capsaicinoids are the most widely cultivated spices in the world. However, extreme genomic diversity among species represents an obstacle to breeding pepper. Results Here, we report de novo genome assemblies of Capsicum annuum ‘Early Calwonder (non-pungent, ECW)’ and ‘Small Fruit (pungent, SF)’ along with their annotations. In total, we assembled 2.9 Gb of ECW and SF genome sequences, representing over 91% of the estimated genome sizes. Structural and functional annotation of the two pepper genomes generated about 35,000 protein-coding genes each, of which 93% were assigned putative functions. Comparison between newly and publicly available pepper gene annotations revealed both shared and specific gene content. In addition, a comprehensive analysis of nucleotide-binding and leucine-rich repeat (NLR) genes through whole-genome alignment identified five significant regions of NLR copy number variation (CNV). Detailed comparisons of those regions revealed that these CNVs were generated by intra-specific genomic variations that accelerated diversification of NLRs among peppers. Conclusions Our analyses unveil an evolutionary mechanism responsible for generating CNVs of NLRs among pepper accessions, and provide novel genomic resources for functional genomics and molecular breeding of disease resistance in Capsicum species.


Life ◽  
2021 ◽  
Vol 11 (11) ◽  
pp. 1234
Author(s):  
Avi Titievsky ◽  
Yuliya A. Putintseva ◽  
Elizaveta A. Taranenko ◽  
Sofya Baskin ◽  
Natalia V. Oreshkova ◽  
...  

Repetitive elements (RE) and transposons (TE) can comprise up to 80% of some plant genomes and may be essential for regulating their evolution and adaptation. The “repeatome” information is often unavailable in assembled genomes because genomic areas of repeats are challenging to assemble and are often missing from final assembly. However, raw genomic sequencing data contain rich information about RE/TEs. Here, raw genomic NGS reads of 10 gymnosperm species were studied for the content and abundance patterns of their “repeatome”. We utilized a combination of alignment on databases of repetitive elements and de novo assembly of highly repetitive sequences from genomic sequencing reads to characterize and calculate the abundance of known and putative repetitive elements in the genomes of 10 conifer plants: Pinus taeda, Pinus sylvestris, Pinus sibirica, Picea glauca, Picea abies, Abies sibirica, Larix sibirica, Juniperus communis, Taxus baccata, and Gnetum gnemon. We found that genome abundances of known and newly discovered putative repeats are specific to phylogenetically close groups of species and match biological taxa. The grouping of species based on abundances of known repeats closely matches the grouping based on abundances of newly discovered putative repeats (kChains) and matches the known taxonomic relations.


2018 ◽  
Author(s):  
Shivani Mahajan ◽  
Kevin Wei ◽  
Matthew Nalley ◽  
Lauren Giblisco ◽  
Doris Bachtrog

While short-read sequencing technology has resulted in a sharp increase in the number of species with genome assemblies, these assemblies are typically highly fragmented. Repeats pose the largest challenge for reference genome assembly, and pericentromeric regions and the repeat-rich Y chromosome are typically excluded from sequencing projects. Here, we assemble the genome of Drosophila miranda using long reads for contig formation, chromatin interaction maps for scaffolding, and short reads, optical mapping and BAC clone sequencing for consensus validation. Our assembly recovers entire chromosomes and contains large fractions of repetitive DNA, including ~41.5 Mb of pericentromeric and telomeric regions, and >100 Mb of the recently formed highly repetitive neo-Y chromosome. While Y chromosome evolution is typically characterized by global sequence loss and shrinkage, the neo-Y increased in size by almost 3-fold, due to the accumulation of repetitive sequences. Our high-quality assembly allows us to reconstruct the chromosomal events that have led to the unusual sex chromosome karyotype in D. miranda, including the independent de novo formation of a pair of sex chromosomes at two distinct time points, or the reversion of a former Y chromosome to an autosome.


GigaScience ◽  
2021 ◽  
Vol 10 (3) ◽  
Author(s):  
Anton Zamyatin ◽  
Pavel Avdeyev ◽  
Jiangtao Liang ◽  
Atashi Sharma ◽  
Chujia Chen ◽  
...  

Abstract Background Anopheles coluzzii and Anopheles arabiensis belong to the Anopheles gambiae complex and are among the major malaria vectors in sub-Saharan Africa. However, chromosome-level reference genome assemblies are still lacking for these medically important mosquito species. Findings In this study, we produced de novo chromosome-level genome assemblies for A. coluzzii and A. arabiensis using the long-read Oxford Nanopore sequencing technology and the Hi-C scaffolding approach. We obtained 273.4 and 256.8 Mb of the total assemblies for A. coluzzii and A. arabiensis, respectively. Each assembly consists of 3 chromosome-scale scaffolds (X, 2, 3), complete mitochondrion, and unordered contigs identified as autosomal pericentromeric DNA, X pericentromeric DNA, and Y sequences. Comparison of these assemblies with the existing assemblies for these species demonstrated that we obtained improved reference-quality genomes. The new assemblies allowed us to identify genomic coordinates for the breakpoint regions of fixed and polymorphic chromosomal inversions in A. coluzzii and A. arabiensis. Conclusion The new chromosome-level assemblies will facilitate functional and population genomic studies in A. coluzzii and A. arabiensis. The presented assembly pipeline will accelerate progress toward creating high-quality genome references for other disease vectors.


Author(s):  
Anton Zamyatin ◽  
Pavel Avdeyev ◽  
Jiangtao Liang ◽  
Atashi Sharma ◽  
Chujia Chen ◽  
...  

Abstract Background Anopheles coluzzii and An. arabiensis belong to the An. gambiae complex and are among the major malaria vectors in Sub-Saharan Africa. However, chromosome-level reference genome assemblies are still lacking for these medically important mosquito species. Findings In this study, we produced de novo chromosome-level genome assemblies for An. coluzzii and An. arabiensis using the long-read Oxford Nanopore sequencing technology and the Hi-C scaffolding approach. We obtained 273.4 Mbp and 265.7 Mbp assemblies for An. coluzzii and An. arabiensis, respectively. Each assembly consists of three chromosome-scale scaffolds (X, 2, 3), complete mitochondrion, and unordered contigs identified as autosomal pericentromeric DNA, X pericentromeric DNA, and Y sequences. Comparison of these assemblies with the existing assemblies for these species demonstrated that we obtained improved reference-quality genomes. The new assemblies allowed us to identify genomic coordinates for the breakpoint regions of fixed and polymorphic chromosomal inversions in An. coluzzii and An. arabiensis. Conclusion The new chromosome-level assemblies will facilitate functional and population genomic studies in An. coluzzii and An. arabiensis. The presented assembly pipeline will accelerate progress toward creating high-quality genome references for other disease vectors.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Arang Rhie ◽  
Brian P. Walenz ◽  
Sergey Koren ◽  
Adam M. Phillippy

Abstract Recent long-read assemblies often exceed the quality and completeness of available reference genomes, making validation challenging. Here we present Merqury, a novel tool for reference-free assembly evaluation based on efficient k-mer set operations. By comparing k-mers in a de novo assembly to those found in unassembled high-accuracy reads, Merqury estimates base-level accuracy and completeness. For trios, Merqury can also evaluate haplotype-specific accuracy, completeness, phase block continuity, and switch errors. Multiple visualizations, such as k-mer spectrum plots, can be generated for evaluation. We demonstrate on both human and plant genomes that Merqury is a fast and robust method for assembly validation.
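
The k-mer comparison described in the Merqury abstract can be illustrated with a toy calculation. The sketch below is not Merqury's code or interface; it is a simplified Python illustration that assumes plain (non-canonical) k-mer sets built from short strings and ignores k-mer multiplicity. It follows the published idea that, if a fraction `shared / total` of assembly k-mers is also seen in the reads, the per-base probability of being correct is roughly that fraction raised to the power `1/k`, which is then converted to a Phred-scaled quality value. The function names and the example sequences are hypothetical.

```python
import math


def kmer_set(sequence, k):
    """Return the set of all k-mers in a sequence (no canonicalisation)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def consensus_qv(assembly_kmers, read_kmers, k):
    """Phred-scaled consensus quality estimated from k-mer support."""
    total = len(assembly_kmers)
    if total == 0:
        raise ValueError("assembly has no k-mers")
    shared = len(assembly_kmers & read_kmers)
    # Probability that a single base is correct, given that a k-mer
    # is supported only if all k of its bases are correct.
    p_base_correct = (shared / total) ** (1.0 / k)
    error = 1.0 - p_base_correct
    if error == 0.0:
        return math.inf  # every assembly k-mer is supported by reads
    return -10.0 * math.log10(error)


def completeness(assembly_kmers, reliable_read_kmers):
    """Fraction of reliable read k-mers that appear in the assembly."""
    if not reliable_read_kmers:
        raise ValueError("no read k-mers supplied")
    found = len(assembly_kmers & reliable_read_kmers)
    return found / len(reliable_read_kmers)


# Hypothetical toy data: the assembly differs from the reads at one base.
K = 3
reads = kmer_set("ACGTACGGTCA", K)
assembly = kmer_set("ACGTACGCTCA", K)
print(round(consensus_qv(assembly, reads, K), 2))
print(completeness(assembly, reads))
```

In practice the k-mer size is much larger and counts come from a dedicated k-mer counter, so the numbers printed here only show the shape of the calculation.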


2020 ◽  
Vol 11 ◽  
Author(s):  
Soma S. Marla ◽  
Pallavi Mishra ◽  
Ranjeet Maurya ◽  
Mohar Singh ◽  
Dhammaprakash Pandhari Wankhede ◽  
...  

Genome assembly of short reads from large plant genomes remains a challenge in computational biology despite major developments in next generation sequencing. Of late, several draft assemblies have been reported in sequenced plant genomes. The reported draft genome assemblies of Cajanus cajan have different levels of genome completeness, a large number of repeats, gaps, and segmental duplications. Draft assemblies with portions of genome missing are shorter than the referenced original genome. These assemblies come with low map accuracy, affecting further functional annotation and the prediction of gene components as desired by crop researchers. Genome coverage, i.e., the number of sequenced raw reads mapped onto a certain location of the genome, is an important quality indicator of completeness and assembly quality in draft assemblies. The present work aimed to improve the coverage in reported de novo sequenced draft genomes (GCA_000340665.1 and GCA_000230855.2) of pigeonpea, a legume widely cultivated in India. The two recently sequenced assemblies, A1 and A2, comprised 72% and 75% of the estimated coverage of the genome, respectively. We employed an assembly reconciliation approach to compare the draft assemblies and merge them, filling the gaps by employing an algorithm that sorts mate-pair libraries by size to generate a high-quality and near-complete assembly with enhanced contiguity. The majority of gaps present within scaffolds were filled with right-sized mate-pair reads. The improved assembly reduced the number of gaps relative to those reported in the draft assemblies, resulting in an improved genome coverage of 82.4%. Map accuracy of the improved assembly was evaluated using various quality metrics and for the presence of specific trait-related functional genes. The paired-end and mate-pair local libraries we employed helped us to reduce gaps, repeats, and other sequence errors, resulting in lengthier scaffolds compared to the two draft assemblies. We reported the prediction of putative host resistance genes against Fusarium wilt disease by their performance and evaluated them under both wet-laboratory and field phenotypic conditions.
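
The abstract above uses "coverage" both for reads mapped at a location and for a percentage of the genome. These are usually distinguished as depth (how many reads overlap a position) and breadth (what fraction of positions is covered at all). The sketch below computes both from a list of alignment intervals; it is an illustration only, and it does not claim to reproduce how the authors obtained their 72%, 75%, or 82.4% figures. The function names, the 0-based half-open interval convention, and the toy alignments are assumptions.

```python
def depth_per_base(genome_length, alignments):
    """Per-position read depth from (start, end) alignment intervals.

    Intervals are assumed to be 0-based and half-open: [start, end).
    Parts of an interval that fall outside the genome are ignored.
    """
    depth = [0] * genome_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    return depth


def mean_depth(depth):
    """Average number of reads covering each position."""
    return sum(depth) / len(depth)


def breadth(depth, min_depth=1):
    """Fraction of positions covered by at least min_depth reads."""
    covered = sum(1 for d in depth if d >= min_depth)
    return covered / len(depth)


# Hypothetical toy example: a 10 bp genome and three mapped reads.
toy_alignments = [(0, 4), (2, 6), (8, 10)]
toy_depth = depth_per_base(10, toy_alignments)
print(toy_depth)
print(mean_depth(toy_depth), breadth(toy_depth))
```

Real pipelines derive these values from alignment files with dedicated tools; the nested loop here is chosen for readability and would be too slow for a genome-scale dataset.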

