Rapid low-cost assembly of the Drosophila melanogaster reference genome using low-coverage, long-read sequencing

2018 ◽  
Author(s):  
Edwin A. Solares ◽  
Mahul Chakraborty ◽  
Danny E. Miller ◽  
Shannon Kalsow ◽  
Kate Hall ◽  
...  

ABSTRACT Accurate and comprehensive characterization of genetic variation is essential for deciphering the genetic basis of diseases and other phenotypes. A vast amount of genetic variation stems from large-scale sequence changes arising from the duplication, deletion, inversion, and translocation of sequences. In the past 10 years, high-throughput short reads have greatly expanded our ability to assay sequence variation due to single nucleotide polymorphisms. However, a recent de novo assembly of a second Drosophila melanogaster reference genome has revealed that short read genotyping methods miss hundreds of structural variants, including those affecting phenotypes. While genomes assembled using high-coverage long reads can achieve high levels of contiguity and completeness, concerns about cost, errors, and low yield have limited widespread adoption of such sequencing approaches. Here we resequenced the reference strain of D. melanogaster (ISO1) on a single Oxford Nanopore MinION flow cell run for 24 hours. Using only reads longer than 1 kb or with at least 30x coverage, we assembled a highly contiguous de novo genome. The addition of inexpensive paired reads and subsequent scaffolding using an optical map technology achieved an assembly with completeness and contiguity comparable to the D. melanogaster reference assembly. Comparison of our assembly to the reference assembly of ISO1 uncovered a number of structural variants (SVs), including novel LTR transposable element insertions and duplications affecting genes with developmental, behavioral, and metabolic functions. Collectively, these SVs provide a snapshot of the dynamics of genome evolution. Furthermore, our assembly and comparison to the D. melanogaster reference genome demonstrate that high-quality de novo assembly of reference genomes and comprehensive variant discovery using such assemblies are now possible by a single lab for under $1,000 (USD).

2021 ◽  
Author(s):  
Myung-Shin Kim ◽  
Taeyoung Lee ◽  
Jeonghun Baek ◽  
Ji Hong Kim ◽  
Changhoon Kim ◽  
...  

Abstract Massive resequencing efforts have been undertaken to catalog allelic variants in major crop species including soybean, but the scope of the information for genetic variation often depends on short sequence reads mapped to the extant reference genome. Additional de novo assembled genome sequences provide a unique opportunity to explore a dispensable genome fraction in the pan-genome of a species. Here, we report the de novo assembly and annotation of Hwangkeum, a popular soybean cultivar in Korea. The assembly was constructed using PromethION nanopore sequencing data and two genetic maps, and was then error-corrected using Illumina short-reads and PacBio SMRT reads. The 933.12 Mb assembly was annotated with 79,870 transcripts for 58,550 genes using RNA-Seq data and the public soybean annotation set. Comparison of the Hwangkeum assembly with the Williams 82 soybean reference genome sequence revealed 1.8 million single-nucleotide polymorphisms, 0.5 million indels, and 25 thousand putative structural variants. However, there was no natural megabase-scale chromosomal rearrangement. Incidentally, by adding two novel groups, we found that soybean contains four clearly separated groups of centromeric satellite repeats. Analyses of satellite repeats and gene content suggested that the Hwangkeum assembly is a high-quality assembly. This was further supported by comparison of the marker arrangement of anthocyanin biosynthesis genes and of gene arrangement at the Rsv3 locus. Therefore, the results indicate that the de novo assembly of Hwangkeum is a valuable additional reference genome resource for characterizing traits for the improvement of this important crop species.


2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Baohua Chen ◽  
Zhixiong Zhou ◽  
Qiaozhen Ke ◽  
Yidi Wu ◽  
Huaqiang Bai ◽  
...  

Abstract Larimichthys crocea is an endemic marine fish in East Asia that belongs to Sciaenidae in Perciformes. L. crocea has now been recognized as an “iconic” marine fish species in China because it is not only a popular food fish in China but also a representative victim of overfishing that still provides high-value fish products supported by the modern large-scale mariculture industry. Here, we report a chromosome-level reference genome of L. crocea generated by employing the PacBio single molecule sequencing technique (SMRT) and high-throughput chromosome conformation capture (Hi-C) technologies. The genome sequences were assembled into 1,591 contigs with a total length of 723.86 Mb and a contig N50 length of 2.83 Mb. After chromosome-level scaffolding, 24 scaffolds were constructed with a total length of 668.67 Mb (92.48% of the total length). Genome annotation identified 23,657 protein-coding genes and 7,262 ncRNAs. This highly accurate, chromosome-level reference genome of L. crocea provides an essential genome resource to support the development of genome-scale selective breeding and restocking strategies of L. crocea.
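
For readers unfamiliar with the contig N50 statistic quoted in the abstract above, the following is a minimal Python sketch of how it is commonly computed: sort contig lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the toy lengths are illustrative assumptions, not data or code from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (made-up lengths, arbitrary units): total = 32, half = 16.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))  # the two 8-unit contigs already cover half
```

Because the statistic depends only on the sorted lengths, the same function applies to scaffolds as well as contigs.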


Insects ◽  
2020 ◽  
Vol 11 (2) ◽  
pp. 101
Author(s):  
Miao Wang ◽  
Hanyu Li ◽  
Huoqing Zheng ◽  
Liuwei Zhao ◽  
Xiaofeng Xue ◽  
...  

The invasion of Vespa velutina presents a great threat to the agricultural economy, the ecological environment, and human health. An effective strategy for controlling this hornet is urgently required, but the limited genome information of Vespa velutina restricts the application of molecular-genomic tools for targeted hornet management. Therefore, we conducted large-scale transcriptome profiling of the hornet brain to obtain functional target genes and molecular markers. Using an Illumina HiSeq platform, more than 41 million clean reads were obtained and de novo assembled into 182,087 meaningful unigenes. A total of 56,400 unigenes were annotated against publicly available protein sequence databases, and a set of reliable Simple Sequence Repeat (SSR) and Single Nucleotide Polymorphism (SNP) markers was developed. The homologous genes encoding crucial behavior regulation factors, odorant binding proteins (OBPs), and vitellogenin were also identified from highly expressed transcripts. This study provides abundant molecular targets and markers for invasive hornet control and further promotes the genetic and molecular study of Vespa velutina.


2018 ◽  
Vol 8 (10) ◽  
pp. 3143-3154 ◽  
Author(s):  
Edwin A. Solares ◽  
Mahul Chakraborty ◽  
Danny E. Miller ◽  
Shannon Kalsow ◽  
Kate Hall ◽  
...  

2019 ◽  
Author(s):  
Xin Zhou ◽  
Lu Zhang ◽  
Xiaodong Fang ◽  
Yichen Liu ◽  
David L. Dill ◽  
...  

Abstract Human diploid genome assembly enables identifying maternal and paternal genetic variations. Algorithms based on 10x linked-read sequencing have been developed for de novo assembly, variant calling and haplotyping. Another linked-read technology, single tube long fragment read (stLFR), has recently provided a low-cost single tube solution that can enable long fragment data. However, no existing software is available for human diploid assembly and variant calls. We develop Aquila stLFR to adapt to the key characteristics of stLFR. Aquila stLFR assembles near perfect diploid assembled contigs, and the assembly-based variant calling shows that Aquila stLFR detects large numbers of structural variants which were not easily spanned by Illumina short-reads. Furthermore, the hybrid assembly mode Aquila hybrid allows a hybrid assembly based on both stLFR and 10x linked-reads libraries, demonstrating that these two technologies can always be complementary to each other for assembly to improve contiguity and variant detection, regardless of assembly quality of the library itself from single sequencing technology. The overlapped structural variants (SVs) from two independent sequencing data of the same individual, and the SVs from hybrid assemblies provide us a high-confidence profile to study them. Availability: Source code and documentation are available at https://github.com/maiziex/Aquila_stLFR.


2019 ◽  
Author(s):  
Karen H. Miga ◽  
Sergey Koren ◽  
Arang Rhie ◽  
Mitchell R. Vollger ◽  
Ariel Gershman ◽  
...  

After nearly two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no one chromosome has been finished end to end, and hundreds of unresolved gaps persist [1,2]. The remaining gaps include ribosomal rDNA arrays, large near-identical segmental duplications, and satellite DNA arrays. These regions harbor largely unexplored variation of unknown consequence, and their absence from the current reference genome can lead to experimental artifacts and hide true variants when re-sequencing additional human genomes. Here we present a de novo human genome assembly that surpasses the continuity of GRCh38 [2], along with the first gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome [3], we reconstructed the ∼2.8 megabase centromeric satellite DNA array and closed all 29 remaining gaps in the current reference, including new sequence from the human pseudoautosomal regions and cancer-testis ampliconic gene families (CT-X and GAGE). This complete chromosome X, combined with the ultra-long nanopore data, also allowed us to map methylation patterns across complex tandem repeats and satellite arrays for the first time. These results demonstrate that finishing the human genome is now within reach and will enable ongoing efforts to complete the remaining human chromosomes.


2021 ◽  
Author(s):  
Sonja Kersten ◽  
Jiyang Chang ◽  
Christian D Huber ◽  
Yoav Voichek ◽  
Christa Lanz ◽  
...  

Repeated herbicide applications exert enormous selection on blackgrass (Alopecurus myosuroides), a major weed in cereal crops of the temperate climate zone including Europe. This inadvertent large-scale experiment gives us the opportunity to look into the underlying genetic mechanisms and evolutionary processes of rapid adaptation, which can occur both through mutations in the direct targets of herbicides and through changes in other, often metabolic, pathways, known as non-target-site resistance. How much either type of adaptation relies on de novo mutations versus pre-existing standing variation is important for developing strategies to manage herbicide resistance. We generated a chromosome-level reference genome for A. myosuroides for population genomic studies of herbicide resistance and genome-wide diversity across Europe in this species. Bulked-segregant analysis evidenced that non-target-site resistance has a complex genetic architecture. Through empirical data and simulations, we showed that, despite its simple genetics, target-site resistance mainly results from standing genetic variation, with only a minor role for de novo mutations.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5261 ◽  
Author(s):  
Robert A. Petit ◽  
Timothy D. Read

Low-cost Illumina sequencing of clinically important bacterial pathogens has generated thousands of publicly available genomic datasets. Analyzing these genomes and extracting relevant information for each pathogen and the associated clinical phenotypes requires not only resources and bioinformatic skills but organism-specific knowledge. In light of these issues, we created Staphopia, an analysis pipeline, database and application programming interface, focused on Staphylococcus aureus, a common colonizer of humans and a major antibiotic-resistant pathogen responsible for a wide spectrum of hospital and community-associated infections. Written in Python, Staphopia’s analysis pipeline consists of submodules running open-source tools. It accepts raw FASTQ reads as an input, which undergo quality control filtration, error correction and reduction to a maximum of approximately 100× chromosome coverage. This reduction significantly reduces total runtime without detrimentally affecting the results. The pipeline performs de novo assembly-based and mapping-based analysis. Automated gene calling and annotation is performed on the assembled contigs. Read-mapping is used to call variants (single nucleotide polymorphisms and insertion/deletions) against a reference S. aureus chromosome (N315, ST5). We ran the analysis pipeline on more than 43,000 S. aureus shotgun Illumina genome projects in the public European Nucleotide Archive database in November 2017. We found that only a quarter of known multi-locus sequence types (STs) were represented but the top 10 STs made up 70% of all genomes. Methicillin-resistant S. aureus (MRSA) were 64% of all genomes. Using the Staphopia database we selected 380 high quality genomes deposited with good metadata, each from a different multi-locus ST, as a non-redundant diversity set for studying S. aureus evolution.
In addition to answering basic science questions, Staphopia could serve as a potential platform for rapid clinical diagnostics of S. aureus isolates in the future. The system could also be adapted as a template for other organism-specific databases.
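
The coverage reduction step described in the abstract above can be illustrated with a small Python sketch. It only shows the arithmetic that such a step might rely on: estimate coverage as sequenced bases divided by genome size, then keep the fraction of reads needed to reach the cap. The function name `subsample_fraction`, the 2.8 Mb genome size, and the read totals are assumptions for illustration; this is not Staphopia's actual implementation.

```python
def subsample_fraction(total_bases, genome_size, target_coverage=100.0):
    """Fraction of reads to keep so that coverage does not exceed the target.

    Coverage is estimated as total sequenced bases / genome size.
    Returns 1.0 when the data set is already at or below the target.
    """
    if total_bases <= 0 or genome_size <= 0:
        raise ValueError("total_bases and genome_size must be positive")
    coverage = total_bases / genome_size
    return min(1.0, target_coverage / coverage)


# Hypothetical run: 700 Mb of reads against an assumed ~2.8 Mb chromosome
# gives about 250x coverage, so roughly 40% of reads would be kept.
fraction = subsample_fraction(total_bases=700_000_000, genome_size=2_800_000)
print(f"keep {fraction:.0%} of reads")
```

A run that is already below the cap returns 1.0, so it passes through unchanged.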


2020 ◽  
Author(s):  
Brianna Chrisman ◽  
Kelley Paskov ◽  
Nate Stockham ◽  
Kevin Tabatabaei ◽  
Jae-Yoon Jung ◽  
...  

ABSTRACT The evolutionary dynamics of SARS-CoV-2 have been carefully monitored since the COVID-19 pandemic began in December 2019; however, analysis has focused primarily on single nucleotide polymorphisms and largely ignored the role of structural variants (SVs) as well as recombination in SARS-CoV-2 evolution. Using sequences from the GISAID database, we catalogue over 100 insertions and deletions in the SARS-CoV-2 consensus sequences. We hypothesize that these indels are artifacts of imperfect homologous recombination between SARS-CoV-2 replicates, and provide four independent pieces of evidence. (1) The SVs from the GISAID consensus sequences are clustered at specific regions of the genome. (2) These regions are also enriched for 5’ and 3’ breakpoints in the transcription regulatory site (TRS) independent transcriptome, presumably sites of RNA-dependent RNA polymerase (RdRp) template-switching. (3) Within raw reads, these structural variant hotspots have cases of both high intra-host heterogeneity and intra-host homogeneity, suggesting that these structural variants are both consequences of de novo recombination events within a host and artifacts of previous recombination. (4) Within the RNA secondary structure, the indels occur in “arms” of the predicted folded RNA, suggesting that secondary structure may be a mechanism for TRS-independent template-switching in SARS-CoV-2 or other coronaviruses. These insights into the relationship between structural variation and recombination in SARS-CoV-2 can improve our reconstructions of the SARS-CoV-2 evolutionary history as well as our understanding of the process of RdRp template-switching in RNA viruses.


2014 ◽  
Vol 43 (D1) ◽  
pp. D690-D697 ◽  
Author(s):  
G. dos Santos ◽  
A. J. Schroeder ◽  
J. L. Goodman ◽  
V. B. Strelets ◽  
M. A. Crosby ◽  
...  
