Assembly by Reduced Complexity (ARC): a hybrid approach for targeted assembly of homologous sequences.

10.1101/014662 ◽

2015 ◽

Cited By ~ 17

Author(s):

Samuel S Hunter ◽

Robert T Lyon ◽

Brice A.J. Sarver ◽

Kayla Hardwick ◽

Larry J Forney ◽

...

Keyword(s):

De Novo Assembly ◽

High Throughput Sequencing ◽

De Novo ◽

Hybrid Approach ◽

Reference Sequence ◽

Model Organisms ◽

Exome Capture ◽

Mitochondrial Genomes ◽

Homologous Sequences ◽

Reduced Complexity

Analysis of high-throughput sequencing (HTS) data is a difficult problem, especially in the context of non-model organisms, where comparison of homologous sequences may be hindered by the lack of a close reference genome. Current mapping-based methods rely on the availability of a highly similar reference sequence, whereas de novo assemblies produce anonymous (unannotated) contigs that are not easily compared across samples. Here, we present Assembly by Reduced Complexity (ARC), a hybrid mapping and assembly approach for targeted assembly of homologous sequences. ARC is an open-source project (http://ibest.github.io/ARC/) implemented in the Python language and consists of the following stages: 1) align sequence reads to reference targets, 2) use alignment results to distribute reads into target-specific bins, 3) perform assemblies for each bin (target) to produce contigs, and 4) replace previous reference targets with assembled contigs and iterate. We show that ARC is able to assemble high-quality, unbiased mitochondrial genomes seeded from 11 progressively divergent references, and is able to assemble full mitochondrial genomes starting from short, poor-quality ancient DNA reads. We also show that ARC compares favorably to de novo assembly of a large exome capture dataset in CPU and memory requirements: it assembled 7,627 individual targets across 55 samples, completing over 1.3 million assemblies in less than 78 hours while using under 32 GB of system memory. ARC breaks the assembly problem down into many smaller problems, addressing the anonymous contigs and poor scaling inherent in some de novo assembly methods and the reference bias inherent in traditional read mapping.
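
The four stages listed in the abstract can be illustrated with a toy sketch. This is not ARC's code: the function names (`kmers`, `bin_reads`, `assemble`, `iterate_targets`) are hypothetical, shared exact k-mers stand in for a real read aligner, and a greedy exact-overlap extender stands in for a real assembler. It assumes error-free reads on a single strand.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_reads(reads, targets, k):
    """Stages 1-2 (toy): put each read in the bin of the target sharing most k-mers."""
    target_kmers = {name: kmers(seq, k) for name, seq in targets.items()}
    bins = defaultdict(list)
    for read in reads:
        read_kmers = kmers(read, k)
        best_name, best_hits = None, 0
        for name, tk in target_kmers.items():
            hits = len(read_kmers & tk)
            if hits > best_hits:
                best_name, best_hits = name, hits
        if best_name is not None:  # reads with no shared k-mer stay unbinned
            bins[best_name].append(read)
    return bins


def assemble(seed, reads, k):
    """Stage 3 (toy): greedily extend the seed with reads that overlap its ends exactly."""
    contig = seed
    extended = True
    while extended:
        extended = False
        for read in reads:
            if read in contig:
                continue  # read is already fully represented
            pos = read.find(contig[-k:])
            if pos != -1 and contig.endswith(read[:pos + k]):
                contig += read[pos + k:]  # extend to the right
                extended = True
                continue
            pos = read.find(contig[:k])
            if pos != -1 and contig.startswith(read[pos:]):
                contig = read[:pos] + contig  # extend to the left
                extended = True
    return contig


def iterate_targets(reads, targets, k=4, max_iter=10):
    """Stage 4 (toy): replace targets with contigs and repeat until nothing changes."""
    for _ in range(max_iter):
        bins = bin_reads(reads, targets, k)
        new_targets = {name: assemble(seq, bins.get(name, []), k)
                       for name, seq in targets.items()}
        if new_targets == targets:
            break
        targets = new_targets
    return targets
```

In this sketch the iteration matters because reads that share no k-mer with the initial seed are left unbinned in the first round and can only be recruited once the contig has grown to overlap them.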

Mitochondrial genome assembly v1

10.17504/protocols.io.bqzbmx2n ◽

2020 ◽

Author(s):

Graham Etherington

Keyword(s):

Mitochondrial Genome ◽

Genome Assembly ◽

De Novo Assembly ◽

De Novo ◽

Mitochondrial Genomes

De novo assembly of 49 mustelid whole mitochondrial genomes

Index-Free De Novo Assembly and Deconvolution of Mixed Mitochondrial Genomes

Genome Biology and Evolution ◽

10.1093/gbe/evq029 ◽

2010 ◽

Vol 2 (0) ◽

pp. 410-424 ◽

Cited By ~ 18

Author(s):

B. J. McComish ◽

S. F. K. Hills ◽

P. J. Biggs ◽

D. Penny

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Mitochondrial Genomes

De Novo Assembly of High-Throughput Sequencing Data with Cloud Computing and New Operations on String Graphs

2012 IEEE Fifth International Conference on Cloud Computing ◽

10.1109/cloud.2012.123 ◽

2012 ◽

Cited By ~ 5

Author(s):

Yu-Jung Chang ◽

Chien-Chih Chen ◽

Jan-Ming Ho ◽

Chuen-Liang Chen

Keyword(s):

Cloud Computing ◽

High Throughput ◽

De Novo Assembly ◽

High Throughput Sequencing ◽

De Novo ◽

Sequencing Data ◽

String Graphs ◽

High Throughput Sequencing Data

RAD Paired-End Sequencing for Local De Novo Assembly and SNP Discovery in Non-model Organisms

Data Production and Analysis in Population Genomics - Methods in Molecular Biology™ ◽

10.1007/978-1-61779-870-2_9 ◽

2012 ◽

pp. 135-151 ◽

Cited By ~ 11

Author(s):

Paul D. Etter ◽

Eric Johnson

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Model Organisms ◽

Snp Discovery ◽

Paired End Sequencing

First report of Rose leaf rosette-associated virus infecting rose (Rosa spp.) in California, USA

Plant Disease ◽

10.1094/pdis-10-20-2268-pdn ◽

2021 ◽

Author(s):

Nourolah Soltani ◽

Deborah Anne Golino ◽

Maher Al Rwahnih

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Rna Isolation ◽

Complete Sequence ◽

Nucleotide Sequences ◽

Ringspot Virus ◽

Reference Sequence ◽

Rosa Multiflora ◽

First Report ◽

Putative Coat Protein

Rose leaf rosette-associated virus (RLRaV) is a member of the genus Closterovirus, family Closteroviridae. The virus was first discovered in China in 2015 in a wild rose (Rosa multiflora Thunb.) with a mixed infection, showing small leaf rosettes on branches, dieback, and severe decline symptoms (He et al. 2015). In 2013, a rose plant (cv. Roses Are Red) was introduced to the Foundation Plant Services (FPS, UC-Davis) rose collection. The plant originated from a private rose breeder's collection located in California. In 2019, total nucleic acids (TNA) were isolated from leaf tissues of one asymptomatic plant (the Roses Are Red plant) using the MagMax Plant RNA Isolation Kit (Thermo Fisher Scientific, USA). Extracted TNA were screened by reverse-transcription quantitative PCR (RT-qPCR) for six common viruses infecting roses: prunus necrotic ringspot virus (PNRSV), apple mosaic virus (ApMV), rose spring dwarf associated virus (RSDaV), rose yellow vein virus (RYVV), rose rosette virus (RRV), and blackberry chlorotic ringspot virus (BCRV). All results were negative. Therefore, the sample was subjected to high-throughput sequencing (HTS). Briefly, TNA was depleted of rRNA and used for cDNA library preparation with the TruSeq Stranded Total RNA kit (Illumina, USA). HTS was performed on the Illumina NextSeq 500 platform. The raw reads were trimmed, de novo assembled, and subsequently annotated using the tBLASTx algorithm (Al Rwahnih et al. 2018). HTS generated 23.6 million 75-nucleotide (nt) single-end raw reads. De novo assembly generated a contig (16,528 nt) resembling the RLRaV reference sequence (KJ748003) with 74% identity at the nucleotide level. Putative coat protein and heat shock protein 70-like protein were identified based on >90% identity with RLRaV genes.
To confirm the HTS results, RT-PCR was performed using two primer sets: 1) Clo-F4916 (5’-GGTGTTCCAACGCTATCGTG-3’) and Clo-R5215 (5’-TGTCCTCAAACCGCCTACAT-3’), targeting nucleotide sequences of putative polyprotein 1a, and 2) Clo-F10006 (5’-GATTCCGCGGACGAATTAAT-3’) and Clo-R10311 (5’-GGTAACCGAAAGGTAAAGTATTC-3’), targeting nucleotide sequences of putative protein p25. The RLRaV amplicons, with an expected size of 300 nt, were confirmed using bidirectional Sanger sequencing. The near-complete sequence of the new RLRaV isolate was deposited in GenBank under accession number MW056181. In addition, HTS analysis showed that RLRaV was in mixed infection with two mycoviruses (rose cryptic virus with 8,267 mapped reads and rose partitivirus with 7,283 mapped reads). To our knowledge, this is the first report of RLRaV affecting roses in California. Further research is needed to determine the prevalence of RLRaV in California and to evaluate the effect of RLRaV on rose performance.

De novo assembly of mitochondrial genomes provides insights into genetic diversity and molecular evolution in wild boars and domestic pigs

Genetica ◽

10.1007/s10709-018-0018-y ◽

2018 ◽

Vol 146 (3) ◽

pp. 277-285

Author(s):

Pan Ni ◽

Ali Akbar Bhuiyan ◽

Jian-Hai Chen ◽

Jingjin Li ◽

Cheng Zhang ◽

...

Keyword(s):

Genetic Diversity ◽

Molecular Evolution ◽

De Novo Assembly ◽

De Novo ◽

Mitochondrial Genomes ◽

Wild Boars ◽

Domestic Pigs

mitoMaker: A Pipeline for Automatic Assembly and Annotation of Animal Mitochondria Using Raw NGS Data

10.20944/preprints201808.0423.v1 ◽

2018 ◽

Cited By ~ 5

Author(s):

Alex Schomaker-Bastos ◽

Francisco Prosdocimi

Keyword(s):

Next Generation Sequencing ◽

Mitochondrial Genome ◽

De Novo Assembly ◽

De Novo ◽

Mitochondrial Genomes ◽

Automatic Assembly ◽

Fastq Format ◽

Ngs Data ◽

Generation Sequencing ◽

Animal Genomes

Next-generation sequencing is now a mature technology, allowing partial animal genomes to be produced for many clades. Although many software tools exist for genome assembly and annotation, a simple pipeline that lets researchers input raw sequencing reads in FASTQ format and retrieve a completely assembled and annotated mitochondrial genome is still missing. mitoMaker 1.0 is a pipeline developed in Python that (i) performs recursive de novo assembly of mitochondrial genomes using a set of increasing k-mers; (ii) searches for the best-matching result to a target mitogenome; and (iii) performs iterative reference-based strategies to optimize the assembly. After (iv) checking for circularization and (v) positioning tRNA-Phe at the beginning, (vi) the geneChecker.py module performs a complete annotation of the mitochondrial genome and provides a GenBank-formatted file as output.
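
One common way to check a linear contig for circularization, as in step (iv), is to look for a stretch of sequence repeated at both ends. The sketch below illustrates that idea only. It is an assumption, not mitoMaker's actual implementation, and the function names and the `min_overlap` parameter are hypothetical.

```python
def circular_overlap(contig, min_overlap=10):
    """Return the length of the longest prefix that is also a suffix.

    Only overlaps between min_overlap and half the contig length are
    considered; 0 means no such end overlap was found.
    """
    for n in range(len(contig) // 2, min_overlap - 1, -1):
        if contig[:n] == contig[-n:]:
            return n
    return 0


def trim_circular(contig, min_overlap=10):
    """Drop the redundant end copy if the contig looks circular."""
    n = circular_overlap(contig, min_overlap)
    return contig[:-n] if n else contig
```

A real pipeline would also have to tolerate sequencing errors in the overlap, which this exact-match comparison does not.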

Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression

Bioinformatics ◽

10.1093/bioinformatics/bty936 ◽

2018 ◽

Vol 35 (12) ◽

pp. 2066-2074 ◽

Cited By ~ 11

Author(s):

Yuansheng Liu ◽

Zuguo Yu ◽

Marcel E Dinger ◽

Jinyan Li

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Reference Sequence ◽

Supplementary Information ◽

The Novel ◽

Rna Seq ◽

File Size ◽

Sequencing Technologies ◽

Efficient Storage ◽

Merging Process

Motivation: Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to reduce the size of these datasets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has proven to be one of the most effective read compression approaches. As there is usually no good prior knowledge about the reference sequence, the current focus is on the novel construction of de novo assembled contigs.

Results: We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroup those that have the same minimizer. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix–prefix overlap similarity between two contigs. This merging process is repeated after the longer contigs are formed until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases with significant improvement. In the compression of paired-end reads, minicom achieved a 20–80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the reads-order preserving mode. This performance is mainly attributed to exploiting the redundancy of the repetitive substrings in the long contigs.

Availability and implementation: https://github.com/yuansliu/minicom

Supplementary information: Supplementary data are available at Bioinformatics online.
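
The two minimizer notions named in the abstract can be sketched as follows. This is a minimal illustration, not minicom's code: it assumes plain lexicographic ordering of k-mers on a single strand, whereas real tools may use hashed orderings or also consider reverse complements, and the function names are hypothetical.

```python
from collections import defaultdict


def minimizer(read, k):
    """Return the lexicographically smallest k-mer of a read (its k-minimizer)."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def group_by_minimizer(reads, k):
    """Subgroup reads that share the same k-minimizer."""
    groups = defaultdict(list)
    for read in reads:
        groups[minimizer(read, k)].append(read)
    return dict(groups)


def window_minimizers(seq, w, k):
    """Return sorted (position, k-mer) pairs of the (w, k)-minimizers of seq.

    A (w, k)-minimizer is the smallest k-mer in a window of w consecutive
    k-mers; ties are broken by taking the leftmost k-mer.
    """
    kmer_list = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmer_list) - w + 1):
        window = kmer_list[start:start + w]
        j = min(range(w), key=lambda i: window[i])
        selected.add((start + j, window[j]))
    return sorted(selected)
```

Reads that overlap substantially tend to share a minimizer, which is why grouping by it is a cheap way to collect candidates for building one contig per subgroup.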

High-Throughput Sequencing and De Novo Assembly of Red and Green Forms of the Perilla frutescens var. crispa Transcriptome

PLoS ONE ◽

10.1371/journal.pone.0129154 ◽

2015 ◽

Vol 10 (6) ◽

pp. e0129154 ◽

Cited By ~ 25

Author(s):

Atsushi Fukushima ◽

Michimi Nakamura ◽

Hideyuki Suzuki ◽

Kazuki Saito ◽

Mami Yamazaki

Keyword(s):

High Throughput ◽

De Novo Assembly ◽

High Throughput Sequencing ◽

De Novo ◽

Perilla Frutescens

Allele Phasing Greatly Improves the Phylogenetic Utility of Ultraconserved Elements

10.1101/255752 ◽

2018 ◽

Author(s):

Tobias Andermann ◽

Alexandre M. Fernandes ◽

Urban Olsson ◽

Mats Töpel ◽

Bernard Pfeil ◽

...

Keyword(s):

Statistical Power ◽

High Throughput Sequencing ◽

De Novo ◽

Allelic Variation ◽

Sequencing Depth ◽

High Rate ◽

Model Organisms ◽

Sequence Capture ◽

Phylogenetic Studies ◽

Biogeographic History

Advances in high-throughput sequencing techniques now allow relatively easy and affordable sequencing of large portions of the genome, even for non-model organisms. Many phylogenetic studies reduce costs by focusing their sequencing efforts on a selected set of targeted loci, commonly enriched using sequence capture. The advantage of this approach is that it recovers a consistent set of loci, each with high sequencing depth, which leads to more confidence in the assembly of target sequences. High sequencing depth can also be used to identify phylogenetically informative allelic variation within sequenced individuals, but allele sequences are infrequently assembled in phylogenetic studies. Instead, many scientists perform their phylogenetic analyses using contig sequences that result from the de novo assembly of sequencing reads into contigs containing only canonical nucleobases, and this may reduce both statistical power and phylogenetic accuracy. Here, we develop an easy-to-use pipeline to recover allele sequences from sequence capture data, and we use simulated and empirical data to demonstrate the utility of integrating these allele sequences into analyses performed under the Multispecies Coalescent (MSC) model. Our empirical analyses of Ultraconserved Element (UCE) locus data collected from the South American hummingbird genus Topaza demonstrate that phased allele sequences carry sufficient phylogenetic information to infer the genetic structure, lineage divergence, and biogeographic history of a genus that diversified during the last three million years. The phylogenetic results support the recognition of two species, and suggest a high rate of gene flow across large distances of rainforest habitats but rare admixture across the Amazon River. Our simulations provide evidence that analyzing allele sequences leads to more accurate estimates of tree topology and divergence times than the more common approach of using contig sequences.
