Assembly by Reduced Complexity (ARC): a hybrid approach for targeted assembly of homologous sequences.

2015 ◽  
Author(s):  
Samual S Hunter ◽  
Robert T Lyon ◽  
Brice A.J. Sarver ◽  
Kayla Hardwick ◽  
Larry J Forney ◽  
...  

Analysis of high-throughput sequencing (HTS) data is a difficult problem, especially in the context of non-model organisms, where comparison of homologous sequences may be hindered by the lack of a close reference genome. Current mapping-based methods rely on the availability of a highly similar reference sequence, whereas de novo assemblies produce anonymous (unannotated) contigs that are not easily compared across samples. Here, we present Assembly by Reduced Complexity (ARC), a hybrid mapping and assembly approach for targeted assembly of homologous sequences. ARC is an open-source project (http://ibest.github.io/ARC/) implemented in the Python language and consists of the following stages: 1) align sequence reads to reference targets, 2) use alignment results to distribute reads into target-specific bins, 3) perform assemblies for each bin (target) to produce contigs, and 4) replace previous reference targets with assembled contigs and iterate. We show that ARC is able to assemble high-quality, unbiased mitochondrial genomes seeded from 11 progressively divergent references, and is able to assemble full mitochondrial genomes starting from short, poor-quality ancient DNA reads. We also show that ARC compares favorably to de novo assembly of a large exome capture dataset in CPU and memory requirements: it assembled 7,627 individual targets across 55 samples, completing over 1.3 million assemblies in less than 78 hours while using under 32 GB of system memory. ARC breaks the assembly problem down into many smaller problems, addressing the anonymous contigs and poor scaling inherent in some de novo assembly methods and the reference bias inherent in traditional read mapping.
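The four stages listed in the abstract can be illustrated with a toy loop. This is only a minimal sketch under strong assumptions, not ARC's actual code: exact k-mer sharing stands in for a real read aligner, a greedy exact-overlap extender stands in for a real assembler, reads are error-free and come from one strand, and all function names (`bin_reads`, `greedy_assemble`, `arc`) are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_reads(reads, targets, k):
    """Stages 1-2: assign each read to the target sharing the most k-mers.

    Assumption: k-mer sharing is a stand-in for a real alignment step.
    Reads that share no k-mer with any target are left unbinned.
    """
    target_kmers = {name: kmers(seq, k) for name, seq in targets.items()}
    bins = defaultdict(list)
    for read in reads:
        read_kmers = kmers(read, k)
        best_name, best_hits = None, 0
        for name, tk in target_kmers.items():
            hits = len(read_kmers & tk)
            if hits > best_hits:
                best_name, best_hits = name, hits
        if best_name is not None:
            bins[best_name].append(read)
    return bins


def overlap(a, b, min_overlap):
    """Length of the longest suffix of a equal to a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_overlap):
    """Stage 3: toy assembler that extends one seed read by exact overlaps."""
    remaining = sorted(set(reads), key=lambda r: (-len(r), r))
    contig = remaining.pop(0)
    extended = True
    while extended and remaining:
        extended = False
        for read in list(remaining):
            if read in contig:  # read adds nothing new
                remaining.remove(read)
                extended = True
                continue
            right = overlap(contig, read, min_overlap)
            left = overlap(read, contig, min_overlap)
            if right > 0 and right >= left:
                contig += read[right:]  # extend to the right
            elif left > 0:
                contig = read[:-left] + contig  # extend to the left
            else:
                continue
            remaining.remove(read)
            extended = True
    return contig


def arc(reads, targets, k=8, min_overlap=8, max_iterations=10):
    """Stage 4: replace targets with assembled contigs and iterate."""
    for _ in range(max_iterations):
        bins = bin_reads(reads, targets, k)
        new_targets = {
            name: greedy_assemble(bins[name], min_overlap) if bins.get(name) else seq
            for name, seq in targets.items()
        }
        if new_targets == targets:  # no target changed, so stop iterating
            break
        targets = new_targets
    return targets
```

Calling `arc(reads, {"locusA": seed_sequence})` with a short seed would recruit only the reads that overlap the seed on the first pass, then progressively more reads as the contig grows. In this toy version that growth is what lets a short or partial reference recover a longer target; tolerating a divergent reference would additionally require the inexact alignment that the k-mer stand-in omits.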

2020 ◽  
Author(s):  
Graham Etherington

De novo assembly of 49 mustelid whole mitochondrial genomes


2010 ◽  
Vol 2 (0) ◽  
pp. 410-424 ◽  
Author(s):  
B. J. McComish ◽  
S. F. K. Hills ◽  
P. J. Biggs ◽  
D. Penny

Plant Disease ◽  
2021 ◽  
Author(s):  
Nourolah Soltani ◽  
Deborah Anne Golino ◽  
Maher Al Rwahnih

Rose leaf rosette-associated virus (RLRaV) is a member of genus Closterovirus, family Closteroviridae. The virus was first discovered in China in 2015 from a wild rose (Rosa multiflora Thunb.) with a mixed infection, showing small leaf rosettes on branches, dieback and severe decline symptoms (He et al. 2015). In 2013, a rose plant (cv. Roses Are Red) was introduced to the Foundation Plant Services (FPS, UC-Davis) rose collection. The plant originated from a private rose breeder collection located in California. In 2019, total nucleic acids (TNA) were isolated from leaf tissues of one asymptomatic plant (Roses Are Red plant) using the MagMax Plant RNA Isolation Kit (Thermo Fisher Scientific, USA). Extracted TNA were screened by reverse-transcription quantitative PCR (RT-qPCR) for six common viruses infecting roses, including prunus necrotic ringspot virus (PNRSV), apple mosaic virus (ApMV), rose spring dwarf associated virus (RSDaV), rose yellow vein virus (RYVV), rose rosette virus (RRV), and blackberry chlorotic ringspot virus (BCRV); however, the results were negative. Therefore, the sample was subjected to high-throughput sequencing (HTS). Briefly, TNA were depleted of rRNA and used for cDNA library preparation with the TruSeq Stranded Total RNA kit (Illumina, USA). HTS was performed on the Illumina NextSeq 500 platform. The raw reads were trimmed, de novo assembled, and subsequently annotated using the tBLASTx algorithm (Al Rwahnih et al. 2018). HTS generated 23.6 million 75-nucleotide (nt) single-end raw reads. De novo assembly generated a contig (16,528 nt) resembling the RLRaV reference sequence (KJ748003) with 74% identity at the nucleotide level. Putative coat protein and heat shock protein 70-like protein were identified based on >90% identity with RLRaV genes.
To confirm the HTS results, RT-PCR was performed using two primer sets: 1) Clo-F4916 (5’-GGTGTTCCAACGCTATCGTG-3’) and Clo-R5215 (5’-TGTCCTCAAACCGCCTACAT-3’), targeting nucleotide sequences of putative polyprotein 1a, and 2) Clo-F10006 (5’-GATTCCGCGGACGAATTAAT-3’) and Clo-R10311 (5’-GGTAACCGAAAGGTAAAGTATTC-3’), targeting nucleotide sequences of putative protein p25. The RLRaV amplicons, with the expected size of 300 nt, were confirmed using bidirectional Sanger sequencing. The near-complete sequence of the new RLRaV isolate was deposited in GenBank under accession number MW056181. In addition, HTS analysis showed that RLRaV was in mixed infection with two mycoviruses (rose cryptic virus with 8,267 mapped reads and rose partitivirus with 7,283 mapped reads). To our knowledge, this is the first report of RLRaV affecting roses in California. Further research is needed to determine the prevalence of RLRaV in California and to evaluate the effect of RLRaV on rose performance.


Genetica ◽  
2018 ◽  
Vol 146 (3) ◽  
pp. 277-285
Author(s):  
Pan Ni ◽  
Ali Akbar Bhuiyan ◽  
Jian-Hai Chen ◽  
Jingjin Li ◽  
Cheng Zhang ◽  
...  

Author(s):  
Alex Schomaker-Bastos ◽  
Francisco Prosdocimi

Next-generation sequencing is now a mature technology, allowing partial animal genomes to be produced for many clades. Although many software tools exist for genome assembly and annotation, a simple pipeline that allows researchers to input raw sequencing reads in FASTQ format and retrieve a completely assembled and annotated mitochondrial genome is still missing. mitoMaker 1.0 is a pipeline developed in Python that performs (i) recursive de novo assembly of mitochondrial genomes using a set of increasing k-mer sizes; (ii) a search for the best-matching result to a target mitogenome; and (iii) iterative reference-based strategies to optimize the assembly. After (iv) checking for circularization and (v) positioning tRNA-Phe at the beginning, (vi) the geneChecker.py module performs a complete annotation of the mitochondrial genome and provides a GenBank-formatted file as output.
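Steps (iv) and (v) can be pictured with a small sketch. It assumes, purely for illustration, that a circular genome assembled as a linear contig repeats its first bases at its end, and that the start coordinate of tRNA-Phe is already known. The function names and the exact-match test are hypothetical and are not taken from mitoMaker.

```python
def circular_overlap(contig, min_overlap=10):
    """Return the length of the longest suffix of contig that equals its prefix.

    Only overlaps up to half the contig length are considered (an assumption
    of this sketch). Returns 0 when no overlap of at least min_overlap exists.
    """
    for n in range(len(contig) // 2, min_overlap - 1, -1):
        if n > 0 and contig[:n] == contig[-n:]:
            return n
    return 0


def circularize(contig, min_overlap=10):
    """Trim the repeated end if found; return (sequence, is_circular)."""
    n = circular_overlap(contig, min_overlap)
    if n:
        return contig[:-n], True
    return contig, False


def rotate_to(sequence, start):
    """Rotate a circular sequence so that it begins at index start."""
    return sequence[start:] + sequence[:start]
```

For example, `circularize(contig)` followed by `rotate_to(sequence, trna_phe_start)` would yield a sequence that starts at the chosen gene, where `trna_phe_start` is a placeholder for a coordinate obtained from annotation.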


2018 ◽  
Vol 35 (12) ◽  
pp. 2066-2074 ◽  
Author(s):  
Yuansheng Liu ◽  
Zuguo Yu ◽  
Marcel E Dinger ◽  
Jinyan Li

Motivation: Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to reduce the size of these datasets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has proven to be one of the most effective read compression approaches. As there is usually no good prior knowledge about the reference sequence, the current focus is on the novel construction of de novo assembled contigs. Results: We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroup those that have the same minimizer. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix–prefix overlap similarity between two contigs. This merging process is repeated after the longer contigs are formed until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases with significant improvement. In the compression of paired-end reads, minicom achieved 20–80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the reads-order preserving mode. This performance is mainly attributed to exploiting the redundancy of the repetitive substrings in the long contigs. Availability and implementation: https://github.com/yuansliu/minicom Supplementary information: Supplementary data are available at Bioinformatics online.
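The first step described above, grouping reads that share a minimizer, can be sketched as follows. This is a simplified illustration and not minicom's implementation: it takes the lexicographically smallest k-mer of the forward strand as the minimizer, whereas a real tool may use hashing or consider reverse complements. The function names are hypothetical.

```python
from collections import defaultdict


def minimizer(read, k):
    """Return the lexicographically smallest k-mer of read (len(read) >= k)."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def group_by_minimizer(reads, k):
    """Bucket reads by their minimizer; reads shorter than k are skipped."""
    groups = defaultdict(list)
    for read in reads:
        if len(read) >= k:
            groups[minimizer(read, k)].append(read)
    return dict(groups)
```

Reads that overlap heavily tend to share the same smallest k-mer, so each bucket is a plausible starting point for building one contig before contigs are merged.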


PLoS ONE ◽  
2015 ◽  
Vol 10 (6) ◽  
pp. e0129154 ◽  
Author(s):  
Atsushi Fukushima ◽  
Michimi Nakamura ◽  
Hideyuki Suzuki ◽  
Kazuki Saito ◽  
Mami Yamazaki

2018 ◽  
Author(s):  
Tobias Andermann ◽  
Alexandre M. Fernandes ◽  
Urban Olsson ◽  
Mats Töpel ◽  
Bernard Pfeil ◽  
...  

Advances in high-throughput sequencing techniques now allow relatively easy and affordable sequencing of large portions of the genome, even for non-model organisms. Many phylogenetic studies reduce costs by focusing their sequencing efforts on a selected set of targeted loci, commonly enriched using sequence capture. The advantage of this approach is that it recovers a consistent set of loci, each with high sequencing depth, which leads to more confidence in the assembly of target sequences. High sequencing depth can also be used to identify phylogenetically informative allelic variation within sequenced individuals, but allele sequences are infrequently assembled in phylogenetic studies. Instead, many scientists perform their phylogenetic analyses using contig sequences, which result from the de novo assembly of sequencing reads into contigs containing only canonical nucleobases, and this may reduce both statistical power and phylogenetic accuracy. Here, we develop an easy-to-use pipeline to recover allele sequences from sequence capture data, and we use simulated and empirical data to demonstrate the utility of integrating these allele sequences into analyses performed under the Multispecies Coalescent (MSC) model. Our empirical analyses of Ultraconserved Element (UCE) locus data collected from the South American hummingbird genus Topaza demonstrate that phased allele sequences carry sufficient phylogenetic information to infer the genetic structure, lineage divergence, and biogeographic history of a genus that diversified during the last three million years. The phylogenetic results support the recognition of two species and suggest a high rate of gene flow across large distances of rainforest habitats but rare admixture across the Amazon River. Our simulations provide evidence that analyzing allele sequences leads to more accurate estimates of tree topology and divergence times than the more common approach of using contig sequences.

