Reducing the number of artifactual repeats in de novo assembly of RNA-Seq data by optimizing the assembly pipeline

Gene Reports ◽  
2017 ◽  
Vol 9 ◽  
pp. 7-12
Author(s):  
Wei-Kang Lee ◽  
Nur Afiza Mohd Zainuddin ◽  
Hui-Ying Teh ◽  
Yi-Yi Lim ◽  
Mohd Uzair Jaafar ◽  
...  
PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3702 ◽  
Author(s):  
Santiago Montero-Mendieta ◽  
Manfred Grabherr ◽  
Henrik Lantz ◽  
Ignacio De la Riva ◽  
Jennifer A. Leonard ◽  
...  

Whole genome sequencing (WGS) is a very valuable resource to understand the evolutionary history of poorly known species. However, in organisms with large genomes, as most amphibians, WGS is still excessively challenging and transcriptome sequencing (RNA-seq) represents a cost-effective tool to explore genome-wide variability. Non-model organisms do not usually have a reference genome and the transcriptome must be assembled de-novo. We used RNA-seq to obtain the transcriptomic profile for Oreobates cruralis, a poorly known South American direct-developing frog. In total, 550,871 transcripts were assembled, corresponding to 422,999 putative genes. Of those, we identified 23,500, 37,349, 38,120 and 45,885 genes present in the Pfam, EggNOG, KEGG and GO databases, respectively. Interestingly, our results suggested that genes related to immune system and defense mechanisms are abundant in the transcriptome of O. cruralis. We also present a pipeline to assist with pre-processing, assembling, evaluating and functionally annotating a de-novo transcriptome from RNA-seq data of non-model organisms. Our pipeline guides the inexperienced user in an intuitive way through all the necessary steps to build de-novo transcriptome assemblies using readily available software and is freely available at: https://github.com/biomendi/TRANSCRIPTOME-ASSEMBLY-PIPELINE/wiki.


2018 ◽  
Author(s):  
Simon Roux ◽  
Gareth Trubl ◽  
Danielle Goudeau ◽  
Nandita Nath ◽  
Estelle Couradeau ◽  
...  

Background. Metagenomics has transformed our understanding of microbial diversity across ecosystems, with recent advances enabling de novo assembly of genomes from metagenomes. These metagenome-assembled genomes are critical to provide ecological, evolutionary, and metabolic context for all the microbes and viruses yet to be cultivated. Metagenomes can now be generated from nanogram to subnanogram amounts of DNA. However, these libraries require several rounds of PCR amplification before sequencing, and recent data suggest these typically yield smaller and more fragmented assemblies than regular metagenomes. Methods. Here we evaluate de novo assembly methods of 169 PCR-amplified metagenomes, including 25 for which an unamplified counterpart is available, to optimize specific assembly approaches for PCR-amplified libraries. We first evaluated coverage bias by mapping reads from PCR-amplified metagenomes onto reference contigs obtained from unamplified metagenomes of the same samples. Then, we compared different assembly pipelines in terms of assembly size (number of bp in contigs ≥ 10kb) and error rates to evaluate which are the best suited for PCR-amplified metagenomes. Results. Read mapping analyses revealed that the depth of coverage within individual genomes is significantly more uneven in PCR-amplified datasets versus unamplified metagenomes, with regions of high depth of coverage enriched in short inserts. This enrichment scales with the number of PCR cycles performed, and is presumably due to preferential amplification of short inserts. Standard assembly pipelines are confounded by this type of coverage unevenness, so we evaluated other assembly options to mitigate these issues. 
We found that a pipeline combining read deduplication and an assembly algorithm originally designed to recover genomes from libraries generated after whole genome amplification (single-cell SPAdes) frequently improved assembly of contigs ≥ 10kb by 10 to 100-fold for low input metagenomes. Conclusions. PCR-amplified metagenomes have enabled scientists to explore communities traditionally challenging to describe, including some with extremely low biomass or from which DNA is particularly difficult to extract. Here we show that a modified assembly pipeline can lead to an improved de novo genome assembly from PCR-amplified datasets, and enables a better genome recovery from low input metagenomes.
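The two ideas in this abstract, removing duplicate reads before assembly and scoring an assembly by the number of bp in contigs ≥ 10 kb, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: exact-sequence deduplication stands in for whatever deduplication tool they used, the function names are hypothetical, and the example lengths are invented.

```python
def deduplicate_reads(reads):
    """Keep the first occurrence of each exact read sequence, preserving order.

    Assumption: exact-match deduplication is used here only as a simple
    stand-in; the abstract does not say which tool or criterion was applied.
    """
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


def assembly_size(contig_lengths, min_len=10_000):
    """Total number of bp in contigs that are at least min_len long."""
    return sum(length for length in contig_lengths if length >= min_len)


def fold_improvement(baseline_lengths, new_lengths, min_len=10_000):
    """Ratio of assembly sizes (new / baseline); None if the baseline is zero."""
    baseline = assembly_size(baseline_lengths, min_len)
    if baseline == 0:
        return None  # a fold change is undefined without any long baseline contig
    return assembly_size(new_lengths, min_len) / baseline


if __name__ == "__main__":
    # Hypothetical reads and contig lengths, for illustration only.
    print(deduplicate_reads(["ACGT", "ACGT", "TTGA", "ACGT"]))
    print(assembly_size([9_999, 10_000, 25_000]))
    print(fold_improvement([12_000, 500], [60_000, 60_000, 800]))
```

Comparing assemblies by the summed length of long contigs, rather than by contig count, rewards pipelines that recover large genome fragments, which is the property the abstract reports improving.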


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Wei Zhou ◽  
Qi Chen ◽  
Xiao-Bing Wang ◽  
Tyler O. Hughes ◽  
Jian-Jun Liu ◽  
...  

An amendment to this paper has been published and can be accessed via a link at the top of the paper.


BMC Genomics ◽  
2014 ◽  
Vol 15 (1) ◽  
pp. 453 ◽  
Author(s):  
Steven A Yates ◽  
Martin T Swain ◽  
Matthew J Hegarty ◽  
Igor Chernukin ◽  
Matthew Lowe ◽  
...  

2017 ◽  
Vol 18 (1) ◽  
Author(s):  
You-Yu Lin ◽  
Chia-Hung Hsieh ◽  
Jiun-Hong Chen ◽  
Xuemei Lu ◽  
Jia-Horng Kao ◽  
...  

BMC Genomics ◽  
2015 ◽  
Vol 16 (1) ◽  
Author(s):  
Shimna Sudheesh ◽  
Timothy I. Sawbridge ◽  
Noel OI Cogan ◽  
Peter Kennedy ◽  
John W. Forster ◽  
...  
Keyword(s):  
De Novo ◽  

2016 ◽  
Author(s):  
Cédric Cabau ◽  
Frédéric Escudié ◽  
Anis Djari ◽  
Yann Guiguen ◽  
Julien Bobe ◽  
...  

Background. De novo transcriptome assembly of short reads is now a common step in expression analysis of organisms lacking a reference genome sequence. Several software packages are available to perform this task. Even if their results are of good quality, it is still possible to improve them in several ways, including redundancy reduction and error correction. Trinity and Oases are two commonly used de novo transcriptome assemblers. The contig sets they produce are of good quality. Still, their compaction (number of contigs needed to represent the transcriptome) and their quality (chimera and nucleotide error rates) can be improved. Results. We built a de novo RNA-Seq Assembly Pipeline (DRAP) which wraps these two assemblers (Trinity and Oases) in order to improve their results regarding the above-mentioned criteria. DRAP reduces the number of resulting contigs of the assemblies by 1.3- to 15-fold, depending on the read set and the assembler used. This article presents seven assembly comparisons showing, in some cases, drastic improvements when using DRAP. DRAP does not significantly impair assembly quality metrics such as read realignment rate or protein reconstruction counts. Conclusion. Transcriptome assembly is a challenging computational task. Even though good solutions are already available to end-users, these solutions can still be improved while conserving the overall representation and quality of the assembly. The de novo RNA-Seq Assembly Pipeline (DRAP) is an easy-to-use software package that produces compact and corrected transcript sets. DRAP is free, open-source and available at http://www.sigenae.org/drap.
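The compaction figure quoted in this abstract is a ratio of contig counts before and after processing. A minimal Python sketch of that arithmetic follows; the function name and the example counts are hypothetical and are not taken from the DRAP paper.

```python
def compaction_fold(contigs_before, contigs_after):
    """Fold reduction in contig count: contigs_before / contigs_after."""
    if contigs_after <= 0:
        raise ValueError("contigs_after must be a positive count")
    return contigs_before / contigs_after


if __name__ == "__main__":
    # Hypothetical contig counts for a raw and a compacted assembly.
    raw, compacted = 150_000, 30_000
    print(f"{compaction_fold(raw, compacted):.1f}-fold fewer contigs")
```

A fold value on its own says nothing about quality, which is why the abstract pairs it with read realignment rate and protein reconstruction counts.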


Blood ◽  
2016 ◽  
Vol 128 (22) ◽  
pp. 195-195
Author(s):  
David Mosen-Ansorena ◽  
Rachael Bashford-Rogers ◽  
Niccolo Bolli ◽  
Stephane Minvielle ◽  
Florence Magrangeas ◽  
...  

Abstract Introduction Although monoclonal immunoglobulin (Ig) production by myeloma cells is one of the central features of the disease, genotypic identification of the clonal Ig sequence remains understudied in multiple myeloma (MM). Here, using extensive RNA-seq data, we study molecular features of clonal Ig rearrangements, as well as their association with other MM markers and patient outcome. Methods We performed deep RNA-seq on purified CD138+ MM cells from 429 newly-diagnosed uniformly-treated patients with long clinical follow-up. For each sample, we performed de novo assembly using sequences that appeared in the library with a frequency of at least one in a million. Germline V and J genes were then BLASTed against the assembled contigs to determine the clonal germline genes and pinpoint mutations. Using the sequences reconstructed from the Ig contigs and the BLAST output, we ran IgBLAST to fully characterize the predominant Ig V(D)J sequence. Results We tested the accuracy of our approach by looking at 24 technical duplicates and one triplicate. In all cases, the predicted gene and gene allele were consistent across replicates. Next, we evaluated our large patient cohort, identifying IGHV3 as the most common clonal VH gene subgroup (53.3%), followed by IGHV4 (17.8%) and IGHV1 (15.6%). Importantly, we observed a significant association between poorer prognosis and IGHV3, both for progression-free survival (PFS) (p=0.0019) and overall survival (OS) (p=0.012). IGHV3-30 (11%, the most commonly rearranged VH gene) and IGHV3-9 (4.8%) were the drivers behind this poor prognosis (IGHV3-30: PFS p=0.021; OS p=0.013) (IGHV3-9: PFS p=0.002). IGHV3-30 was even more preferentially rearranged than in normal B-cell VH repertoires from previous studies (8.5%, 6.3%) and ours (2%). Remarkably, these results sharply contrast with what has been observed in CLL. 
In this malignancy, IGHV3-30 use has been seen to be underrepresented and usually characterizes an indolent clinical course, while IGHV3-21 and possibly IGHV3-23 carry poor prognosis. We predicted light chain usage through the presence of clonal VL sequences. The most frequent VL genes were from the κ locus (69.4% total): IGKV1-33 (12.4%), IGKV1-5 (11.3%), IGKV3-20 (9.9%) and IGKV1-39 (8.0%). Del(22q) was observed more frequently in patients with IGλ (OR=10.0, p=6e-15) and, within this group, del(22q) was more frequent if Vλ belonged to the more centromeric V-clusters C or B, in contrast to cluster A (OR=8.4, p=5e-4). Remarkably, patients with Vλ gene from cluster A presented worse OS (vs. Vκ: p=0.0079; vs. Vλ B,C: p=0.067). The proportion of mutated bases was higher in the heavy chain than in the light chain (mean 7.0% vs. 4.8%, max 14.6% vs. 14.3%), and it was associated with OS (heavy p=0.0020, light p=0.036, both p=0.0056), but not PFS. Interestingly, mutated Ig in CLL results in a more benign clinical course. We further found that 24.9% and 22.7% of the mutations lay within WRCY or RGYW AID motifs in the light and heavy chains respectively (enrichment p<1e-16), while AID mutations in a TW or WA context accounted for 22.9% and 25.7% (p=0.14, p=0.64). Higher ratios of mutations in WRCY vs. RGYW motifs within the light chain were highly predictive of poor prognosis (PFS p=0.0019, OS p=6.3e-4). Strikingly, IGλ usage was linked to higher ratios (p=3e-6), an association not explained by germline sequence variability (p=0.24). The usage of IGHV3 genes and the AID WRCY/RGYW motif ratio were independent markers of each other (p=1) and of other markers of poor prognosis in MM, such as presence of either t(4;14) or del(17p) (IGHV3 p=0.10; motif ratio p=0.49). In conclusion, de novo Ig heavy and light chain assembly using RNA-seq identifies interesting biology, may provide MM markers and highlights a novel application of high-throughput genomics. 
Disclosures Anderson: OncoPep Inc.: Equity Ownership, Membership on an entity's Board of Directors or advisory committees. Avet-Loiseau: Sanofi: Consultancy; Celgene: Consultancy; Amgen: Consultancy; Janssen: Consultancy.
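The WRCY and RGYW motif counts discussed in this abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' method: a mutation is counted as "within" a motif if its position is covered by any 4-mer matching the motif on the germline sequence, the motifs are expanded with the standard IUPAC codes (W = A/T, R = A/G, Y = C/T), and the function names and example sequence are hypothetical.

```python
import re

# Lookahead patterns so that overlapping motif occurrences are all found.
WRCY = re.compile(r"(?=[AT][AG]C[CT])")
RGYW = re.compile(r"(?=[AG]G[CT][AT])")


def motif_positions(seq, pattern, motif_len=4):
    """Return the set of 0-based positions covered by any motif match."""
    covered = set()
    for match in pattern.finditer(seq.upper()):
        covered.update(range(match.start(), match.start() + motif_len))
    return covered


def classify_mutations(germline, mutated_positions):
    """Count mutations inside WRCY, inside RGYW, or in neither motif.

    A position covered by both motifs is counted in both categories.
    """
    in_wrcy = motif_positions(germline, WRCY)
    in_rgyw = motif_positions(germline, RGYW)
    counts = {"WRCY": 0, "RGYW": 0, "other": 0}
    for pos in mutated_positions:
        if pos in in_wrcy:
            counts["WRCY"] += 1
        if pos in in_rgyw:
            counts["RGYW"] += 1
        if pos not in in_wrcy and pos not in in_rgyw:
            counts["other"] += 1
    return counts


def wrcy_rgyw_ratio(counts):
    """Ratio of WRCY to RGYW mutation counts; None if there are no RGYW hits."""
    return counts["WRCY"] / counts["RGYW"] if counts["RGYW"] else None


if __name__ == "__main__":
    # Hypothetical germline sequence with one WRCY (AACT) and one RGYW (GGTA).
    germline = "TTTTAACTTTTTGGTATTTT"
    counts = classify_mutations(germline, [6, 13, 0])
    print(counts, wrcy_rgyw_ratio(counts))
```

Counting both categories for positions covered by both motifs avoids an arbitrary tie-break for palindromic hotspots such as AGCT, which match WRCY and RGYW at once.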

