Compression of short-read sequences using path encoding

2014 ◽  
Author(s):  
Carl Kingsford ◽  
Rob Patro

Storing, transmitting, and archiving the amount of data produced by next-generation sequencing is becoming a significant computational burden. For example, large-scale RNA-seq meta-analyses may now routinely process tens of terabytes of sequence. We present here an approach to biological sequence compression that reduces the difficulty of managing the data produced by large-scale transcriptome sequencing. Our approach offers a new direction by sitting between pure reference-based compression and reference-free compression, combining much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs (a common task in genome assembly) and context-dependent arithmetic coding. Supporting this method is a system, called a bit tree, for compactly storing sets of k-mers, which is of independent interest. Using these techniques, we are able to encode RNA-seq reads in 3% to 11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than recent competing approaches. We also show that good compression can still be achieved even when the reference is very poorly matched to the reads being encoded.
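The core idea can be sketched in a few lines of Python (a hypothetical minimal illustration, not the authors' implementation): a read is stored as its first k-mer plus, for each subsequent base, that base's rank among the successors present in a reference de Bruijn graph. Wherever the graph allows only one continuation, the rank is always 0, which costs almost nothing under context-dependent arithmetic coding. The function names and the escape-rank scheme for bases absent from the graph are assumptions made for this sketch.

```python
K = 4
BASES = "ACGT"

def build_kmer_set(reference, k=K):
    """Collect all k-mers of a reference sequence (the de Bruijn graph nodes)."""
    return {reference[i:i + k] for i in range(len(reference) - k + 1)}

def encode_read(read, kmers, k=K):
    """Return (first k-mer, list of successor ranks) for a read."""
    ranks = []
    for i in range(k, len(read)):
        prefix = read[i - k + 1:i]                 # (k-1)-suffix of current k-mer
        successors = [b for b in BASES if prefix + b in kmers]
        base = read[i]
        # Rank within the graph's successors when possible; otherwise an
        # escape rank over all four bases (4 + base index).
        ranks.append(successors.index(base) if base in successors
                     else 4 + BASES.index(base))
    return read[:k], ranks

def decode_read(first_kmer, ranks, kmers, k=K):
    """Invert encode_read using the same reference k-mer set."""
    read = first_kmer
    for r in ranks:
        prefix = read[-(k - 1):]
        successors = [b for b in BASES if prefix + b in kmers]
        read += successors[r] if r < 4 else BASES[r - 4]
    return read

# Demo: a read that follows the reference yields all-zero ranks,
# which an arithmetic coder compresses to almost nothing.
ref = "ACGTACGGTACGT"
kmers = build_kmer_set(ref)
first, ranks = encode_read("ACGTACGGT", kmers)
assert first == "ACGT" and ranks == [0, 0, 0, 0, 0]
assert decode_read(first, ranks, kmers) == "ACGTACGGT"
```

Reads that diverge from the reference still round-trip via the escape ranks; they simply cost more bits, which mirrors the paper's observation that even a poorly matched reference still yields usable compression.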

Author(s):  
Ting-Hsuan Wang ◽  
Cheng-Ching Huang ◽  
Jui-Hung Hung

Abstract
Motivation: Cross-sample comparisons and large-scale meta-analyses based on next-generation sequencing (NGS) require replicable and universal data preprocessing, including the removal of adapter fragments from contaminated reads (adapter trimming). Modern adapter trimmers require users to provide candidate adapter sequences for each sample, but these are sometimes unavailable or falsely documented in repositories such as GEO or SRA, so large-scale meta-analyses are jeopardized by suboptimal adapter trimming.
Results: Here we introduce a set of fast and accurate adapter detection and trimming algorithms that require no a priori adapter sequences. The algorithms are implemented in modern C++ with SIMD and multithreading for high throughput. Our experiments and benchmarks show that the implementation, EARRINGS, without being given any hint of the adapter sequences, reaches accuracy comparable to, and throughput higher than, existing adapter trimmers. EARRINGS is particularly useful for meta-analyses of large batches of datasets and can be incorporated into sequence analysis pipelines at any scale.
Availability and implementation: EARRINGS is open-source software and is available at https://github.com/jhhung/EARRINGS.
Supplementary information: Supplementary data are available at Bioinformatics online.
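As a toy illustration of a priori-free adapter detection (this is not the EARRINGS algorithm, whose internals are not described in the abstract; all names and thresholds below are invented for the sketch): when the same adapter is ligated to many reads, its k-mers are heavily over-represented, so a candidate adapter can be recovered by taking the most frequent k-mer as a seed and extending it by per-position majority vote.

```python
from collections import Counter

def infer_adapter(reads, k=8, min_support=0.2):
    """Infer an adapter with no prior knowledge of its sequence.

    Picks the most over-represented k-mer as a seed (assumed to come from
    the adapter) and extends it by per-position majority vote over the
    read suffixes that follow the seed.
    """
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    seed, n = counts.most_common(1)[0]
    if n < min_support * len(reads):
        return None                      # nothing over-represented: no adapter
    suffixes = [r[r.index(seed):] for r in reads if seed in r]
    adapter, pos = [], 0
    while True:
        column = Counter(s[pos] for s in suffixes if pos < len(s))
        if not column:
            break
        base, votes = column.most_common(1)[0]
        if votes < min_support * len(reads):
            break
        adapter.append(base)
        pos += 1
    return "".join(adapter)

def trim(read, adapter, seed_len=8):
    """Cut a read at the first occurrence of the adapter's leading bases."""
    i = read.find(adapter[:seed_len])
    return read[:i] if i >= 0 else read

# Demo: five diverse inserts, each read ending in the same adapter;
# a suffix of the adapter is recovered without being supplied up front.
adapter = "AGATCGGAAGAGC"
reads = [ins + adapter for ins in
         ["ACGTCCTG", "TTGACCAA", "GGCATAAC", "CATGCTTG", "TGCAGGTA"]]
inferred = infer_adapter(reads)
assert inferred and adapter.endswith(inferred) and len(inferred) >= 8
assert trim(reads[0], adapter) == "ACGTCCTG"
```

A real trimmer must additionally handle partial adapter read-through, sequencing errors, and paired-end overlap; the sketch only conveys why no candidate sequence needs to be supplied.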


Gigabyte ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-7
Author(s):  
Bruno C. Genevcius ◽  
Tatiana T. Torres

Chinavia impicticornis is a neotropical stink bug of economic importance for various crops. Little is known about the development of the species, or the genetic mechanisms that may favor the establishment of populations in cultivated plants. Here, we conduct the first large-scale molecular study of C. impicticornis. Using tissues derived from the genitalia and the rest of the body for two immature stages of both males and females, we generated RNA-seq data, then assembled and functionally annotated a transcriptome. The de novo-assembled transcriptome contained around 400,000 contigs, with an average length of 688 bp. After pruning duplicated sequences and conducting a functional annotation, the final annotated transcriptome comprised 39,478 transcripts, of which 12,665 were assigned to Gene Ontology (GO) terms. These novel datasets will be invaluable for the discovery of molecular processes related to morphogenesis and immature biology. We hope to contribute to the growing body of research on stink bug evolution and development, as well as to the development of biorational pest management solutions.


2021 ◽  
Vol 17 (11) ◽  
pp. e1010036
Author(s):  
Paulo Vieira ◽  
Roxana Y. Myers ◽  
Clement Pellegrin ◽  
Catherine Wram ◽  
Cedar Hesse ◽  
...  

The burrowing nematode, Radopholus similis, is an economically important plant-parasitic nematode that inflicts damage and yield loss on a wide range of crops. This migratory endoparasite is widely distributed in warmer regions and causes extensive destruction to the root systems of important food crops (e.g., citrus, banana). Despite the economic importance of this nematode, little is known about the repertoire of effectors encoded by this species. Here we combined spatially and temporally resolved next-generation sequencing datasets of R. similis to select a list of candidate effector genes for this species. We confirmed the spatial expression of transcripts of 30 new candidate effectors within the esophageal glands of R. similis by in situ hybridization, revealing a large number of pioneer genes specific to this nematode. We identified a gland promoter motif specifically associated with the subventral glands (named the Rs-SUG box), a putative hallmark of spatial and concerted regulation of these effectors. Nematode transcriptome analyses confirmed the expression of these effectors during the interaction with the host, with a large number of pioneer genes being especially abundant. Our data revealed that R. similis holds a diverse and emergent repertoire of effectors, which has been shaped by various evolutionary events, including neofunctionalization, horizontal gene transfer, and possibly de novo gene birth. In addition, we report the first GH62 gene discovered in any metazoan, putatively acquired by horizontal gene transfer from a bacterial donor. Considering the economic damage caused by R. similis, this information provides valuable data for elucidating the mode of parasitism of this nematode.


2016 ◽  
Author(s):  
Alan Medlar ◽  
Laura Laakso ◽  
Andreia Miraldo ◽  
Ari Löytynoja

Abstract
High-throughput RNA-seq data have become ubiquitous in the study of non-model organisms, but their use in comparative analysis remains a challenge. Without a reference genome for mapping, sequence data have to be de novo assembled, producing large numbers of short, highly redundant contigs. Preparing these assemblies for comparative analyses requires removing redundant isoforms, assigning orthologs, and converting fragmented transcripts into gene alignments. In this article we present Glutton, a novel tool to process transcriptome assemblies for downstream evolutionary analyses. Glutton takes as input a set of fragmented, possibly erroneous transcriptome assemblies. Utilising phylogeny-aware alignment and reference data from a closely related species, it reconstructs one transcript per gene, finds orthologous sequences, and produces accurate multiple alignments of coding sequences. We present a comprehensive analysis of Glutton's performance across a wide range of divergence times between study and reference species. We demonstrate the impact the choice of assembler has on both the number of alignments and the correctness of ortholog assignment, and show substantial improvements over heuristic methods, without sacrificing correctness. Finally, using inference of Darwinian selection as an example of downstream analysis, we show that Glutton-processed RNA-seq data give results comparable to those obtained from full-length gene sequences, even with distantly related reference species. Glutton is available from http://wasabiapp.org/software/glutton/ and is licensed under the GPLv3.


2021 ◽  
Author(s):  
Fawaz Dabbaghie ◽  
Jana Ebler ◽  
Tobias Marschall

Abstract
Motivation: With the fast development of third-generation sequencing machines, de novo genome assembly is becoming routine, even for larger genomes. Graph-based representations of genomes arise both as part of the assembly process and in the context of pangenomes representing a population. In both cases, polymorphic loci lead to bubble structures in such graphs. Detecting bubbles is hence an important task when working with genomic variants in the context of genome graphs.
Results: Here, we present a fast general-purpose tool, called BubbleGun, for detecting bubbles and superbubbles in genome graphs. Furthermore, BubbleGun detects and outputs runs of linearly connected bubbles and superbubbles, which we call bubble chains. We showcase its utility on de Bruijn graphs and compare our results to vg's snarl detection. We show that BubbleGun is considerably faster than vg, especially on bigger graphs, where it reports all bubbles in less than 30 minutes on a human-sample de Bruijn graph of around 2 million nodes.
Availability: BubbleGun is available and documented at https://github.com/fawaz-dabbaghieh/bubble_gun under the MIT license.
Contact: [email protected] or [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
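The simplest structure such tools look for can be shown in a few lines (a minimal sketch of the bubble concept only, not BubbleGun's actual algorithm, which also handles superbubbles and chains): a source node whose two single-node branches reconverge at one sink, the shape a SNP produces in a de Bruijn graph.

```python
def find_simple_bubbles(graph):
    """graph: dict mapping node -> list of successor nodes.

    Returns (source, branch1, branch2, sink) tuples for the simplest
    bubble shape: two one-node branches that diverge at a source and
    rejoin at a single sink.
    """
    bubbles = []
    for s, outs in graph.items():
        if len(outs) != 2:
            continue
        a, b = outs
        # Each branch must be a single node with exactly one successor,
        # and both branches must rejoin at the same sink.
        if len(graph.get(a, [])) == 1 and len(graph.get(b, [])) == 1 \
                and graph[a][0] == graph[b][0]:
            bubbles.append((s, a, b, graph[a][0]))
    return bubbles

# A SNP-like bubble: S -> A -> T and S -> B -> T.
g = {"S": ["A", "B"], "A": ["T"], "B": ["T"], "T": []}
assert find_simple_bubbles(g) == [("S", "A", "B", "T")]
```

Superbubbles generalize this to branches that are arbitrary subgraphs with a single entrance and exit, which is why dedicated algorithms are needed at the scale of millions of nodes.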


2021 ◽  
Vol 17 (7) ◽  
pp. e1009229
Author(s):  
Yuansheng Liu ◽  
Jinyan Li

Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of large-scale sequencing data. We present a novel graph definition, named the Hamming-Shifting graph, to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines and aims to link all pairs of distinct reads that have a small Hamming distance, a small shifting offset, or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the lightest-weight edges, and we prove a very high probability of successfully detecting these edges. The resulting graph creates a full mutual reference among the reads, cascading a code-minimized transfer of every child read for optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph and achieved a 10% to 30% greater reduction in file size than the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.
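The edge criterion can be illustrated with a small sketch (hypothetical code, not the paper's implementation, which uses minimal-k-mer indexing to avoid all-pairs comparison): two reads are linked when one aligns to the other with a small shift and few mismatches, so the "child" read can later be stored as a short diff against its parent. The cost function combining shift and mismatch counts is an assumed proxy for that diff size.

```python
def edge_weight(r1, r2, max_shift=2, max_mismatch=2):
    """Return the smallest shift+mismatch cost linking two reads, or None.

    Tries every offset in [-max_shift, max_shift]; a pair qualifies for an
    edge only if some offset leaves at most max_mismatch mismatches in the
    overlapping region.
    """
    best = None
    for shift in range(-max_shift, max_shift + 1):
        a = r1[max(0, shift):]           # r2 shifted by `shift` relative to r1
        b = r2[max(0, -shift):]
        n = min(len(a), len(b))
        mismatches = sum(x != y for x, y in zip(a[:n], b[:n]))
        if mismatches <= max_mismatch:
            cost = abs(shift) + mismatches
            if best is None or cost < best:
                best = cost
    return best

# One substitution (a pure Hamming edge) and one single-base shift:
assert edge_weight("ACGTACGT", "ACGTACCT") == 1
assert edge_weight("ACGTACGT", "CGTACGTA") == 1
# Unrelated reads get no edge under these thresholds:
assert edge_weight("AAAAAAAA", "CCCCCCCC") is None
```

With such weights in hand, running a minimum-spanning-forest algorithm over the (very sparse) graph picks, for each read, the cheapest parent to diff against, which is the structure the compression experiments in the paper operate on.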


2010 ◽  
Vol 9 (9) ◽  
pp. 1300-1310 ◽  
Author(s):  
Minou Nowrousian

ABSTRACT Over the past 5 years, large-scale sequencing has been revolutionized by the development of several so-called next-generation sequencing (NGS) technologies. These have drastically increased the number of bases obtained per sequencing run while at the same time decreasing the costs per base. Compared to Sanger sequencing, NGS technologies yield shorter read lengths; however, despite this drawback, they have greatly facilitated genome sequencing, first for prokaryotic genomes and within the last year also for eukaryotic ones. This advance was possible due to a concomitant development of software that allows the de novo assembly of draft genomes from large numbers of short reads. In addition, NGS can be used for metagenomics studies as well as for the detection of sequence variations within individual genomes, e.g., single-nucleotide polymorphisms (SNPs), insertions/deletions (indels), or structural variants. Furthermore, NGS technologies have quickly been adopted for other high-throughput studies that were previously performed mostly by hybridization-based methods like microarrays. This includes the use of NGS for transcriptomics (RNA-seq) or the genome-wide analysis of DNA/protein interactions (ChIP-seq). This review provides an overview of NGS technologies that are currently available and the bioinformatics analyses that are necessary to obtain information from the flood of sequencing data as well as applications of NGS to address biological questions in eukaryotic microorganisms.


Author(s):  
Robin Herbrechter ◽  
Nadine Hube ◽  
Raoul Buchholz ◽  
Andreas Reiner

Abstract
Ionotropic glutamate receptors (iGluRs) play key roles in signaling in the central nervous system. Alternative splicing and RNA editing are well-known mechanisms to increase iGluR diversity and to provide context-dependent regulation. Earlier work on isoform identification has focused on the analysis of cloned transcripts, mostly from rodents. Here we set out to obtain a systematic overview of iGluR splicing and editing in human brain based on RNA-Seq data. Using data from two large-scale transcriptome studies, we established a workflow for the de novo identification and quantification of alternative splice and editing events. We detected all canonical iGluR splice junctions, assessed the abundance of alternative events described in the literature, and identified new splice events in AMPA, kainate, delta, and NMDA receptor subunits. Notable events include an abundant transcript encoding the GluA4 amino-terminal domain, GluA4-ATD, a novel C-terminal GluD1 (delta receptor 1) isoform, GluD1-b, and potentially new GluK4 and GluN2C isoforms. C-terminal GluN1 splicing may be controlled by inclusion of a cassette exon, which shows preference for one of the two acceptor sites in the last exon. Moreover, we identified alternative untranslated regions (UTRs) and species-specific differences in splicing. In contrast, editing in exonic iGluR regions appears to be mostly limited to ten previously described sites, two of which result in silent amino acid changes. Coupling of proximal editing/editing and editing/splice events occurs to a variable degree. Overall, this analysis provides the first inventory of alternative splicing and editing in human brain iGluRs and provides the impetus for further transcriptome-based and functional investigations.
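A generic sketch of one step in such a workflow (assumed for illustration; the authors' pipeline is not described at this level in the abstract): splice junctions are typically tallied from spliced RNA-seq alignments by reading the 'N' operations in SAM CIGAR strings, each of which marks an intron on the reference, and counting how often each (start, end) interval occurs quantifies junction usage.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def junctions(pos, cigar):
    """Yield (intron_start, intron_end) intervals from one spliced alignment.

    `pos` is the 0-based reference start of the alignment; each 'N' CIGAR
    operation is a reference skip, i.e. an intron between exonic blocks.
    """
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            yield (ref, ref + length)
        if op in "MDN=X":                # operations that consume the reference
            ref += length

# Tally junction usage over a few alignments (made-up coordinates):
alignments = [(100, "50M200N50M"), (100, "50M200N50M"), (40, "30M60N20M")]
usage = Counter(j for p, c in alignments for j in junctions(p, c))
assert usage[(150, 350)] == 2 and usage[(70, 130)] == 1
```

Comparing such junction counts against annotated exon boundaries is what separates canonical junctions from the novel splice events reported in the study.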


Nature ◽  
2020 ◽  
Vol 587 (7833) ◽  
pp. 246-251 ◽  
Author(s):  
Joel Armstrong ◽  
Glenn Hickey ◽  
Mark Diekhans ◽  
Ian T. Fiddes ◽  
Adam M. Novak ◽  
...  

Abstract
New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies1–3. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database4 increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies5 are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus6, a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.

