FastqCLS: a FASTQ Compressor for Long-read Sequencing via read reordering using a novel scoring model

Author(s):  
Dohyoen Lee ◽  
Giltae Song

Abstract. Motivation: Over the past decades, vast amounts of genome sequencing data have been produced, requiring an enormous level of storage capacity. The time and resources needed to store and transfer such data cause bottlenecks in genome sequencing analysis. To resolve this issue, various compression techniques have been proposed to reduce the size of original FASTQ raw sequencing data, but these remain suboptimal. Long-read sequencing has become dominant in genomics, whereas most existing compression methods focus on short-read sequencing only. Results: We designed a compression algorithm based on read reordering using a novel scoring model for reducing FASTQ file size with no information loss. We integrated all data processing steps into a software package called FastqCLS and provided it as a Docker image for ease of installation and execution. We compared our method with existing major FASTQ compression tools using benchmark datasets. We also included new long-read sequencing data in this validation. As a result, FastqCLS outperformed the existing tools in terms of compression ratio for storing long-read sequencing data. Availability and implementation: FastqCLS can be downloaded from https://github.com/krlucete/FastqCLS. Supplementary information: Supplementary data are available at Bioinformatics online.

Author(s):  
Givanna H Putri ◽  
Irena Koprinska ◽  
Thomas M Ashhurst ◽  
Nicholas J C King ◽  
Mark N Read

Abstract. Motivation: Many ‘automated gating’ algorithms now exist to cluster cytometry and single-cell sequencing data into discrete populations. Comparative algorithm evaluations on benchmark datasets rely either on a single performance metric, or a few metrics considered independently of one another. However, single metrics emphasize different aspects of clustering performance and do not rank clustering solutions in the same order. This underlies the lack of consensus between comparative studies regarding optimal clustering algorithms and undermines the translatability of results onto other non-benchmark datasets. Results: We propose the Pareto fronts framework as an integrative evaluation protocol, wherein individual metrics are instead leveraged as complementary perspectives. Judged superior are algorithms that provide the best trade-off between the multiple metrics considered simultaneously. This yields a more comprehensive and complete view of clustering performance. Moreover, by broadly and systematically sampling algorithm parameter values using the Latin Hypercube sampling method, our evaluation protocol minimizes (un)fortunate parameter value selections as confounding factors. Furthermore, it reveals how meticulously each algorithm must be tuned in order to obtain good results, vital knowledge for users with novel data. We exemplify the protocol by conducting a comparative study between three clustering algorithms (ChronoClust, FlowSOM and Phenograph) using four common performance metrics applied across four cytometry benchmark datasets. To our knowledge, this is the first time Pareto fronts have been used to evaluate the performance of clustering algorithms in any application domain. Availability and implementation: Implementation of our Pareto front methodology and all scripts and datasets to reproduce this article are available at https://github.com/ghar1821/ParetoBench. Supplementary information: Supplementary data are available at Bioinformatics online.
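
To make the idea of a Pareto front concrete, the following is a minimal sketch and not the authors' ParetoBench implementation. It assumes each clustering solution is summarized as a tuple of metric scores where higher is better; the function names `dominates` and `pareto_front` and the example scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if solution a is at least as good as b on every metric
    and strictly better on at least one (assumes higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the solutions that no other solution dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (metric_1, metric_2) scores for five clustering solutions.
scores = [(0.9, 0.5), (0.8, 0.8), (0.5, 0.9), (0.7, 0.7), (0.4, 0.4)]
front = pareto_front(scores)
print(front)
```

Solutions left on the front represent different trade-offs between the metrics; none of them can be improved on one metric without losing on another.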


Plants ◽  
2019 ◽  
Vol 8 (8) ◽  
pp. 270 ◽  
Author(s):  
Yun Gyeong Lee ◽  
Sang Chul Choi ◽  
Yuna Kang ◽  
Kyeong Min Kim ◽  
Chon-Sik Kang ◽  
...  

Whole genome sequencing (WGS) has become a crucial tool for understanding genome structure and genetic variation. MinION sequencing from Oxford Nanopore Technologies (ONT) is an excellent approach for performing WGS and has advantages over other Next-Generation Sequencing (NGS) platforms: it is relatively inexpensive and portable, has simple library preparation, can be monitored in real time, and has no theoretical limit on read length. Sorghum bicolor (L.) Moench is diploid (2n = 2x = 20) with a genome size of about 730 Mb, and its genome sequence is available in the Phytozome database. Therefore, sorghum can serve as a good reference. However, plant species have complex and large genomes when compared to animals or microorganisms. As a result, complete genome sequencing is difficult for plant species. MinION sequencing, which produces long reads, can be an excellent tool for overcoming the weak assembly of short reads generated by NGS, because it minimizes gaps and covers the repetitive sequences that appear in plant genomes. Here, we sequenced the genome of S. bicolor cv. BTx623 using the MinION platform and obtained 895,678 reads and 17.9 gigabytes (Gb) (ca. 25× coverage of reference) of long-read sequence data. Through de novo assembly with two different tools, a total of 6124 contigs (covering 45.9%) were generated with Canu, and a total of 2661 contigs (covering 50%) were generated with Minimap and Miniasm followed by Racon; the assembled contigs were then mapped against the sorghum reference genome. Our results provide an optimal series of long-read sequencing analyses for plant species using the MinION platform, and a clue for determining the total sequencing scale needed for optimal coverage for various genome sizes.
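
The coverage figure quoted above can be checked with simple arithmetic. This is a rough sketch that assumes the reported 17.9 Gb refers to sequenced bases and takes the genome size as 730 Mb; the variable names are illustrative only.

```python
# Assumption: "17.9 Gb" is treated as 17.9 billion sequenced bases.
total_bases = 17.9e9
genome_size = 730e6   # approximate S. bicolor genome size in bases
read_count = 895_678  # number of reads reported

# Depth of coverage = total sequenced bases / genome size.
coverage = total_bases / genome_size

# Mean read length implied by the same totals.
mean_read_length = total_bases / read_count

print(f"coverage: {coverage:.1f}x")
print(f"mean read length: {mean_read_length:.0f} bp")
```

Under these assumptions the depth works out to roughly 24.5×, consistent with the "ca. 25×" stated in the abstract.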


2017 ◽  
Author(s):  
Tslil Gabrieli ◽  
Hila Sharim ◽  
Yael Michaeli ◽  
Yuval Ebenstein

Abstract: Variations in the genetic code, from single point mutations to large structural or copy number alterations, influence susceptibility, onset, and progression of genetic diseases and tumor transformation. Next-generation sequencing analysis is unable to reliably capture aberrations larger than the typical sequencing read length of several hundred bases. Long-read, single-molecule sequencing methods such as SMRT and nanopore sequencing can address larger variations, but require costly whole genome analysis. Here we describe a method for isolation and enrichment of a large genomic region of interest for targeted analysis based on Cas9 excision of two sites flanking the target region and isolation of the excised DNA segment by pulsed field gel electrophoresis. The isolated target remains intact and is ideally suited for optical genome mapping and long-read sequencing at high coverage. In addition, analysis is performed directly on native genomic DNA that retains genetic and epigenetic composition without amplification bias. This method enables detection of mutations and structural variants as well as detailed analysis by generation of hybrid scaffolds composed of optical maps and sequencing data at a fraction of the cost of whole genome sequencing.


2018 ◽  
Author(s):  
Koen Van Den Berge ◽  
Katharina Hembach ◽  
Charlotte Soneson ◽  
Simone Tiberi ◽  
Lieven Clement ◽  
...  

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, from basic science studies to agricultural and clinical settings. In the past 10 years or so, much has been learned about the characteristics of RNA-seq datasets as well as the performance of the myriad methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.


2021 ◽  
Author(s):  
Ankita Narang ◽  
Paul Lacaze ◽  
Kathlyn Ronaldson ◽  
John McNeil ◽  
Mahesh Jayaram ◽  
...  

One of the concerns limiting the use of clozapine in schizophrenia treatment is the risk of rare but potentially fatal myocarditis. Our previous genome-wide association study and human leucocyte antigen analyses identified putative loci associated with clozapine-induced myocarditis. However, the contributions of DNA variation in cytochrome P450 genes, copy number variants and rare deleterious variants have not been investigated. We examined these unexplored classes of DNA variation using whole-genome sequencing data from 25 cases with clozapine-induced myocarditis and 25 demographically matched clozapine-tolerant control subjects. Based on rare variant gene-burden analysis, we identified 15 genes (MLLT6, CADPS, TACC2, L3MBTL4, NPY, SLC25A21, PARVB, GPR179, ACAD9, NOL8, C5orf33, FAM127A, AFDN, SLC6A11, PXDN) nominally associated (p<0.05) with clozapine-induced myocarditis. Of these genes, 13 were expressed in human myocardial tissue. Although independent replication of these findings is required, our study provides preliminary insights into the potential role of rare genetic variants in susceptibility to clozapine-induced myocarditis.


2019 ◽  
Vol 35 (22) ◽  
pp. 4809-4811 ◽  
Author(s):  
Robert S Harris ◽  
Monika Cechova ◽  
Kateryna D Makova

Abstract. Summary: Tandem DNA repeats can be sequenced with long-read technologies, but cannot be accurately deciphered due to the lack of computational tools taking high error rates of these technologies into account. Here we introduce Noise-Cancelling Repeat Finder (NCRF) to uncover putative tandem repeats of specified motifs in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. Using simulations, we validated the use of NCRF to locate tandem repeats with motifs of various lengths and demonstrated its superior performance as compared to two alternative tools. Using real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat involved in heat shock stress response. Availability and implementation: NCRF is implemented in C, supported by several python scripts, and is available in bioconda and at https://github.com/makovalab-psu/NoiseCancellingRepeatFinder. Supplementary information: Supplementary data are available at Bioinformatics online.
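
The need for error-aware repeat detection can be illustrated with a naive exact-match search. The sketch below is not NCRF and does not reflect its algorithm; it is a hypothetical baseline, with the helper name `exact_tandem_arrays` and the toy sequences chosen purely for illustration, showing how a single sequencing error splits one repeat array into two shorter ones.

```python
import re
from typing import List, Tuple


def exact_tandem_arrays(seq: str, motif: str, min_copies: int = 2) -> List[Tuple[int, int]]:
    """Find runs of at least min_copies exact, back-to-back copies of motif.

    Returns (start_position, copy_count) for each run. No mismatches or
    indels are tolerated, which is the limitation on noisy reads.
    """
    pattern = re.compile(r"(?:%s){%d,}" % (re.escape(motif), min_copies))
    return [(m.start(), (m.end() - m.start()) // len(motif)) for m in pattern.finditer(seq)]


# An error-free read: one array of four AATGG copies starting at position 2.
clean = "TT" + "AATGG" * 4 + "CC"
clean_hits = exact_tandem_arrays(clean, "AATGG")

# The same kind of array with one substitution (G -> C) in the middle copy.
noisy = "AATGG" * 2 + "AATCG" + "AATGG" * 2
noisy_hits = exact_tandem_arrays(noisy, "AATGG")

print(clean_hits)
print(noisy_hits)
```

With exact matching, the single substitution breaks a five-copy array into two two-copy fragments, which is the kind of failure an error-tolerant tool is designed to avoid.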


2017 ◽  
Author(s):  
Wouter De Coster ◽  
Svenn D’Hert ◽  
Darrin T. Schultz ◽  
Marc Cruts ◽  
Christine Van Broeckhoven

Abstract. Summary: Here we describe NanoPack, a set of tools developed for visualization and processing of long read sequencing data from Oxford Nanopore Technologies and Pacific Biosciences. Availability and Implementation: The NanoPack tools are written in Python3 and released under the GNU GPL3.0 Licence. The source code can be found at https://github.com/wdecoster/nanopack, together with links to separate scripts and their documentation. The scripts are compatible with Linux, Mac OS and the MS Windows 10 subsystem for Linux and are available as a graphical user interface, a web service at http://nanoplot.bioinf.be and command line tools. Contact: [email protected] Supplementary information: Supplementary tables and figures are available at Bioinformatics online.


2020 ◽  
Author(s):  
Zarina Mohd Zawawi ◽  
Jeyanthi Suppiah ◽  
Jeevanathan Kalyanasundram ◽  
Muhammad Afif Azizan ◽  
Shuhaila Mat-Sharani ◽  
...  

Abstract. Background: Since December 2019, the outbreak of COVID-19 has raised great public health concern globally. Here, we report the whole genome sequencing analysis of SARS-CoV-2 strains in Malaysia isolated from six patients diagnosed with COVID-19. Methods: SARS-CoV-2 viral RNA extracted from clinical specimens and isolates was subjected to whole genome sequencing using the NextSeq 500 platform. The sequencing data were assembled into full genome sequences using Megahit, and a phylogenetic tree was constructed using Mega X software. Results: Six full genome sequences of SARS-CoV-2, comprising strains from the 1st wave (25th January 2020) and 2nd wave (27th February 2020) of infection, were obtained. Downstream analysis demonstrated diversity among the Malaysian strains, with several synonymous and non-synonymous mutations in four of the six cases affecting the genes M, orf1ab, and S of the SARS-CoV-2 virus. The phylogenetic analysis revealed that the viral genome sequences of Malaysian SARS-CoV-2 strains clustered under the ancestral Type B. Conclusion: This study examined the evolution of the SARS-CoV-2 virus during its circulation in Malaysia. Continuous monitoring and analysis of the whole genome sequences of confirmed cases would be crucial to further understand the genetic evolution of the virus.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Readman Chiu ◽  
Indhu-Shree Rajan-Babu ◽  
Jan M. Friedman ◽  
Inanc Birol

Abstract: Tandem repeat (TR) expansion is the underlying cause of over 40 neurological disorders. Long-read sequencing offers an exciting avenue over conventional technologies for detecting TR expansions. Here, we present Straglr, a robust software tool for both targeted genotyping and novel expansion detection from long-read alignments. We benchmark Straglr using various simulations, targeted genotyping data of cell lines carrying expansions of known diseases, and whole genome sequencing data with chromosome-scale assembly. Our results suggest that Straglr may be useful for investigating disease-associated TR expansions using long-read sequencing.


2013 ◽  
Vol 59 (1) ◽  
pp. 127-137 ◽  
Author(s):  
Nardin Samuel ◽  
Thomas J Hudson

BACKGROUND Sequencing of cancer genomes has become a pivotal method for uncovering and understanding the deregulated cellular processes driving tumor initiation and progression. Whole-genome sequencing is evolving toward becoming less costly and more feasible on a large scale; consequently, thousands of tumors are being analyzed with these technologies. Interpreting these data in the context of tumor complexity poses a challenge for cancer genomics. CONTENT The sequencing of large numbers of tumors has revealed novel insights into oncogenic mechanisms. In particular, we highlight the remarkable insight into the pathogenesis of breast cancers that has been gained through comprehensive and integrated sequencing analysis. The analysis and interpretation of sequencing data, however, must be considered in the context of heterogeneity within and among tumor samples. Only by adequately accounting for the underlying complexity of cancer genomes will the potential of genome sequencing be understood and subsequently translated into improved management of patients. SUMMARY The paradigm of personalized medicine holds promise if patient tumors are thoroughly studied as unique and heterogeneous entities and clinical decisions are made accordingly. Associated challenges will be ameliorated by continued collaborative efforts among research centers that coordinate the sharing of mutation, intervention, and outcomes data to assist in the interpretation of genomic data and to support clinical decision-making.

