FQSqueezer: k-mer-based compression of sequencing data

2019
Author(s):
Sebastian Deorowicz

Abstract
Motivation: The amount of genomic data that needs to be stored is huge. Therefore it is not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect and the best tools produce quite large archives.
Results: We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on the ideas from the famous prediction by partial matching and dynamic Markov coder algorithms known from the general-purpose-compressors world. The compression ratios are often tens of percent better than offered by the state-of-the-art tools.
Availability and implementation: https://github.com/refresh-bio/[email protected]
Supplementary information: Supplementary data are available at publisher’s Web site.
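The abstract names prediction by partial matching only as an inspiration and does not describe FQSqueezer's actual model. As a rough illustration of the general idea behind k-mer-based prediction (guess the next base from counts of what followed the preceding k bases), here is a minimal sketch. The function name `estimated_bits`, the fixed order `k`, and the Laplace smoothing are assumptions made for this example; this is not FQSqueezer's algorithm.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def estimated_bits(reads, k=3):
    """Estimate the adaptive order-k coding cost (in bits) of a list of reads.

    Hypothetical helper for illustration only.
    """
    # counts[context][symbol] = how often symbol followed this context so far
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for read in reads:
        for i, symbol in enumerate(read):
            # Context is the (up to) k bases preceding position i
            context = read[max(0, i - k):i]
            seen = counts[context]
            # Laplace smoothing: every base starts with one pseudo-count
            p = (seen[symbol] + 1) / (sum(seen.values()) + len(ALPHABET))
            # An ideal entropy coder spends -log2(p) bits on this symbol
            total_bits += -math.log2(p)
            # Update the model only after the symbol has been "coded"
            seen[symbol] += 1
    return total_bits
```

On highly repetitive reads the estimate falls well below the 2 bits per base of a naive fixed-length encoding, which is the effect that context-based compressors exploit.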

2017
Author(s):
Vivian Link
Athanasios Kousathanas
Krishna Veeramah
Christian Sell
Amelie Scheu
...

Abstract
Summary: Post-mortem damage (PMD) obstructs the proper analysis of ancient DNA samples and can currently only be addressed by removing or down-weighting potentially damaged data. Here we present ATLAS, a suite of methods to accurately genotype and estimate genetic diversity from ancient samples, while accounting for PMD. It works directly from raw BAM files and enables the building of complete and customized pipelines for the analysis of ancient and other low-depth samples in a very user-friendly way. Based on simulations we show that, in the presence of PMD, a dedicated pipeline of ATLAS calls genotypes more accurately than the state-of-the-art pipeline of GATK combined with mapDamage 2.0.
Availability: ATLAS is an open-source C++ program freely available at https://bitbucket.org/phaentu/[email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.


2021
Vol 104 (2)
pp. 003685042110232
Author(s):
Muhammad Sardaraz
Muhammad Tahir

Recent advancements in sequencing methods have led to a significant increase in sequencing data. This increase poses research challenges in storage, transfer, and processing. Data compression techniques have been adopted to cope with the storage of these data, and there have been good achievements in compression ratio and execution time. This fast-paced advancement has raised major concerns about the security of data: confidentiality, integrity, and authenticity of data need to be ensured. This paper presents a novel lossless reference-free algorithm that combines data compression with encryption to achieve security in addition to other parameters. The proposed algorithm preprocesses the data before applying a general-purpose compression library. A genetic algorithm is used to encrypt the data. The technique is validated with experimental results on benchmark datasets, and a comparative analysis with state-of-the-art techniques is presented. The results show that the proposed method achieves better results than existing methods.


2018
Author(s):
Brent S. Pedersen
Aaron R. Quinlan

Abstract
Motivation: Extracting biological insight from genomic data inevitably requires custom software. In many cases, this is accomplished with scripting languages, owing to their accessibility and brevity. Unfortunately, the ease of scripting languages typically comes at a substantial performance cost that is especially acute with the scale of modern genomics datasets.
Results: We present hts-nim, a high-performance library written in the Nim programming language that provides a simple, scripting-like syntax without sacrificing performance.
Availability: hts-nim is available at https://github.com/brentp/hts-nim and the example tools are at https://github.com/brentp/hts-nim-tools both under the MIT [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.


2019
Vol 36 (7)
pp. 2275-2277
Author(s):
Jan Voges
Tom Paridaens
Fabian Müntefering
Liudmila S Mainzer
Brian Bliss
...

Abstract
Motivation: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data.
Results: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM.
Availability and implementation: The GABAC library is written in C++. We also provide a command line application which exercises all features provided by the library. GABAC can be downloaded from https://github.com/mitogen/gabac.
Supplementary information: Supplementary data are available at Bioinformatics online.


2018
Vol 35 (16)
pp. 2880-2881
Author(s):
Dries Vaneechoutte
Klaas Vandepoele

Abstract
Summary: Public RNA-Sequencing (RNA-Seq) datasets are a valuable resource for transcriptome analyses, but their accessibility is hindered by the imperfect quality and presentation of their metadata and by the complexity of processing raw sequencing data. The Curse suite was created to alleviate these problems. It consists of an online curation tool named Curse to efficiently build compendia of experiments hosted on the Sequence Read Archive, and a lightweight pipeline named Prose to download and process the RNA-Seq data into expression atlases and co-expression networks. Curse networks showed improved linking of functionally related genes compared to the state-of-the-art.
Availability and implementation: Curse, Prose and their manuals are available at http://bioinformatics.psb.ugent.be/webtools/Curse/. Prose was implemented in Java.
Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):
Matteo Chiara
Federico Zambelli
Marco Antonio Tangaro
Pietro Mandreoli
David S Horner
...

Abstract
Summary: While over 200 000 genomic sequences are currently available through dedicated repositories, ad hoc methods for the functional annotation of SARS-CoV-2 genomes do not harness all currently available resources for the annotation of functionally relevant genomic sites. Here, we present CorGAT, a novel tool for the functional annotation of SARS-CoV-2 genomic variants. By comparison with other state-of-the-art methods we demonstrate that, by providing a more comprehensive and richer annotation, our method can facilitate the identification of evolutionary patterns in the genome of SARS-CoV-2.
Availability and implementation: Galaxy: http://corgat.cloud.ba.infn.it/galaxy; software: https://github.com/matteo14c/CorGAT/tree/Revision_V1; docker: https://hub.docker.com/r/laniakeacloud/galaxy_corgat.
Supplementary information: Supplementary data are available at Bioinformatics online.


2018
Vol 35 (16)
pp. 2843-2846
Author(s):
Hung Nguyen
Sangam Shrestha
Sorin Draghici
Tin Nguyen

Abstract
Summary: Since cancer is a heterogeneous disease, tumor subtyping is crucial for improved treatment and prognosis. We have developed a subtype discovery tool, called PINSPlus, that is: (i) robust against noise and unstable quantitative assays, (ii) able to integrate multiple types of omics data in a single analysis and (iii) dramatically superior to established approaches in identifying known subtypes and novel subgroups with significant survival differences. Our validation on 12,158 samples from 44 datasets shows that PINSPlus vastly outperforms other approaches. The software is easy-to-use and can partition hundreds of patients in a few minutes on a personal computer.
Availability and implementation: The package is available at https://cran.r-project.org/package=PINSPlus. Data and R script used in this manuscript are available at https://bioinformatics.cse.unr.edu/software/PINSPlus/.
Supplementary information: Supplementary data are available at Bioinformatics online.


2020
Vol 36 (11)
pp. 3516-3521
Author(s):
Lixiang Zhang
Lin Lin
Jia Li

Abstract
Motivation: Cluster analysis is widely used to identify interesting subgroups in biomedical data. Since true class labels are unknown in the unsupervised setting, it is challenging to validate any cluster obtained computationally, an important problem barely addressed by the research community.
Results: We have developed a toolkit called covering point set (CPS) analysis to quantify uncertainty at the levels of individual clusters and overall partitions. Functions have been developed to effectively visualize the inherent variation in any cluster for data of high dimension, and to provide a more comprehensive view of potentially interesting subgroups in the data. Applying it to three usage scenarios for biomedical data, we demonstrate that CPS analysis is more effective for evaluating uncertainty of clusters compared to state-of-the-art measurements. We also showcase how to use CPS analysis to select data generation technologies or visualization methods.
Availability and implementation: The method is implemented in an R package called OTclust, available on CRAN.
Contact: [email protected] or [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.


2019
Vol 36 (7)
pp. 2291-2292
Author(s):
Saskia Freytag
Ryan Lister

Abstract
Summary: Due to the scale and sparsity of single-cell RNA-sequencing data, traditional plots can obscure vital information. Our R package schex overcomes this by implementing hexagonal binning, which has the additional advantages of improving speed and reducing storage for resulting plots.
Availability and implementation: schex is freely available from Bioconductor via http://bioconductor.org/packages/release/bioc/html/schex.html and its development version can be accessed on GitHub via https://github.com/SaskiaFreytag/schex.
Supplementary information: Supplementary data are available at Bioinformatics online.
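The abstract does not describe how schex implements hexagonal binning. As a generic illustration of the technique (assign each 2D point to a hexagon and plot one summary per hexagon instead of one glyph per cell), here is a minimal Python sketch. The names `hex_bin` and `_hex_round`, the pointy-top grid layout, and the `size` parameter are assumptions made for this example; this is not schex's code.

```python
import math
from collections import Counter

def _hex_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon."""
    # Cube coordinates (a, b, c) always satisfy a + b + c = 0
    a, c = q, r
    b = -a - c
    ra, rb, rc = round(a), round(b), round(c)
    da, db, dc = abs(ra - a), abs(rb - b), abs(rc - c)
    # Recompute the coordinate with the largest rounding error
    # so that the constraint a + b + c = 0 holds again
    if da > db and da > dc:
        ra = -rb - rc
    elif db > dc:
        rb = -ra - rc
    else:
        rc = -ra - rb
    return (ra, rc)

def hex_bin(points, size=1.0):
    """Count 2D points per hexagon; keys are axial (q, r) hexagon indices."""
    bins = Counter()
    for x, y in points:
        # Convert Cartesian coordinates to fractional axial coordinates
        q = (math.sqrt(3) / 3 * x - y / 3) / size
        r = (2 / 3 * y) / size
        bins[_hex_round(q, r)] += 1
    return bins
```

The resulting counter has at most one entry per occupied hexagon, so a plot built from it stores far fewer objects than a scatter plot of every cell, which is consistent with the speed and storage advantages the summary mentions.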


2019
Vol 35 (18)
pp. 3527-3529
Author(s):
David Aparício
Pedro Ribeiro
Tijana Milenković
Fernando Silva

Abstract
Motivation: Network alignment (NA) finds conserved regions between two networks. NA methods optimize node conservation (NC) and edge conservation. Dynamic graphlet degree vectors are a state-of-the-art dynamic NC measure, used within the fastest and most accurate NA method for temporal networks: DynaWAVE. Here, we use graphlet-orbit transitions (GoTs), a different graphlet-based measure of temporal node similarity, as a new dynamic NC measure within DynaWAVE, resulting in GoT-WAVE.
Results: On synthetic networks, GoT-WAVE improves DynaWAVE’s accuracy by 30% and speed by 64%. On real networks, when optimizing only dynamic NC, the methods are complementary. Furthermore, only GoT-WAVE supports directed edges. Hence, GoT-WAVE is a promising new temporal NA algorithm, which efficiently optimizes dynamic NC. We provide a user-friendly user interface and source code for GoT-WAVE.
Availability and implementation: http://www.dcc.fc.up.pt/got-wave/
Supplementary information: Supplementary data are available at Bioinformatics online.

