ChIPWig: A Random Access-Enabling Lossless and Lossy Compression Method for ChIP-seq Data

2017 ◽  
Author(s):  
Vida Ravanmehr ◽  
Minji Kim ◽  
Zhiying Wang ◽  
Olgica Milenković

Abstract Motivation The past decade has witnessed a rapid development of data acquisition technologies that enable integrative genomic and proteomic analysis. One such technology is chromatin immunoprecipitation sequencing (ChIP-seq), developed for analyzing interactions between proteins and DNA via next-generation sequencing technologies. As ChIP-seq experiments are inexpensive and time-efficient, massive datasets from this domain have been acquired, introducing significant storage and maintenance challenges. To address the resulting Big Data problems, we propose a state-of-the-art lossless and lossy compression framework specifically designed for ChIP-seq Wig data, termed ChIPWig. Wig is a standard file format, which in this setting contains relevant read density information crucial for visualization and downstream processing. ChIPWig may be executed in two different modes: lossless and lossy. Lossless ChIPWig compression allows for random access and fast queries in the file through careful variable-length block-wise encoding. ChIPWig also stores the summary statistics of each block needed for guided access. Lossy ChIPWig, in contrast, performs quantization of the read density values before feeding them into the lossless ChIPWig compressor. Nonuniform lossy quantization leads to further reductions in the file size, while maintaining the same accuracy of the ChIP-seq peak calling and motif discovery pipeline based on the NarrowPeaks method tailor-made for Wig files. The compressors are designed using new statistical modeling approaches coupled with delta and arithmetic encoding. Results We tested the ChIPWig compressor on a number of ChIP-seq datasets generated by the ENCODE project. Lossless ChIPWig reduces the file sizes to merely 6% of the original, and offers an average 6-fold compression rate improvement compared to bigWig. The running times for compression and decompression are comparable to those of bigWig. 
Compression and decompression speeds are on the order of 0.2 MB/sec on general-purpose computers. ChIPWig with random access only slightly degrades the performance and running time when compared to the standard mode. In the lossy mode, the average file sizes are reduced by a further 2-fold compared to the lossless mode. Most importantly, near-optimal nonuniform quantization with respect to mean-square distortion does not affect peak calling and motif discovery results on the data tested. Availability and Implementation Source code and binaries are freely available for download at https://github.com/vidarmehr/ Contact: [email protected] Supplementary information Is available on bioRxiv.
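The delta-encoding step named above (the transform feeding the arithmetic coder) can be sketched in a few lines of Python. This is an illustrative sketch only, not ChIPWig's implementation; the function names and sample data are invented:

```python
def delta_encode(values):
    """Store the first value, then successive differences.
    Consecutive Wig read densities are highly correlated, so the
    differences are small and compress well under entropy coding."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out

def delta_decode(deltas):
    """Invert delta_encode by cumulative summation."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# Toy read-density track: small deltas dominate the encoded stream.
densities = [10, 12, 13, 13, 11, 11, 15]
deltas = delta_encode(densities)
assert delta_decode(deltas) == densities
```

Block-wise application of this transform is what preserves random access: each block's deltas can be decoded without reading the preceding blocks.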

2018 ◽  
Vol 35 (15) ◽  
pp. 2674-2676 ◽  
Author(s):  
Shubham Chandak ◽  
Kedar Tatwawadi ◽  
Idoia Ochoa ◽  
Mikel Hernaez ◽  
Tsachy Weissman

Abstract Motivation High-Throughput Sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to fully exploit the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack support for one or more crucial properties, such as support for variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression. Results In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina’s NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. Availability and implementation SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. Supplementary information Supplementary data are available at Bioinformatics online.
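The three streams named above (reads, quality values, identifiers) are what specialized FASTQ compressors model separately before applying stream-specific coding. A minimal, hypothetical Python sketch of that record-splitting step (not SPRING's code):

```python
def split_fastq_streams(lines):
    """Separate FASTQ records into the three streams a specialized
    compressor models independently: read identifiers, genomic reads
    and per-base quality values."""
    ids, reads, quals = [], [], []
    for i in range(0, len(lines), 4):
        ids.append(lines[i])        # @identifier line
        reads.append(lines[i + 1])  # nucleotide sequence
        quals.append(lines[i + 3])  # quality string (line i + 2 is '+')
    return ids, reads, quals

records = ["@r1", "ACGT", "+", "IIII",
           "@r2", "TTGA", "+", "FFFF"]
ids, reads, quals = split_fastq_streams(records)
```

Each stream has very different statistics (near-random identifiers, 4-letter reads, bursty quality values), which is why modeling them separately beats a general-purpose compressor applied to the interleaved file.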


Author(s):  
Xin Bai ◽  
Jie Ren ◽  
Yingying Fan ◽  
Fengzhu Sun

Abstract Motivation The rapid development of sequencing technologies has enabled us to generate a large number of metagenomic reads from genetic materials in microbial communities, making it possible to gain deep insights into the differences between the genetic materials of different groups of microorganisms, such as bacteria, viruses and plasmids. Computational methods based on k-mer frequencies have been shown to be highly effective for classifying metagenomic sequencing reads into different groups. However, such methods usually use all k-mers as features for prediction without selecting the k-mers relevant to the different groups of sequences, i.e. unique nucleotide patterns carrying biological significance. Results To select k-mers for distinguishing different groups of sequences with guaranteed false discovery rate (FDR) control, we develop KIMI, a general framework for sequence motif discovery based on model-X knockoffs, a state-of-the-art statistical method for FDR control, with an arbitrary target FDR level, such that reproducibility can be theoretically guaranteed. KIMI is shown through simulation studies to be effective in simultaneously controlling FDR and yielding high power, outperforming the broadly used Benjamini–Hochberg procedure and the q-value method for FDR control. To illustrate the usefulness of KIMI in analyzing real datasets, we take the viral motif discovery problem as an example and apply KIMI to a real dataset consisting of viral and bacterial contigs. We show that the accuracy of predicting viral and bacterial contigs can be increased by training the prediction model only on the relevant k-mers selected by KIMI. Availability and implementation Our implementation of KIMI is available at https://github.com/xinbaiusc/KIMI. Supplementary information Supplementary data are available at Bioinformatics online.
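As a toy illustration of the k-mer frequency features the passage refers to, the helper below counts overlapping k-mers in a sequence; a selection method such as KIMI then decides which of these features to keep. (The helper is hypothetical, not part of KIMI.)

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all overlapping k-mers in a nucleotide sequence.
    These counts form the feature vector used for classification."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# "ACGTACGT" contains six overlapping 3-mers; ACG and CGT repeat.
counts = kmer_counts("ACGTACGT", 3)
```

For k = 3 there are at most 64 features, but feature counts grow as 4^k, which is why selecting only the relevant k-mers with FDR control matters for larger k.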


2019 ◽  
Vol 35 (18) ◽  
pp. 3484-3486 ◽  
Author(s):  
Tao Jiang ◽  
Bo Liu ◽  
Junyi Li ◽  
Yadong Wang

Abstract Summary Mobile element insertion (MEI) is a major category of structural variation (SV). The rapid development of long-read sequencing technologies provides the opportunity to detect MEIs sensitively. However, the MEI signals implied by noisy long reads are highly complex, owing to the repetitiveness of mobile elements as well as the high sequencing error rates. Herein, we propose the Realignment-based Mobile Element insertion detection Tool for Long read (rMETL). Benchmarking results on simulated and real datasets demonstrate that rMETL is able to handle these complex signals and discover MEIs sensitively. It is well suited to producing high-quality MEI callsets in many genomics studies. Availability and implementation rMETL is available from https://github.com/hitbc/rMETL. Supplementary information Supplementary data are available at Bioinformatics online.
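A hedged sketch of the kind of signal clustering such a pipeline performs: long insertion signals extracted from individual read alignments are merged into candidate sites, since nearby signals from different noisy long reads typically support the same event. All thresholds and data below are invented for illustration; this is not rMETL's actual realignment logic:

```python
def candidate_mei_sites(insertions, min_len=50, merge_dist=100):
    """Cluster long insertion signals into candidate mobile element
    insertion (MEI) sites. Each signal is a (reference_position,
    insertion_length) pair taken from one read's alignment."""
    clusters = []
    for pos, length in sorted(insertions):
        if length < min_len:               # too short for a mobile element
            continue
        if clusters and pos - clusters[-1][-1] <= merge_dist:
            clusters[-1].append(pos)       # supports the previous site
        else:
            clusters.append([pos])         # opens a new candidate site
    # Report (start, end, supporting-signal count) per site.
    return [(c[0], c[-1], len(c)) for c in clusters]

# Two reads support a site near position 100; one short signal is noise.
sites = candidate_mei_sites([(100, 300), (150, 280), (120, 10), (5000, 60)])
```

In a realignment-based approach, the clustered candidate sequences would then be realigned against a mobile element consensus library to classify the element family.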


Author(s):  
Michael Menzel ◽  
Sabine Hurka ◽  
Stefan Glasenhardt ◽  
Andreas Gogol-Döring

Abstract Motivation The discovery of sequence motifs mediating DNA-protein binding usually implies the determination of binding sites using high-throughput sequencing and peak calling. The determination of peaks, however, depends strongly on data quality and is susceptible to noise. Results Here, we present a novel approach to reliably identify transcription factor-binding motifs from ChIP-Seq data without peak detection. By evaluating the distributions of sequencing reads around the different k-mers in the genome, we are able to identify binding motifs in ChIP-Seq data that yield no results in traditional pipelines. Availability and implementation NoPeak is published under the GNU General Public License and available as a standalone console-based Java application at https://github.com/menzel/nopeak. Supplementary information Supplementary data are available at Bioinformatics online.
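The core idea, evaluating the distribution of sequencing reads around each k-mer rather than calling peaks, can be illustrated with a small sketch. The function and its parameters are hypothetical, not NoPeak's implementation:

```python
from collections import Counter

def read_profile(genome, kmer, read_starts, window=5):
    """Aggregate read-start offsets around every occurrence of a k-mer.
    A k-mer that mediates binding shows a characteristic enrichment
    profile around its occurrences; background k-mers show a flat one."""
    profile = Counter()
    pos = genome.find(kmer)
    while pos != -1:
        for start in read_starts:
            offset = start - pos
            if -window <= offset <= window:
                profile[offset] += 1
        pos = genome.find(kmer, pos + 1)
    return profile

# "ACGT" occurs at positions 2 and 8; reads pile up at offset 0.
profile = read_profile("AAACGTAAACGTAA", "ACGT", read_starts=[2, 3, 8])
```

Scoring the shape of this profile for every k-mer is what lets the method recover motifs from data too noisy for peak callers.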


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Minhyeok Cho ◽  
Albert No

Abstract Background Advances in sequencing technology have drastically reduced sequencing costs. As a result, the amount of sequencing data is increasing explosively. Since FASTQ files (the standard sequencing data format) are huge, there is a need for efficient compression of FASTQ files, especially of the quality scores. Several quality score compression algorithms have recently been proposed, mainly focused on lossy compression to further boost the compression rate. However, for clinical applications and archiving purposes, lossy compression cannot replace lossless compression. One of the main challenges for lossless compression is time complexity: it can take thousands of seconds to compress a 1 GB file. There are also other desirable features for compression algorithms, such as random access. Therefore, there is a need for a fast lossless compressor with a reasonable compression rate and random access functionality. Results This paper proposes a Fast and Concurrent Lossless Quality scores Compressor (FCLQC) that supports random access and achieves a lower running time based on concurrent programming. Experimental results reveal that FCLQC is significantly faster than the baseline compressors on compression and decompression, at the expense of compression ratio. Compared to LCQS (the baseline quality score compression algorithm), FCLQC shows at least a 31x compression speed improvement in all settings, with a degradation in compression ratio of up to 13.58% (8.26% on average). Compared to general-purpose compressors (such as 7-zip), FCLQC shows 3x faster compression speed while having better compression ratios, by at least 2.08% (4.69% on average). Moreover, the speed of random access decompression also outperforms the others. The concurrency of FCLQC is implemented using Rust; the performance gain increases near-linearly with the number of threads. 
Conclusion The superiority of compression and decompression speed makes FCLQC a practical lossless quality score compressor candidate for speed-sensitive applications of DNA sequencing data. FCLQC is available at https://github.com/Minhyeok01/FCLQC and is freely available for non-commercial usage.
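FCLQC itself is implemented in Rust; the block-independent design that enables both concurrency and random access can nonetheless be sketched in Python. The block size, helper names and use of zlib below are illustrative assumptions, not FCLQC's actual codec:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 1 << 16  # 64 KiB blocks (illustrative choice)

def compress_blocks(data, workers=4):
    """Compress fixed-size blocks independently and concurrently.
    Independent blocks are what enable random access: any block can
    be decompressed without touching the rest of the file."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, blocks))

def read_block(compressed, index):
    """Random access: decompress only the requested block."""
    return zlib.decompress(compressed[index])

quality_data = b"IIFFIIHH" * 50_000   # ~400 KB of mock quality scores
compressed = compress_blocks(quality_data)
```

Because each block is self-contained, the thread pool scales compression with the number of workers, mirroring the near-linear gain the paper reports for its thread count.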


2017 ◽  
Vol 34 (6) ◽  
pp. 911-919 ◽  
Author(s):  
Vida Ravanmehr ◽  
Minji Kim ◽  
Zhiying Wang ◽  
Olgica Milenković

Author(s):  
Felix Hanau ◽  
Hannes Röst ◽  
Idoia Ochoa

Abstract Motivation Mass spectrometry data, used for proteomics and metabolomics analyses, have seen considerable growth in recent years. Aiming at reducing the associated storage costs, dedicated compression algorithms for Mass Spectrometry (MS) data have been proposed, such as MassComp and MSNumpress. However, these algorithms focus on either lossless or lossy compression, respectively, and do not exploit the additional redundancy existing across scans contained in a single file. We introduce mspack, a compression algorithm for MS data that exploits this additional redundancy and that supports both lossless and lossy compression, as well as the mzML and the legacy mzXML formats. mspack applies several lossless preprocessing transforms and optional lossy transforms with a configurable error, followed by the general-purpose compressors gzip or bsc to achieve a higher compression ratio. Results We tested mspack on several datasets generated by commonly used mass spectrometry instruments. When used with the bsc compression backend, mspack achieves on average 76% smaller file sizes for lossless compression and 94% smaller file sizes for lossy compression, as compared to the original files. Lossless mspack achieves 10–60% smaller file sizes than MassComp, and lossy mspack compresses 36–60% better than the lossy MSNumpress for the same error, while exhibiting comparable accuracy and running time. Availability mspack is implemented in C++ and freely available at https://github.com/fhanau/mspack under the Apache license. Supplementary information Supplementary data are available at Bioinformatics online.
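A minimal sketch of a lossy transform with a configurable error bound, the property the passage attributes to mspack's optional lossy mode. The quantizer below is a generic illustration, not mspack's actual transform:

```python
def quantize(values, max_error):
    """Lossy transform with a configurable error bound: map each value
    to an integer grid of spacing 2 * max_error, so the reconstruction
    differs from the original by at most max_error. The small integers
    then compress far better than raw floating-point values."""
    step = 2 * max_error
    return [round(v / step) for v in values]

def dequantize(codes, max_error):
    """Reconstruct approximate values from the integer codes."""
    step = 2 * max_error
    return [c * step for c in codes]

# Mock m/z values quantized with a 0.01 absolute error bound.
vals = [100.013, 100.027, 250.991]
codes = quantize(vals, max_error=0.01)
recon = dequantize(codes, max_error=0.01)
assert all(abs(a - b) <= 0.01 for a, b in zip(vals, recon))
```

Note that the two mock m/z values 0.014 apart collapse to the same code, which is exactly the cross-value redundancy a downstream general-purpose compressor exploits.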


2020 ◽  
Vol 36 (12) ◽  
pp. 3669-3679 ◽  
Author(s):  
Can Firtina ◽  
Jeremie S Kim ◽  
Mohammed Alser ◽  
Damla Senol Cali ◽  
A Ercument Cicek ◽  
...  

Abstract Motivation Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject’s genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. These technology and assembly-size dependencies require researchers to (i) run multiple polishing algorithms in order to use all available read sets and (ii) split a large genome into small chunks in order to polish it, respectively. Results We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward–Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. 
Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. Availability and implementation Source code is available at https://github.com/CMU-SAFARI/Apollo. Supplementary information Supplementary data are available at Bioinformatics online.
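The Viterbi decoding step in (iii) can be illustrated on a toy two-state HMM. This is a generic textbook sketch, not Apollo's profile HMM; the states, probabilities and observations below are invented:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path, computed in log space to avoid
    numerical underflow on long sequences."""
    # V[t][s] = (best log-probability ending in s at step t, predecessor)
    V = [{s: (math.log(start_p[s] * emit_p[s][obs[0]]), None) for s in states}]
    for o in obs[1:]:
        prev, row = V[-1], {}
        for s in states:
            p = max(states, key=lambda q: prev[q][0] + math.log(trans_p[q][s]))
            row[s] = (prev[p][0] + math.log(trans_p[p][s] * emit_p[s][o]), p)
        V.append(row)
    state = max(states, key=lambda s: V[-1][s][0])   # best final state
    path = [state]
    for row in reversed(V[1:]):                      # follow back-pointers
        state = row[state][1]
        path.append(state)
    return path[::-1]

# Toy model: "H" (correct assembly base) vs "E" (error), observing
# agreement ("A") or mismatch ("X") between reads and the assembly.
states = ("H", "E")
start = {"H": 0.9, "E": 0.1}
trans = {"H": {"H": 0.9, "E": 0.1}, "E": {"H": 0.8, "E": 0.2}}
emit = {"H": {"A": 0.97, "X": 0.03}, "E": {"A": 0.2, "X": 0.8}}
path = viterbi(["A", "X", "A"], states, start, trans, emit)
```

In a polishing setting, the decoded path through a trained pHMM corresponds to the corrected sequence emitted for each assembly position.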


2020 ◽  
Vol 9 (1) ◽  
Author(s):  
Nathan Tessema Ersumo ◽  
Cem Yalcin ◽  
Nick Antipa ◽  
Nicolas Pégard ◽  
Laura Waller ◽  
...  

Abstract Dynamic axial focusing functionality has recently experienced widespread incorporation in microscopy, augmented/virtual reality (AR/VR), adaptive optics and material processing. However, the limitations of existing varifocal tools continue to beset the performance capabilities and operating overhead of the optical systems that mobilize such functionality. The varifocal tools that are the least burdensome to operate (e.g. liquid crystal, elastomeric or optofluidic lenses) suffer from low (≈100 Hz) refresh rates. Conversely, the fastest devices sacrifice either critical capabilities such as their dwelling capacity (e.g. acoustic gradient lenses or monolithic micromechanical mirrors) or low operating overhead (e.g. deformable mirrors). Here, we present a general-purpose random-access axial focusing device that bridges these previously conflicting features of high speed, dwelling capacity and lightweight drive by employing low-rigidity micromirrors that exploit the robustness of defocusing phase profiles. Geometrically, the device consists of an 8.2 mm diameter array of piston-motion and 48-μm-pitch micromirror pixels that provide 2π phase shifting for wavelengths shorter than 1100 nm with 10–90% settling in 64.8 μs (i.e., 15.44 kHz refresh rate). The pixels are electrically partitioned into 32 rings for a driving scheme that enables phase-wrapped operation with circular symmetry and requires <30 V per channel. Optical experiments demonstrated the array’s wide focusing range with a measured ability to target 29 distinct resolvable depth planes. Overall, the features of the proposed array offer the potential for compact, straightforward methods of tackling bottlenecked applications, including high-throughput single-cell targeting in neurobiology and the delivery of dense 3D visual information in AR/VR.
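The role of phase wrapping in the driving scheme can be illustrated with a short calculation: an ideal parabolic defocus profile folded modulo 2π needs only one wavelength of mirror stroke, which is why piston micromirrors with 2π of phase shift can emulate a much deeper lens. The sketch below is a generic optics illustration, not the authors' driving model:

```python
import math

def wrapped_defocus_phase(radii_mm, focal_len_mm, wavelength_nm):
    """Parabolic defocus phase phi(r) = pi * r^2 / (lambda * f),
    folded modulo 2*pi into a phase-wrapped (Fresnel-lens-like)
    profile. Radii and focal length in mm, wavelength in nm."""
    lam_mm = wavelength_nm * 1e-6
    return [(math.pi * r * r / (lam_mm * focal_len_mm)) % (2 * math.pi)
            for r in radii_mm]

# A 1000 nm beam focused at f = 1 m: the unwrapped phase at r = 1 mm
# is exactly pi, so wrapping leaves it unchanged; at larger radii the
# profile repeatedly folds back into [0, 2*pi).
phases = wrapped_defocus_phase([0.0, 1.0], 1000.0, 1000.0)
```

Grouping mirror pixels into concentric rings, as the device does, works because this wrapped defocus profile is circularly symmetric: every pixel in a ring needs the same phase.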


2012 ◽  
Vol 268-270 ◽  
pp. 1416-1421
Author(s):  
Yu Hui Zhang ◽  
Li Wen Guan ◽  
Li Ping Wang ◽  
Yong Zhi Hua

The forward kinematics analysis of parallel manipulators is a difficult problem that has been studied by many researchers in recent years. In this paper, to address this problem, a new computing method with higher calculation accuracy, good operational stability and faster speed is presented. First, the mathematical model of the direct kinematics of the Stewart platform is established, which consists of a system of nonlinear equations. Second, with the rapid development of artificial intelligence technology, Memetic algorithms (MA) are increasingly applied to solve systems of nonlinear equations, replacing traditional algorithms. MA is a class of meta-heuristic algorithms that combine genetic algorithms (GA) with a local search at the end of each iteration. Finally, the validity of the algorithm is verified by simulated iteration runs. The numerical simulation shows that MA reliably and rapidly finds the global optimum solution and greatly improves the convergence rate. MA can therefore be widely used as a general-purpose algorithm for solving the forward kinematics of parallel mechanisms.
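The memetic scheme described above, GA-style crossover followed by a greedy local search on each offspring, can be sketched on a toy nonlinear system standing in for the Stewart platform kinematics equations. All parameters, and the system itself, are illustrative:

```python
import random

def residual(v):
    """Squared residual of a toy nonlinear system (a stand-in for the
    Stewart platform equations): x^2 + y^2 = 4, x - y = 0."""
    x, y = v
    return (x * x + y * y - 4) ** 2 + (x - y) ** 2

def local_search(v, step=0.05, iters=200):
    """Greedy refinement of one candidate (the 'memetic' part)."""
    best = list(v)
    for _ in range(iters):
        cand = [c + random.uniform(-step, step) for c in best]
        if residual(cand) < residual(best):
            best = cand
    return best

def memetic_solve(pop_size=30, generations=40, seed=0):
    random.seed(seed)
    pop = [[random.uniform(-3, 3) for _ in range(2)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=residual)                 # elitist selection
        parents = pop[:pop_size // 2]
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)   # blend crossover + mutation
            child = [(x + y) / 2 + random.gauss(0, 0.1) for x, y in zip(a, b)]
            children.append(local_search(child))
        pop = parents + children
    return min(pop, key=residual)

sol = memetic_solve()  # converges near (sqrt(2), sqrt(2)) or its negative
```

The local search at the end of each generation is what distinguishes MA from plain GA: crossover explores globally while the greedy step polishes each offspring, which is the combination the paper credits for the improved convergence rate.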

