GPress: a framework for querying General Feature Format (GFF) files and feature expression files in a compressed form

2019
Author(s): Qingxi Meng, Idoia Ochoa, Mikel Hernaez

Abstract. Motivation: Sequencing data are often summarized at different annotation levels for further analysis. The general feature format (GFF) and its descendants, the gene transfer format (GTF) and GFF3, are the most commonly used data formats for genomic annotations. These files are extensively updated, queried and shared, and hence as the number of generated GFF files increases, efficient data storage and retrieval are becoming increasingly important. Existing GFF utilities for accessing these files, like gffutils and gffread, do not focus on reducing the storage space, significantly increasing it in some cases. Hence, we propose GPress, a framework for querying GFF files in a compressed form. In addition, GPress can also incorporate and compress feature expression files, supporting simultaneous queries on both files. Results: We tested GPress on several GFF files of different organisms, and showed that it achieves on average a 98% reduction in size, while being able to retrieve all annotations for a given identifier or a range of coordinates in a few seconds. For example, on a Human GFF file, GPress can find all items with a unique identifier in 2.47 seconds and all items with coordinates within the range of 1,000 to 100,000 in 4.61 seconds. In contrast, gffutils provides faster retrieval but doubles the size of the GFF files. When additionally linking an expression file, we show that GPress can reduce the size of the expression file by more than 92%, while still retrieving the information within seconds. GPress is freely available at https://github.com/qm2/gpress.

2020, Vol 36 (18), pp. 4810-4812
Author(s): Qingxi Meng, Idoia Ochoa, Mikel Hernaez

Abstract. Motivation: Sequencing data are often summarized at different annotation levels for further analysis, generally using the general feature format (GFF) or its descendants, gene transfer format (GTF) and GFF3. Existing utilities for accessing these files, like gffutils and gffread, do not focus on reducing the storage space, significantly increasing it in some cases. We propose GPress, a framework for querying GFF files in a compressed form. GPress can also incorporate and compress expression files from both bulk and single-cell RNA-Seq experiments, supporting simultaneous queries on both the GFF and expression files. In brief, GPress applies transformations to the data, which are then compressed with the general lossless compressor BSC. To support queries, GPress compresses the data in blocks and creates several index tables for fast retrieval. Results: We tested GPress on several GFF files of different organisms, and showed that it achieves on average a 61% reduction in size with respect to gzip (the current de facto compressor for GFF files) while being able to retrieve all annotations for a given identifier or a range of coordinates in a few seconds (when run on a common laptop). In contrast, gffutils provides faster retrieval but doubles the size of the GFF files. When additionally linking an expression file, we show that GPress can reduce its size by more than 68% when compared to gzip (for both bulk and single-cell RNA-Seq experiments), while still retrieving the information within seconds. Finally, applying BSC to the data streams generated by GPress instead of to the original file shows a size reduction of more than 44% on average. Availability and implementation: GPress is freely available at https://github.com/qm2/gpress. Supplementary information: Supplementary data are available at Bioinformatics online.
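The block-plus-index idea described in this abstract can be illustrated with a minimal sketch. This is not the GPress implementation: it uses Python's standard `zlib` in place of BSC, skips the data transformations, and every name in it (`SAMPLE_GFF`, `compress_blocks`, `find_by_id`, `find_in_range`) is hypothetical. It assumes nine-column, tab-separated records with unique `ID=` attributes.

```python
import zlib

# Hypothetical sample annotations (tab-separated GFF columns; not taken from the paper).
SAMPLE_GFF = [
    "chr1\tsrc\tgene\t1000\t5000\t.\t+\t.\tID=gene1",
    "chr1\tsrc\texon\t1000\t2000\t.\t+\t.\tID=exon1;Parent=gene1",
    "chr1\tsrc\tgene\t150000\t160000\t.\t-\t.\tID=gene2",
    "chr1\tsrc\texon\t150000\t151000\t.\t-\t.\tID=exon2;Parent=gene2",
]


def attr_id(line):
    """Return the ID=... value from the attributes column, or None if absent."""
    for field in line.split("\t")[8].split(";"):
        if field.startswith("ID="):
            return field[3:]
    return None


def compress_blocks(lines, block_size=2):
    """Compress lines in fixed-size blocks and build two small index tables.

    coord_index[b] holds (min start, max end) of block b.
    id_index maps each identifier to the block that contains it
    (assumes identifiers are unique).
    """
    blocks, coord_index, id_index = [], [], {}
    for b, i in enumerate(range(0, len(lines), block_size)):
        chunk = lines[i:i + block_size]
        cols = [line.split("\t") for line in chunk]
        coord_index.append((min(int(c[3]) for c in cols),
                            max(int(c[4]) for c in cols)))
        for line in chunk:
            id_index[attr_id(line)] = b
        # zlib is a stand-in here; the abstract names BSC as the compressor.
        blocks.append(zlib.compress("\n".join(chunk).encode()))
    return blocks, coord_index, id_index


def read_block(blocks, b):
    """Decompress a single block back into its lines."""
    return zlib.decompress(blocks[b]).decode().split("\n")


def find_by_id(blocks, id_index, ident):
    """Decompress only the block that the ID index points to."""
    b = id_index.get(ident)
    if b is None:
        return []
    return [line for line in read_block(blocks, b) if attr_id(line) == ident]


def find_in_range(blocks, coord_index, lo, hi):
    """Return records fully inside [lo, hi], skipping blocks that cannot match."""
    hits = []
    for b, (bmin, bmax) in enumerate(coord_index):
        if bmax < lo or bmin > hi:
            continue  # block lies entirely outside the query range
        for line in read_block(blocks, b):
            c = line.split("\t")
            if lo <= int(c[3]) and int(c[4]) <= hi:
                hits.append(line)
    return hits
```

Calling `compress_blocks(SAMPLE_GFF)` and then `find_in_range(blocks, coord_index, 1000, 100000)` decompresses only the first block, which is the point of the design: a query pays the decompression cost only for the blocks the index cannot rule out.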


1993, Vol 30 (3), pp. 330-332
Author(s): Stephane Schwartz, Jon T. Kapala, Harry Rajchgot, Gordon L. Roberts

Inaccuracies exist in identifying and recording various types of oral cleft. One of the major reasons for this problem is that no efficient or universally accepted recording system presently exists. The RPL system introduced in this article provides an accurate and systematic numerical recording for the identification of various types of lip and maxillary clefts. The simplicity of the system allows quick and efficient data storage and retrieval.


Manual segmentation of brain tumors for malignancy prognosis, across the massive number of MRI images produced in routine clinical practice, is a hard and frustrating task, so there is a need for automated brain tumor image segmentation. The level of precision required for scientific purposes is not yet known and cannot easily be quantified, even by expert physicians. This is an interesting point that has been addressed only sparsely in the literature, yet it remains highly relevant. In addition, automated storage of medical images is now an essential need: fast analysis and prognosis require automated image storage. Hence, this paper focuses on the development of a new algorithm called “EasyGet” for automatic data storage and retrieval using the Hadoop architecture.


PeerJ, 2018, Vol 6, pp. e5611
Author(s): Rongjie Wang, Junyi Li, Yang Bai, Tianyi Zang, Yadong Wang

Dramatic increases in data produced by next-generation sequencing (NGS) technologies demand data compression tools for saving storage space. However, effective and efficient data compression for genome sequencing data has remained an unresolved challenge in NGS data studies. In this paper, we propose a novel alignment-free and reference-free compression method, BdBG, which is the first to compress genome sequencing data with dynamic de Bruijn graphs based on the data after bucketing. Compared with existing de Bruijn graph methods, BdBG stores only a list of bucket indexes and bifurcations for the raw read sequences, and this feature can effectively reduce storage space. Experimental results on several genome sequencing datasets show the effectiveness of BdBG over three state-of-the-art methods. BdBG is written in Python and is open-source software distributed under the MIT license, available for download at https://github.com/rongjiewang/BdBG.
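To make the "bucket indexes and bifurcations" idea concrete, here is a toy sketch under strong assumptions. It is not the BdBG algorithm: the bucketing rule (lexicographically smallest substring of length `m`), the static per-bucket graph, and all function names are illustrative choices that the abstract does not specify. The sketch only shows why recording choices at branching nodes is enough to rebuild a read.

```python
from collections import defaultdict


def minimizer(read, m):
    """Smallest length-m substring of the read (an assumed bucketing key)."""
    return min(read[i:i + m] for i in range(len(read) - m + 1))


def bucket_reads(reads, m):
    """Group reads that share the same minimizer."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, m)].append(read)
    return dict(buckets)


def build_graph(reads, k):
    """De Bruijn-style graph: each k-mer maps to the sorted bases that follow it."""
    succ = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            succ[read[i:i + k]].add(read[i + k])
    return {kmer: sorted(bases) for kmer, bases in succ.items()}


def encode(read, graph, k):
    """Encode a read as (first k-mer, length, branch choices).

    A choice is stored only where the graph bifurcates, i.e. where the
    current k-mer has more than one possible next base.
    """
    choices = []
    for i in range(len(read) - k):
        nxt = graph[read[i:i + k]]
        if len(nxt) > 1:
            choices.append(nxt.index(read[i + k]))
    return read[:k], len(read), choices


def decode(start, length, choices, graph, k):
    """Walk the graph from the first k-mer, consuming a choice at each bifurcation."""
    out = start
    pending = iter(choices)
    while len(out) < length:
        nxt = graph[out[-k:]]
        out += nxt[next(pending)] if len(nxt) > 1 else nxt[0]
    return out
```

For the two reads `"ACGTACGA"` and `"ACGTACGT"` with `k = 3`, the only bifurcation is at the k-mer `ACG`, so each read reduces to its first k-mer, its length, and two small integers.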


Author(s): Sterling P. Newberry

At the 1958 meeting of our society, then known as EMSA, the author introduced the concept of microspace and suggested its use to provide adequate information storage space and the use of electron microscope techniques to provide storage and retrieval access. At this current meeting of MSA, he wishes to suggest an additional use of the power of the electron microscope. The author has been contemplating this new use for some time and would have suggested it in the EMSA fiftieth year commemorative volume, but for page limitations. There is a compelling reason to put forth this suggestion today because problems have arisen in the “Standard Model” of particle physics and funds are being greatly reduced just as we need higher energy machines to resolve these problems. Therefore, any techniques which complement or augment what we can accomplish during this austerity period with the machines at hand are worth exploring.

