MassComp, a lossless compressor for mass spectrometry data

Mapping Intimacies ◽

10.1101/542894 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ruochen Yang ◽

Xi Chen ◽

Idoia Ochoa

Keyword(s):

Mass Spectrometry ◽

Numerical Data ◽

Mass Spectrometry Data ◽

Compression Algorithms ◽

Compression Performance ◽

Average Improvement ◽

The Family ◽

Efficient Representation ◽

Cost Efficient ◽

Biology Research

Background: Mass Spectrometry (MS) is a widely used technique in biology research, and has become key in proteomics and metabolomics analyses. As a result, the amount of MS data has significantly increased in recent years. For example, the MS repository MassIVE contains more than 123TB of data. Somewhat surprisingly, these data are stored uncompressed, hence incurring a significant storage cost. Efficient representation of these data is therefore paramount to lessen the burden of storage and facilitate their dissemination. Results: We present MassComp, a lossless compressor optimized for the numerical (m/z)-intensity pairs that account for most of the MS data. We tested MassComp on several MS datasets and show that it delivers on average a 46% reduction in the size of the numerical data, and up to 89%. These results correspond to an average improvement of more than 27% when compared to the general compressor gzip and of 40% when compared to the state-of-the-art numerical compressor FPC. When tested on entire files retrieved from the MassIVE repository, MassComp achieves on average a 59% size reduction. MassComp is written in C++ and freely available at https://github.com/iochoa/MassComp. Conclusions: The compression performance of MassComp demonstrates its potential to significantly reduce the footprint of MS data, and shows the benefits of designing specialized compression algorithms tailored to MS data. MassComp is an addition to the family of omics compression algorithms designed to lessen the storage burden and facilitate the exchange and dissemination of omics data.


Aird: A computation-oriented mass spectrometry data format enables higher compression ratio and less decoding time

10.1101/2020.10.14.338921 ◽

2020 ◽

Author(s):

Miaoshan Lu ◽

Shaowei An ◽

Ruimin Wang ◽

Jinyin Wang ◽

Changbin Yu

Keyword(s):

Mass Spectrometry ◽

Lossless Compression ◽

Mass Spectrometry Data ◽

Compression Rate ◽

File Size ◽

Data Format ◽

Compression Algorithms ◽

Link Type ◽

Processing Algorithms ◽

Data Independence

Abstract: As the precision of mass spectrometers increases and data-independent acquisition (DIA) emerges, file sizes are growing rapidly. Beyond the widely used open format mzML (Deutsch 2008), near-lossless and lossless compression algorithms and formats have emerged. The data precision is often related to the instrument and subsequent processing algorithms. Unlike storage-oriented formats, which focus more on lossless compression and compression rate, computation-oriented formats focus as much on decoding speed and disk read strategy as on compression rate. Here we describe "Aird", an open-source, computation-oriented format with controllable precision, flexible indexing strategies and a high compression rate. Aird uses JavaScript Object Notation (JSON) for metadata storage, and multiple indexing and reordered storage strategies to speed up random data access. Aird also provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data compression. Compared with Zlib alone, Aird reduces m/z data size by about 65% and takes only 33% of the decoding time. Availability: The Aird SDK is written in Java and allows scholars to access mass spectrometry data efficiently. It is available at https://github.com/Propro-Studio/Aird-SDK. AirdPro, which converts vendor files into Aird files, is available at https://github.com/Propro-Studio/AirdPro
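
The abstract above names a difference-based m/z compressor (ZDPD) but does not describe its internals. The following is only a minimal sketch of the general idea of combining fixed-precision integers, differencing, and zlib. It is not Aird's implementation: the PforDelta stage is omitted, and the function names `compress_mz` and `decompress_mz` and the `precision` parameter are hypothetical.

```python
import struct
import zlib


def compress_mz(mz_values, precision=1e-4):
    """Quantize m/z values to integers, delta-encode them, then zlib-compress."""
    # Assumed fixed-point step: each value is stored as a multiple of `precision`.
    ints = [round(v / precision) for v in mz_values]
    # First entry is the first value itself; the rest are successive differences,
    # which are small when the m/z values are sorted.
    deltas = [b - a for a, b in zip([0] + ints, ints)]
    raw = struct.pack(f"<{len(deltas)}q", *deltas)  # little-endian signed 64-bit
    return zlib.compress(raw)


def decompress_mz(blob, precision=1e-4):
    """Invert compress_mz: inflate, undo the deltas, rescale to floats."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 8}q", raw)
    values, acc = [], 0
    for d in deltas:
        acc += d  # running sum restores the quantized integer
        values.append(acc * precision)
    return values
```

Under these assumptions, each restored value differs from the input by at most half of `precision`, so the scheme is lossless only for data that already lies on the chosen precision grid.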


A recombineering pipeline to clone large and complex genes in Chlamydomonas

The Plant Cell ◽

10.1093/plcell/koab024 ◽

2021 ◽

Author(s):

Tom Z Emrich-Mills ◽

Gary Yates ◽

James Barrett ◽

Philipp Girr ◽

Irina Grouneva ◽

...

Keyword(s):

Mass Spectrometry ◽

Bacterial Artificial Chromosome ◽

Molecular Biology ◽

Fluorescent Protein ◽

Artificial Chromosome ◽

Research Community ◽

Mass Spectrometry Data ◽

Parallel Cloning ◽

Mutant Complementation ◽

Biology Research

Abstract The ability to clone genes has greatly advanced cell and molecular biology research, enabling researchers to generate fluorescent protein fusions for localization and confirm genetic causation by mutant complementation. Most gene cloning is PCR or DNA synthesis dependent, which can become costly and technically challenging as genes increase in size, particularly if they contain complex regions. This has been a long-standing challenge for the Chlamydomonas reinhardtii research community, as this alga has a high percentage of genes containing complex sequence structures. Here we overcame these challenges by developing a recombineering pipeline for the rapid parallel cloning of genes from a Chlamydomonas bacterial artificial chromosome collection. To generate fluorescent protein fusions for localization, we applied the pipeline at both batch and high-throughput scales to 203 genes related to the Chlamydomonas CO2 concentrating mechanism (CCM), with an overall cloning success rate of 77%. Cloning success was independent of gene size and complexity, with cloned genes as large as 23 kilobases. Localization of a subset of CCM targets confirmed previous mass spectrometry data, identified new pyrenoid components, and enabled complementation of mutants. We provide vectors and detailed protocols to facilitate easy adoption of this technology, which we envision will open up new possibilities in algal and plant research.


mspack: efficient lossless and lossy mass spectrometry data compression

Bioinformatics ◽

10.1093/bioinformatics/btab636 ◽

2021 ◽

Author(s):

Felix Hanau ◽

Hannes Röst ◽

Idoia Ochoa

Keyword(s):

Mass Spectrometry ◽

Lossless Compression ◽

Lossy Compression ◽

General Purpose ◽

Mass Spectrometry Data ◽

Supplementary Information ◽

Compression Algorithms ◽

Single File ◽

Comparable Accuracy ◽

Better Than

Motivation: Mass spectrometry data, used for proteomics and metabolomics analyses, have seen considerable growth in recent years. Aiming at reducing the associated storage costs, dedicated compression algorithms for Mass Spectrometry (MS) data have been proposed, such as MassComp and MSNumpress. However, these algorithms focus on either lossless or lossy compression, respectively, and do not exploit the additional redundancy existing across scans contained in a single file. We introduce mspack, a compression algorithm for MS data that exploits this additional redundancy and that supports both lossless and lossy compression, as well as the mzML and the legacy mzXML formats. mspack applies several lossless preprocessing transforms and optional lossy transforms with a configurable error, followed by the general-purpose compressors gzip or bsc to achieve a higher compression ratio. Results: We tested mspack on several datasets generated by commonly used mass spectrometry instruments. When used with the bsc compression backend, mspack achieves on average 76% smaller file sizes for lossless compression and 94% smaller file sizes for lossy compression, as compared to the original files. Lossless mspack achieves 10-60% lower file sizes than MassComp, and lossy mspack compresses 36-60% better than the lossy MSNumpress, for the same error, while exhibiting comparable accuracy and running time. Availability: mspack is implemented in C++ and freely available at https://github.com/fhanau/mspack under the Apache license. Supplementary information: Supplementary data are available at Bioinformatics online.
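
The abstract above mentions lossy transforms with a configurable error followed by a general-purpose compressor, without specifying the transforms. As a generic illustration only (not mspack's actual method), the sketch below quantizes strictly positive values in log space so that the relative error stays within a chosen bound, then compresses the integer codes with gzip. The names `lossy_compress`, `lossy_decompress`, and `max_rel_error` are hypothetical.

```python
import gzip
import math
import struct


def lossy_compress(values, max_rel_error=0.01):
    """Quantize positive values in log space, then gzip the integer codes."""
    # A log-space step of 2*ln(1 + e) keeps the relative error within e,
    # because rounding moves ln(x) by at most half a step.
    step = 2.0 * math.log1p(max_rel_error)
    codes = [round(math.log(x) / step) for x in values]  # assumes x > 0
    return gzip.compress(struct.pack(f"<{len(codes)}i", *codes))


def lossy_decompress(blob, max_rel_error=0.01):
    """Invert lossy_compress up to the configured relative error."""
    step = 2.0 * math.log1p(max_rel_error)
    raw = gzip.decompress(blob)
    codes = struct.unpack(f"<{len(raw) // 4}i", raw)
    return [math.exp(c * step) for c in codes]
```

The same `max_rel_error` must be passed to both functions. Zero intensities would need separate handling, which this sketch leaves out.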


154: Integration of TPSA and High-Throughput Mass Spectrometry Data Improves Prostate Cancer Prediction

The Journal of Urology ◽

10.1016/s0022-5347(18)30419-1 ◽

2007 ◽

Vol 177 (4S) ◽

pp. 52-53

Author(s):

Stefano Ongarello ◽

Eberhard Steiner ◽

Regina Achleitner ◽

Isabel Feuerstein ◽

Birgit Stenzel ◽

...

Keyword(s):

Prostate Cancer ◽

Mass Spectrometry ◽

High Throughput ◽

Mass Spectrometry Data ◽

Cancer Prediction


Nonparametric Pre-Processing Methods and Inference Tools for Analyzing Time-of-Flight Mass Spectrometry Data.

Current Analytical Chemistry ◽

10.2174/157341107780361718 ◽

2007 ◽

Vol 3 (2) ◽

pp. 127-147 ◽

Cited By ~ 8

Author(s):

Anestis Antoniadis ◽

Jeremie Bigot ◽

Sophie Lambert-Lacroix ◽

Frederique Letue

Keyword(s):

Mass Spectrometry ◽

Time Of Flight ◽

Mass Spectrometry Data ◽

Processing Methods ◽

Flight Mass Spectrometry


Ultra‐Fast Retroactive Processing by MetAlign of Liquid‐Chromatography High‐Resolution Full‐Scan Orbitrap Mass Spectrometry Data in WADA Human Urine Sample Monitoring Program

Rapid Communications in Mass Spectrometry ◽

10.1002/rcm.9141 ◽

2021 ◽

Author(s):

Safa Khelifi ◽

Khadija Saad ◽

Ariadni Vonaparti ◽

Souhila Mahieddine ◽

Sofia Salama ◽

...

Keyword(s):

Mass Spectrometry ◽

Liquid Chromatography ◽

High Resolution ◽

Urine Sample ◽

Human Urine ◽

Monitoring Program ◽

Mass Spectrometry Data ◽

Human Urine Sample ◽

Orbitrap Mass Spectrometry ◽

Full Scan


Interlaboratory Comparison of Untargeted Mass Spectrometry Data Uncovers Underlying Causes for Variability

Journal of Natural Products ◽

10.1021/acs.jnatprod.0c01376 ◽

2021 ◽

Author(s):

Trevor N. Clark ◽

Joëlle Houriet ◽

Warren S. Vidar ◽

Joshua J. Kellogg ◽

Daniel A. Todd ◽

...

Keyword(s):

Mass Spectrometry ◽

Interlaboratory Comparison ◽

Mass Spectrometry Data ◽

Underlying Causes


Deep Convolutional Neural Networks Help Scoring Tandem Mass Spectrometry Data in Database-Searching Approaches

Journal of Proteome Research ◽

10.1021/acs.jproteome.1c00315 ◽

2021 ◽

Vol 20 (10) ◽

pp. 4708-4717

Author(s):

Polina Kudriavtseva ◽

Matvey Kashkinov ◽

Attila Kertész-Farkas

Keyword(s):

Mass Spectrometry ◽

Neural Networks ◽

Tandem Mass Spectrometry ◽

Convolutional Neural Networks ◽

Mass Spectrometry Data ◽

Database Searching ◽

Tandem Mass ◽

Deep Convolutional Neural Networks ◽

Tandem Mass Spectrometry Data


TopPIC Gateway: A Web Gateway for Top-Down Mass Spectrometry Data Interpretation

Practice and Experience in Advanced Research Computing ◽

10.1145/3311790.3400853 ◽

2020 ◽

Author(s):

In Kwon Choi ◽

Eroma Abeysinghe ◽

Eric Coulter ◽

Suresh Marru ◽

Marlon Pierce ◽

...

Keyword(s):

Mass Spectrometry ◽

Data Interpretation ◽

Mass Spectrometry Data ◽

Top Down ◽

Top Down Mass Spectrometry


mzML—a Community Standard for Mass Spectrometry Data

Molecular & Cellular Proteomics ◽

10.1074/mcp.r110.000133 ◽

2010 ◽

Vol 10 (1) ◽

pp. R110.000133 ◽

Cited By ~ 363

Author(s):

Lennart Martens ◽

Matthew Chambers ◽

Marc Sturm ◽

Darren Kessner ◽

Fredrik Levander ◽

...

Keyword(s):

Mass Spectrometry ◽

Mass Spectrometry Data ◽

Community Standard
