PolySTest: Robust statistical testing of proteomics data with missing values improves detection of biologically relevant features


2020 ◽  
Vol 19 (8) ◽  
pp. 1396-1408 ◽  
Author(s):  
Veit Schwämmle ◽  
Christina E. Hagensen ◽  
Adelina Rogowska-Wrzesinska ◽  
Ole N. Jensen

Statistical testing remains one of the main challenges for high-confidence detection of differentially regulated proteins or peptides in large-scale quantitative proteomics experiments by mass spectrometry. Statistical tests need to be sufficiently robust to deal with experiment-intrinsic data structures and variations, and often also reduced feature coverage across different biological samples due to ubiquitous missing values. A robust statistical test provides accurate confidence scores of large-scale proteomics results, regardless of instrument platform, experimental protocol and software tools. However, the multitude of different combinations of experimental strategies, mass spectrometry techniques and informatics methods complicates the choice of appropriate statistical approaches. We address this challenge by introducing PolySTest, a user-friendly web service for statistical testing, data browsing and data visualization. We introduce a new method, Miss test, that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. We demonstrate that PolySTest with integrated Miss test achieves higher confidence and higher sensitivity for artificial and experimental proteomics data sets with known ground truth. Application of PolySTest to mass spectrometry-based large-scale proteomics data obtained from differentiating muscle cells resulted in the rescue of 10–20% additional proteins in the identified molecular networks relevant to muscle differentiation. We conclude that PolySTest is a valuable addition to existing tools and instrument enhancements that improve coverage and depth of large-scale proteomics experiments. A fully functional demo version of PolySTest and Miss test is available via http://computproteomics.bmb.sdu.dk/Apps/PolySTest.
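The abstract does not give the Miss test statistic itself; one hedged illustration of the missingness half of the idea is to score differential detection with a Fisher's exact test on per-condition detection counts (stdlib-only sketch; the published method may combine missingness and abundance differently):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on a 2x2 table [[a, b], [c, d]]
    (e.g. detected/missing replicate counts in two conditions),
    computed from the hypergeometric distribution."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def p_table(x):
        # P(first row contains x items from the first column), margins fixed
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = p_table(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # sum over all tables at least as extreme as the observed one
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))
```

For a protein detected in 0 of 4 replicates in one condition and 4 of 4 in the other, `fisher_exact_two_sided(0, 4, 4, 0)` gives 2/70 ≈ 0.029, i.e. evidence of condition-dependent missingness even though no abundance test is possible.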


2020 ◽  
Author(s):  
Eva Brombacher ◽  
Ariane Schad ◽  
Clemens Kreutz

Abstract High-throughput biological data – such as mass spectrometry-based proteomics data – suffer from systematic non-biological variance introduced by systematic errors such as batch effects. This hinders the estimation of 'real' biological signals and, thus, decreases the power of statistical tests and biases the identification of differentially expressed sample classes. To remove such unintended variation while retaining the biological signal of interest, analysis workflows for mass spectrometry-based quantification typically comprise normalization steps prior to the statistical analysis of the data. Several normalization methods, such as quantile normalization, were originally developed for microarray data. However, unlike microarray data, proteomics data may contain features, in the form of protein intensities, that are consistently highly abundant across experimental conditions and, hence, are encountered in the tails of the protein intensity distribution. If such proteins are present, statistical inference on the intensity profiles of the normalized features is impeded by an increased number of false positive findings due to the biased estimation of the variance of the data. We therefore developed a novel, freely available approach: 'tail-robust quantile normalization'. It extends traditional quantile normalization to preserve the biological signals of features in the tails of the distribution over experimental conditions and to account for sample-dependent missing values.
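The tail-robust extension itself is not specified in the abstract; for context, the classical quantile normalization it builds on can be sketched as follows (a minimal NumPy version, ties broken arbitrarily):

```python
import numpy as np

def quantile_normalize(X):
    """Classical quantile normalization: force every sample (column)
    to share the same intensity distribution, namely the mean of the
    per-sample sorted intensities, while preserving within-sample ranks."""
    order = np.argsort(X, axis=0)          # per-column rank ordering
    ref = np.sort(X, axis=0).mean(axis=1)  # shared reference distribution
    Xn = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        Xn[order[:, j], j] = ref           # assign reference value by rank
    return Xn
```

It is exactly this rank-based reassignment that distorts consistently extreme features: a protein sitting in the upper tail in every sample is forced onto the same reference value in all of them, which is the failure mode the tail-robust variant addresses.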


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Shisheng Wang ◽  
Hongwen Zhu ◽  
Hu Zhou ◽  
Jingqiu Cheng ◽  
Hao Yang

Abstract Background Mass spectrometry (MS) has become a promising analytical technique to acquire proteomics information for the characterization of biological samples. Nevertheless, most studies focus on the final proteins identified through a suite of algorithms by using partial MS spectra to compare with the sequence database, while the pattern recognition and classification of raw mass-spectrometric data remain unresolved. Results We developed an open-source and comprehensive platform, named MSpectraAI, for analyzing large-scale MS data through deep neural networks (DNNs); this system involves spectral-feature swath extraction, classification, and visualization. Moreover, this platform allows users to create their own DNN model by using Keras. To evaluate this tool, we collected the publicly available proteomics datasets of six tumor types (a total of 7,997,805 mass spectra) from the ProteomeXchange consortium and classified the samples based on their spectral profiles. The results suggest that MSpectraAI can distinguish different types of samples based on the fingerprint spectrum and achieve better prediction accuracy at the MS1 level (average 0.967). Conclusion This study deciphers proteome profiling of raw mass spectrometry data and broadens the promising application of the classification and prediction of proteomics data from multi-tumor samples using deep learning methods. MSpectraAI also shows better performance than classical machine learning approaches.
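MSpectraAI's actual swath-extraction step is not detailed in the abstract; a common way to turn variable-length (m/z, intensity) peak lists into fixed-length network inputs is equal-width m/z binning, sketched here (the bin range and count are illustrative assumptions, not the tool's defaults):

```python
def bin_spectrum(mz, intensity, mz_min=100.0, mz_max=1100.0, n_bins=1000):
    """Turn a raw (m/z, intensity) spectrum into a fixed-length vector
    by summing intensities into equal-width m/z bins, making spectra of
    different lengths comparable inputs for a neural network."""
    width = (mz_max - mz_min) / n_bins
    vec = [0.0] * n_bins
    for m, i in zip(mz, intensity):
        if mz_min <= m < mz_max:            # drop peaks outside the range
            vec[int((m - mz_min) / width)] += i
    return vec
```

The resulting vectors can then be stacked into a matrix and fed to any Keras-style classifier.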


2021 ◽  
Author(s):  
Jian Song ◽  
Changbin Yu

Abstract Label-free mass spectrometry-based proteomics data inevitably suffer from missing values, which prevent downstream analyses that require a complete data matrix. Our motivation is to introduce the state-of-the-art machine learning algorithm XGBoost to realize an imputation method with improved accuracy. In practice, however, XGBoost has many parameters that need to be tuned to deliver its potentially high performance. Although cross-validation may find the best parameters, it is very time-consuming. Alternatively, we empirically determined the parameters for two kinds of XGBoost base learners. To explore the robustness and performance of XGBoost-based imputation with predetermined parameters, we conducted tests on three benchmark datasets. For comparison, six common imputation methods were also evaluated in terms of normalized root mean squared error and Pearson correlation coefficient. The comparative results indicated that the XGBoost-based imputation method using the linear base learner is competitive with or outperforms its competitors, including random forest-based imputation, by achieving smaller imputation errors and better structure preservation under the empirical parameters for the three benchmark datasets.
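A full XGBoost-based imputer is out of scope here; the per-feature regression-imputation pattern can be illustrated with a simple least-squares learner standing in for the gradient-boosted one (purely a sketch of the scheme, not the paper's method):

```python
def impute_linear(x, y):
    """Impute missing entries of y (marked None) from a complete
    predictor feature x, by least-squares regression fitted on the rows
    where y is observed -- a stand-in for the boosted learner that an
    XGBoost-based imputer would fit per feature."""
    obs = [(a, b) for a, b in zip(x, y) if b is not None]
    n = len(obs)
    mx = sum(a for a, _ in obs) / n
    my = sum(b for _, b in obs) / n
    sxx = sum((a - mx) ** 2 for a, _ in obs)
    sxy = sum((a - mx) * (b - my) for a, b in obs)
    slope = sxy / sxx
    # keep observed values, fill the gaps with fitted predictions
    return [b if b is not None else my + slope * (a - mx)
            for a, b in zip(x, y)]
```

With `x = [1, 2, 3, 4]` and `y = [2, 4, None, 8]` the missing entry is filled with 6.0, since the observed pairs lie exactly on y = 2x.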


2020 ◽  
Author(s):  
Constantin Ahlmann-Eltze ◽  
Simon Anders

Abstract Protein mass spectrometry with label-free quantification (LFQ) is widely used for quantitative proteomics studies. Nevertheless, well-principled statistical inference procedures are still lacking, and most practitioners adopt methods from transcriptomics. These, however, cannot properly treat the principal complication of label-free proteomics, namely many non-randomly missing values. We present proDA, a method to perform statistical tests for differential abundance of proteins. It models missing values in an intensity-dependent probabilistic manner. proDA is based on linear models and thus suitable for complex experimental designs, and boosts statistical power for small sample sizes by using variance moderation. We show that the currently widely used methods based on ad hoc imputation schemes can report excessive false positives, and that proDA not only overcomes this serious issue but also offers high sensitivity. Thus, proDA fills a crucial gap in the toolbox of quantitative proteomics.
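proDA models dropout with an inverse-probit curve on latent intensities; the sketch below substitutes a logistic curve (a plain stand-in, not proDA's actual parameterization) just to illustrate the intensity-dependent missingness that the method exploits instead of imputing:

```python
import math

def dropout_prob(log_intensity, midpoint=18.0, scale=1.0):
    """Probability that a peptide with the given latent log-intensity is
    NOT detected: low-abundance features drop out more often.  Missing
    values thus carry information about intensity, which a probabilistic
    model can use directly rather than filling in ad hoc values."""
    return 1.0 / (1.0 + math.exp((log_intensity - midpoint) / scale))
```

At the curve's midpoint a feature is missing half the time; well above it, dropout is rare; well below it, dropout is near-certain, so a run of missing values in one condition is itself evidence of low abundance.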


2018 ◽  
Author(s):  
Chi Tung Choy ◽  
Chi Hang Wong ◽  
Stephen Lam Chan

Abstract Artificial neural networks (ANNs) have been utilized for classification and prediction tasks with remarkable accuracy. However, their use for unsupervised data mining of molecular data is under-explored. We adopted an unsupervised ANN method, namely word embedding, to extract biologically relevant information from the TCGA gene expression dataset. Ground-truth relationships, such as the cancer type of input samples and the semantic meaning of genes, were shown to be retained in the resulting entity matrices. We also demonstrated the interpretability and usage of these matrices in shortlisting candidates from a long gene list. This method can feasibly mine large volumes of biological data and would be a valuable tool for discovering novel knowledge from omics data. The resulting embedding matrices mined from TCGA gene expression data are interactively explorable online (http://bit.ly/tcga-embedding-cancer) and could serve as an informative reference.
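As a toy illustration of shortlisting candidates from an embedding matrix (the gene names and vectors below are made up for the example, not taken from the TCGA matrices), one can rank genes by cosine similarity to a query gene:

```python
import math

def shortlist(query, embeddings, k=3):
    """Rank genes by cosine similarity of their embedding vectors to a
    query gene -- the kind of nearest-neighbour lookup used to shortlist
    candidates from a long gene list."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    q = embeddings[query]
    scored = [(g, cos(q, v)) for g, v in embeddings.items() if g != query]
    return sorted(scored, key=lambda t: -t[1])[:k]
```

Genes whose vectors point in a similar direction to the query are returned first, which is how semantically related entities surface from the embedding.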


Author(s):  
Oliver M. Crook ◽  
Aikaterini Geladaki ◽  
Daniel J.H. Nightingale ◽  
Owen Vennard ◽  
Kathryn S. Lilley ◽  
...  

Abstract The cell is compartmentalised into complex micro-environments allowing an array of specialised biological processes to be carried out in synchrony. Determining a protein’s sub-cellular localisation to one or more of these compartments can therefore be a first step in determining its function. High-throughput and high-accuracy mass spectrometry-based sub-cellular proteomic methods can now shed light on the localisation of thousands of proteins at once. Machine learning algorithms are then typically employed to make protein-organelle assignments. However, these algorithms are limited by insufficient and incomplete annotation. We propose a semi-supervised Bayesian approach to novelty detection, allowing the discovery of additional, previously unannotated sub-cellular niches. Inference in our model is performed in a Bayesian framework, allowing us to quantify uncertainty in the allocation of proteins to new sub-cellular niches, as well as in the number of newly discovered compartments. We apply our approach across 10 mass spectrometry-based spatial proteomic datasets, representing a diverse range of experimental protocols. Application of our approach to hyperLOPIT datasets validates its utility by recovering enrichment of chromatin-associated proteins without annotation and uncovers sub-nuclear compartmentalisation which was not identified in the original analysis. Moreover, using sub-cellular proteomics data from Saccharomyces cerevisiae, we uncover a novel group of proteins trafficking from the ER to the early Golgi apparatus. Overall, we demonstrate the potential for novelty detection to yield biologically relevant niches that are missed by current approaches.
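The semi-supervised Bayesian model itself is beyond a short sketch, but its key output — the posterior probability that a protein belongs to an unannotated niche rather than any known organelle — can be illustrated with a toy one-dimensional mixture whose novelty component is a flat density (all components, weights and numbers here are hypothetical):

```python
import math

def novelty_responsibility(x, known_comps, novelty_density=0.01,
                           novelty_weight=0.05):
    """Posterior probability that observation x was generated by an
    unannotated 'novelty' component (a flat density) rather than any
    known organelle component (Gaussians with given weight, mean, sd)."""
    def gauss(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (
            sigma * math.sqrt(2.0 * math.pi))
    known = sum(w * gauss(x, mu, s) for w, mu, s in known_comps)
    novel = novelty_weight * novelty_density
    return novel / (novel + known)
```

Points well explained by an annotated component get near-zero novelty probability, while points far from every known component are allocated to the novelty component with high posterior probability, which is the mechanism that lets new niches emerge.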


2019 ◽  
Vol 3 (3) ◽  
pp. 827-847 ◽  
Author(s):  
Leonardo Novelli ◽  
Patricia Wollstadt ◽  
Pedro Mediano ◽  
Michael Wibral ◽  
Joseph T. Lizier

Network inference algorithms are valuable tools for the study of large-scale neuroimaging datasets. Multivariate transfer entropy is well suited for this task, being a model-free measure that captures nonlinear and lagged dependencies between time series to infer a minimal directed network model. Greedy algorithms have been proposed to efficiently deal with high-dimensional datasets while avoiding redundant inferences and capturing synergistic effects. However, multiple statistical comparisons may inflate the false positive rate and are computationally demanding, which limited the size of previous validation studies. The algorithm we present—as implemented in the IDTxl open-source software—addresses these challenges by employing hierarchical statistical tests to control the family-wise error rate and to allow for efficient parallelization. The method was validated on synthetic datasets involving random networks of increasing size (up to 100 nodes), for both linear and nonlinear dynamics. The performance increased with the length of the time series, reaching consistently high precision, recall, and specificity (>98% on average) for 10,000 time samples. Varying the statistical significance threshold showed a more favorable precision-recall trade-off for longer time series. Both the network size and the sample size are one order of magnitude larger than previously demonstrated, showing feasibility for typical EEG and magnetoencephalography experiments.
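For a flavor of the core quantity, a stdlib plug-in estimator of bivariate, history-length-1 transfer entropy on discrete time series can be written as follows; IDTxl's multivariate, greedy, permutation-tested estimator is considerably more involved:

```python
from collections import Counter
from math import log2

def transfer_entropy(x, y):
    """Plug-in estimate of transfer entropy X -> Y (history length 1)
    for discrete series: the reduction in uncertainty about y[t+1] from
    knowing x[t], beyond what y[t] already provides."""
    n = len(y) - 1
    c_xyz = Counter((y[t + 1], y[t], x[t]) for t in range(n))
    c_yz = Counter((y[t + 1], y[t]) for t in range(n))
    c_xz = Counter((y[t], x[t]) for t in range(n))
    c_z = Counter(y[t] for t in range(n))
    te = 0.0
    for (y1, y0, x0), c in c_xyz.items():
        p = c / n
        # log ratio of p(y1 | y0, x0) to p(y1 | y0)
        te += p * log2((c / c_xz[(y0, x0)]) / (c_yz[(y1, y0)] / c_z[y0]))
    return te
```

When y simply copies x with a one-step lag the estimate is strictly positive, and when x carries no information about y's next value it is exactly zero, which is what the hierarchical significance tests in IDTxl are designed to decide at scale.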


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Olga Permiakova ◽  
Romain Guibert ◽  
Alexandra Kraut ◽  
Thomas Fortin ◽  
Anne-Marie Hesse ◽  
...  

Abstract Background The clustering of data produced by liquid chromatography coupled to mass spectrometry analyses (LC-MS data) has recently gained interest as a way to extract meaningful chemical or biological patterns. However, recent instrumental pipelines deliver data whose size, dimensionality and expected number of clusters are too large to be processed by classical machine learning algorithms, so that most of the state of the art relies on single-pass linkage-based algorithms. Results We propose a clustering algorithm that solves the powerful but computationally demanding kernel k-means objective function in a scalable way. As a result, it can process LC-MS data in an acceptable time on a multicore machine. To do so, we combine three essential features: a compressive data representation, Nyström approximation and a hierarchical strategy. In addition, we propose new kernels based on optimal transport, which can be interpreted as intuitive similarity measures between chromatographic elution profiles. Conclusions Our method, referred to as CHICKN, is evaluated on proteomics data produced in our lab, as well as on benchmark data from the literature. From a computational viewpoint, it is particularly efficient on raw LC-MS data. From a data analysis viewpoint, it provides clusters which differ from those resulting from state-of-the-art methods, while achieving similar performance. This highlights the complementarity of differently principled algorithms in extracting the best from complex LC-MS data.
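The Nyström step can be sketched in a few lines: the full n × n kernel matrix is approximated from only its block against m landmark points, so O(nm) kernel evaluations replace O(n²). The RBF kernel and toy data below are illustrative; CHICKN combines this with compressive representations and optimal-transport kernels:

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def nystrom_kernel(X, landmarks, gamma=0.5):
    """Nystrom approximation K ~= C W^+ C^T, reconstructing the full
    kernel from the n x m cross-kernel C and the m x m landmark
    kernel W (pseudo-inverted for numerical safety)."""
    C = rbf(X, landmarks, gamma)          # n x m cross-kernel block
    W = rbf(landmarks, landmarks, gamma)  # m x m landmark kernel
    return C @ np.linalg.pinv(W) @ C.T
```

With all points used as landmarks the approximation is exact; in practice m is kept far smaller than n, which is what makes the kernel k-means objective tractable at LC-MS scale.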


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Mathias Kalxdorf ◽  
Torsten Müller ◽  
Oliver Stegle ◽  
Jeroen Krijgsveld

Abstract Label-free proteomics by data-dependent acquisition enables the unbiased quantification of thousands of proteins; however, it notoriously suffers from high rates of missing values, thus prohibiting consistent protein quantification across large sample cohorts. To solve this, we here present IceR (Ion current extraction Re-quantification), an efficient and user-friendly quantification workflow that combines the high identification rates of data-dependent acquisition with low missing value rates similar to data-independent acquisition. Specifically, IceR uses ion current information for a hybrid peptide identification propagation approach with superior quantification precision, accuracy, reliability and data completeness compared to other quantitative workflows. Applied to plasma and single-cell proteomics data, IceR enhanced the number of reliably quantified proteins, improved discriminability between single-cell populations, and allowed reconstruction of a developmental trajectory. IceR will be useful for improving the performance of large-scale global as well as low-input proteomics applications, facilitated by its availability as an easy-to-use R package.
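IceR's hybrid propagation is more elaborate, but the underlying "match between runs" idea — summing the ion current inside a target m/z × retention-time window of a run that lacks the identification — can be sketched as follows (the peak tuples and tolerances are made-up illustrative values):

```python
def extract_ion_current(peaks, target_mz, target_rt,
                        mz_tol=0.01, rt_tol=0.5):
    """Sum the ion current inside an m/z x retention-time window of a
    run where a peptide was identified only in another run -- the basic
    identification-propagation step that IceR refines with its hybrid
    re-quantification approach.  peaks: iterable of (mz, rt, intensity)."""
    return sum(intensity for mz, rt, intensity in peaks
               if abs(mz - target_mz) <= mz_tol
               and abs(rt - target_rt) <= rt_tol)
```

Peaks outside the tolerance window are ignored, so a quantitative value can be recovered for the run even though its own identification step produced a missing value.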

