Discover True Association Rates in Multi-protein Complex Proteomics Data Sets

Author(s):  
Changyu Shen ◽  
Lang Li ◽  
Jake Yue Chen


F1000Research ◽
2018 ◽  
Vol 7 ◽  
pp. 741 ◽  
Author(s):  
Kevin Rue-Albrecht ◽  
Federico Marini ◽  
Charlotte Soneson ◽  
Aaron T.L. Lun

Data exploration is critical to the comprehension of large biological data sets generated by high-throughput assays such as sequencing. However, most existing tools for interactive visualisation are limited to specific assays or analyses. Here, we present the iSEE (Interactive SummarizedExperiment Explorer) software package, which provides a general visual interface for exploring data in a SummarizedExperiment object. iSEE is directly compatible with many existing R/Bioconductor packages for analysing high-throughput biological data, and provides useful features such as simultaneous examination of (meta)data and analysis results, dynamic linking between plots, and code tracking for reproducibility. We demonstrate the utility and flexibility of iSEE by applying it to explore a range of real transcriptomics and proteomics data sets.


2021 ◽  
Author(s):  
Benbo Gao ◽  
Jing Zhu ◽  
Soumya Negi ◽  
Xinmin Zhang ◽  
Stefka Gyoneva ◽  
...  

Abstract
Summary: We developed Quickomics, a feature-rich R Shiny-powered tool to enable biologists to fully explore complex omics data and perform advanced analysis in an easy-to-use interactive interface. It covers a broad range of secondary and tertiary analytical tasks after primary analysis of omics data is completed. Each functional module is equipped with customized configurations and generates both interactive and publication-ready high-resolution plots to uncover biological insights from data. The modular design makes the tool extensible with ease.
Availability: Researchers can experience the functionalities with their own data or with demo RNA-Seq and proteomics data sets by using the app hosted at http://quickomics.bxgenomics.com and following the tutorial at https://bit.ly/3rXIyhL. The source code, under the GPLv3 license, is provided at https://github.com/interactivereport/. Contact: [email protected], [email protected].
Supplementary information: Supplementary materials are available at https://bit.ly/37HP17g.


2021 ◽  
Author(s):  
Andrew J Kavran ◽  
Aaron Clauset

Abstract
Background: Large-scale biological data sets are often contaminated by noise, which can impede accurate inferences about underlying processes. Such measurement noise can arise from endogenous biological factors like cell cycle and life history variation, and from exogenous technical factors like sample preparation and instrument variation.
Results: We describe a general method for automatically reducing noise in large-scale biological data sets. This method uses an interaction network to identify groups of correlated or anti-correlated measurements that can be combined or "filtered" to better recover an underlying biological signal. Similar to the process of denoising an image, a single network filter may be applied to an entire system, or the system may first be decomposed into distinct modules and a different filter applied to each. Applied to synthetic data with known network structure and signal, network filters accurately reduce noise across a wide range of noise levels and structures. Applied to a machine learning task of predicting changes in human protein expression in healthy and cancerous tissues, network filtering prior to training increases accuracy by up to 43% compared to using unfiltered data.
Conclusions: Network filters are a general way to denoise biological data and can account for both correlation and anti-correlation between different measurements. Furthermore, we find that partitioning a network prior to filtering can significantly reduce errors in networks with heterogeneous data and correlation patterns, and this approach outperforms existing diffusion-based methods. Our results on proteomics data indicate the broad potential utility of network filters for applications in systems biology.
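The core idea of a network filter, blending each measurement with a sign-aware average of its network neighbors, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the adjacency matrix, edge signs, and blending weight `alpha` are all assumptions for the example.

```python
import numpy as np

def network_filter(values, adjacency, signs=None, alpha=0.5):
    """Denoise a vector of measurements by blending each node's value
    with the (sign-aware) mean of its network neighbors.

    values:    (n,) noisy measurements, one per node
    adjacency: (n, n) binary symmetric adjacency matrix
    signs:     (n, n) matrix of +1 (correlated) / -1 (anti-correlated)
               edge signs; defaults to all +1
    alpha:     weight given to the neighborhood mean
    """
    values = np.asarray(values, dtype=float)
    n = len(values)
    if signs is None:
        signs = np.ones((n, n))
    filtered = values.copy()
    for i in range(n):
        nbrs = np.flatnonzero(adjacency[i])
        if nbrs.size == 0:
            continue  # isolated node: keep the raw measurement
        # flip anti-correlated neighbors before averaging them in
        nbr_mean = np.mean(signs[i, nbrs] * values[nbrs])
        filtered[i] = (1 - alpha) * values[i] + alpha * nbr_mean
    return filtered
```

Module-wise filtering, as described in the abstract, would amount to partitioning the node set first and calling a (possibly different) filter on each block of the adjacency matrix.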


2018 ◽  
Vol 17 ◽  
pp. 117693511877108 ◽  
Author(s):  
Min Wang ◽  
Steven M Kornblau ◽  
Kevin R Coombes

Principal component analysis (PCA) is one of the most common techniques in the analysis of biological data sets, but applying PCA raises 2 challenges. First, one must determine the number of significant principal components (PCs). Second, because each PC is a linear combination of genes, it rarely has a biological interpretation. Existing methods to determine the number of PCs are either subjective or computationally extensive. We review several methods and describe a new R package, PCDimension, that implements additional methods, the most important being an algorithm that extends and automates a graphical Bayesian method. Using simulations, we compared the methods. Our newly automated procedure is competitive with the best methods when considering both accuracy and speed and is the most accurate when the number of objects is small compared with the number of attributes. We applied the method to a proteomics data set from patients with acute myeloid leukemia. Proteins in the apoptosis pathway could be explained using 6 PCs. By clustering the proteins in PC space, we were able to replace the PCs by 6 “biological components,” 3 of which could be immediately interpreted from the current literature. We expect this approach combining PCA with clustering to be widely applicable.
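One of the simpler rules in this family for choosing the number of significant PCs is the broken-stick model, followed by clustering of the protein loadings in the retained PC space. The sketch below illustrates that two-step idea with scikit-learn; it is not the PCDimension package nor the automated Bayesian algorithm the authors emphasize, and the cluster count is a free choice here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def broken_stick_components(X):
    """Choose the number of significant PCs via the broken-stick model:
    keep the leading components whose explained-variance fraction
    exceeds the expectation under a random partition of total variance."""
    pca = PCA().fit(X)
    frac = pca.explained_variance_ratio_
    p = len(frac)
    # broken-stick expectations: b_k = (1/p) * sum_{i=k..p} 1/i
    bstick = np.array([np.sum(1.0 / np.arange(k, p + 1)) / p
                       for k in range(1, p + 1)])
    d = 0
    for above in frac > bstick:
        if not above:
            break  # stop at the first component below expectation
        d += 1
    return max(d, 1), pca

def cluster_proteins(pca, d, n_clusters):
    """Group attributes (e.g. proteins) by their coordinates in the
    retained PC space, mirroring the replacement of PCs by clustered
    'biological components'."""
    loadings = pca.components_[:d].T  # one row per attribute
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(loadings)
```

The broken-stick rule is fully automatic and cheap, which is the trade-off space the abstract discusses; the paper's graphical Bayesian approach addresses the subjectivity of scree-plot inspection more directly.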


2018 ◽  
Vol 17 (4) ◽  
pp. 1547-1558 ◽  
Author(s):  
Salvador Martínez-Bartolomé ◽  
J. Alberto Medina-Aunon ◽  
Miguel Ángel López-García ◽  
Carmen González-Tejedo ◽  
Gorka Prieto ◽  
...  

2020 ◽  
Author(s):  
Weiping Ma ◽  
Sunkyu Kim ◽  
Shrabanti Chowdhury ◽  
Zhi Li ◽  
Mi Yang ◽  
...  

Abstract: Deep proteomics profiling using labelled LC-MS/MS experiments has proven powerful for studying complex diseases. However, due to the dynamic nature of discovery mass spectrometry, the generated data contain a substantial fraction of missing values. This poses great challenges for data analysis, as many tools, especially those for high-dimensional data, cannot handle missing values directly. To address this problem, the NCI-CPTAC Proteogenomics DREAM Challenge was carried out to develop effective imputation algorithms for labelled LC-MS/MS proteomics data through crowd learning. The resulting algorithm, DreamAI, is based on an ensemble of six different imputation methods. The imputation accuracy of DreamAI, as measured by correlation, is about 15%-50% greater than that of existing tools among less abundant proteins, which are more likely to be missing in proteomics data sets. This new tool enhances data analysis capabilities in proteomics research.
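The ensemble principle behind DreamAI, combining the outputs of several base imputers rather than trusting any single one, can be illustrated with a toy version. The three base imputers below are stand-ins chosen for the sketch; DreamAI ensembles six specific algorithms with its own weighting, which this does not reproduce.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

def ensemble_impute(X):
    """Toy ensemble imputation: run several base imputers on the same
    incomplete matrix and average their completed outputs. Observed
    entries are left untouched by every base imputer, so they survive
    the averaging unchanged."""
    imputers = [
        SimpleImputer(strategy="mean"),
        SimpleImputer(strategy="median"),
        KNNImputer(n_neighbors=2),
    ]
    completed = [imp.fit_transform(X) for imp in imputers]
    return np.mean(completed, axis=0)
```

Averaging hedges against any one imputer's bias, which matters most for the low-abundance proteins the abstract highlights, where single-method errors are largest.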


2021 ◽  
Vol 22 (17) ◽  
pp. 9650
Author(s):  
Miranda L. Gardner ◽  
Michael A. Freitas

Analysis of differential abundance in proteomics data sets requires careful application of missing value imputation. Missing abundance values vary widely when performing comparisons across different sample treatments. For example, one would expect a consistent rate of "missing at random" (MAR) across batches of samples and varying rates of "missing not at random" (MNAR) depending on the inherent differences between sample treatments within the study. A missing value imputation strategy must therefore be selected that accounts for both MAR and MNAR simultaneously. Two important issues must be considered when deciding on an imputation strategy: (1) when it is appropriate to impute data, and (2) how to choose a method that reflects the combination of MAR and MNAR occurring in an experiment. This paper provides an evaluation of missing value imputation strategies used in proteomics and presents a case for hybrid left-censored missing value imputation approaches that can handle the MNAR problem common to proteomics data.
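A hybrid left-censored scheme can be sketched as a two-branch rule: gaps concentrated in a protein are treated as MNAR and drawn from the low-abundance tail, while scattered gaps are treated as MAR and filled by a neighbor-based method. This is an illustrative sketch only; the MNAR threshold, the pseudo detection limit, and the noise scale are all assumptions, not values from the paper.

```python
import numpy as np
from sklearn.impute import KNNImputer

def hybrid_impute(X, mnar_threshold=0.5, rng_seed=0):
    """Illustrative hybrid MAR/MNAR imputation (not the paper's exact
    method): proteins (columns) missing in more than `mnar_threshold`
    of samples are treated as MNAR and drawn from a left-censored
    normal near a pseudo detection limit; the remaining gaps are
    treated as MAR and filled by kNN imputation."""
    X = np.asarray(X, dtype=float).copy()
    rng = np.random.default_rng(rng_seed)
    miss_rate = np.isnan(X).mean(axis=0)      # per-protein missing rate
    mnar_cols = np.flatnonzero(miss_rate > mnar_threshold)
    obs = X[~np.isnan(X)]
    lod = np.quantile(obs, 0.01)              # pseudo detection limit
    scale = 0.1 * obs.std() + 1e-12           # narrow left-tail spread
    for j in mnar_cols:
        rows = np.isnan(X[:, j])
        X[rows, j] = rng.normal(lod, scale, rows.sum())
    # remaining missing values assumed MAR -> neighbor-based fill
    return KNNImputer(n_neighbors=2).fit_transform(X)
```

The key design point is that the two mechanisms get different replacement distributions: MNAR values are pushed toward the detection limit rather than toward the observed mean, which a single global imputer cannot do.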


2018 ◽  
Author(s):  
Li Chen ◽  
Bai Zhang ◽  
Michael Schnaubelt ◽  
Punit Shah ◽  
Paul Aiyetan ◽  
...  

Abstract: Rapid development and wide adoption of mass spectrometry-based proteomics technologies have empowered scientists to study proteins and their modifications in complex samples on a large scale. This progress has also created unprecedented challenges for individual labs to store, manage and analyze proteomics data, both in the cost of proprietary software and high-performance computing and in the long processing times that discourage the on-the-fly changes of data processing settings required in exploratory and discovery analysis. We developed an open-source, cloud computing-based pipeline, MS-PyCloud, with graphical user interface (GUI) support, for LC-MS/MS data analysis. The major components of this pipeline include data file integrity validation, MS/MS database search for spectral assignment, false discovery rate estimation, protein inference, determination of protein post-translational modifications, and quantitation of specific (modified) peptides and proteins. To ensure the transparency and reproducibility of data analysis, MS-PyCloud includes open-source software tools with comprehensive testing and versioning for spectrum assignments. Leveraging public cloud computing infrastructure via Amazon Web Services (AWS), MS-PyCloud scales seamlessly with analysis demand to achieve fast and efficient performance. Application of the pipeline to large-scale iTRAQ/TMT LC-MS/MS data sets demonstrated the effectiveness and high performance of MS-PyCloud. The software can be downloaded at: https://bitbucket.org/mschnau/ms-pycloud/downloads/
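One pipeline step named in the abstract, false discovery rate estimation, is conventionally done with the target-decoy strategy, sketched below. This illustrates the general technique, not MS-PyCloud's implementation; the function name and the q-value convention used here are assumptions for the example.

```python
def target_decoy_fdr(scores, is_decoy):
    """Estimate q-values for peptide-spectrum matches with the classic
    target-decoy strategy: at each score threshold,
    FDR ~= (#decoys above threshold) / (#targets above threshold),
    and the q-value is the minimum FDR over all thresholds at least
    as permissive as the match's own score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    targets = decoys = 0
    fdr = [0.0] * len(scores)
    for i in order:                   # walk down from the best score
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdr[i] = decoys / max(targets, 1)
    qvals = [0.0] * len(scores)
    running_min = float("inf")
    for i in reversed(order):         # enforce monotone q-values
        running_min = min(running_min, fdr[i])
        qvals[i] = running_min
    return qvals
```

Filtering matches at, say, `q <= 0.01` then yields the familiar 1% PSM-level FDR cutoff.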

