Determining the Overall Merit of Protein Identification Data Sets: rho-Diagrams and rho-Scores

2007 ◽  
Vol 6 (5) ◽  
pp. 1997-2004 ◽  
Author(s):  
David Fenyo ◽  
Brett S. Phinney ◽  
Ronald C. Beavis
Author(s):  
Vu Anh Le ◽  
Cam Quyen Thi Phan ◽  
Thuy Huong Nguyen

The post-genomic era consists of experimental and computational efforts to meet the challenge of clarifying and understanding the function of genes and their products. Proteomic studies play a key role in this endeavour by complementing other functional genomics approaches; proteomics encompasses the large-scale analysis of complex mixtures, including the identification and quantification of proteins expressed under different conditions and the determination of their properties, modifications and functions. Understanding how biological processes are regulated at the protein level is crucial to understanding the molecular basis of diseases and often informs their prevention, diagnosis and treatment. High-throughput technologies are widely used in proteomics to analyze thousands of proteins. Specifically, mass spectrometry (MS) is an analytical technique for characterizing biological samples that is increasingly used in protein studies because it supports both targeted and untargeted analyses at high throughput. However, as large data sets are created, computational methods such as data mining techniques are required to analyze and interpret the relevant data. More specifically, applying data mining techniques to large proteomic data sets can assist in many interpretations of the data: it can reveal protein-protein interactions, improve protein identification, evaluate the experimental methods used, and facilitate diagnosis and biomarker discovery. With rapid advances in mass spectrometry instruments and experimental methodologies, MS-based proteomics has become a reliable and necessary tool for elucidating biological processes at the protein level. Over the past decade, we have witnessed a great expansion of our knowledge of human diseases through the adoption of MS-based proteomic technologies, which has led to many notable discoveries. Here, we review recent advances of data mining in MS-based proteomics in biomedical research.
Recent research in many fields shows that proteomics goes beyond the simple classification of proteins in biological systems and is finally realizing its initial potential as an essential tool to aid related disciplines, notably biomedical research. From here, data mining in MS-based proteomics has great potential to move beyond basic research into clinical research and diagnostics.
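As a minimal illustration of the kind of data mining discussed above, the sketch below projects a small, entirely invented protein-intensity matrix onto its first principal component to separate co-regulated proteins; the data, the group labels, and the use of plain SVD are all assumptions for illustration, not a method from the review.

```python
import numpy as np

# Hypothetical intensity matrix: 6 proteins (rows) x 4 conditions (columns).
X = np.array([
    [10.0, 12.0, 50.0, 55.0],
    [11.0, 13.0, 48.0, 52.0],
    [40.0, 42.0,  9.0,  8.0],
    [39.0, 41.0, 10.0,  9.0],
    [25.0, 24.0, 26.0, 25.0],
    [24.0, 26.0, 25.0, 24.0],
])

# Center each protein's profile, then project onto the first principal
# component (via SVD); proteins with similar regulation score together.
Xc = X - X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ Vt[0]

# The sign of the first-PC score separates the two regulation patterns;
# the flat proteins (rows 4-5) score near zero.
groups = np.sign(pc1_scores)
print(groups)
```

Real pipelines would add normalization, missing-value handling, and a proper clustering step, but the centering-then-projection pattern is the core of PCA-based exploration of expression matrices.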


2009 ◽  
Vol 8 (11) ◽  
pp. 2405-2417 ◽  
Author(s):  
Lukas Reiter ◽  
Manfred Claassen ◽  
Sabine P. Schrimpf ◽  
Marko Jovanovic ◽  
Alexander Schmidt ◽  
...  

2016 ◽  
Author(s):  
Zhuo A. Chen ◽  
Lutz Fischer ◽  
Jürgen Cox ◽  
Juri Rappsilber

Abstract
The conceptually simple step from cross-linking/mass spectrometry (CLMS) to quantitative cross-linking/mass spectrometry (QCLMS) is compounded by technical challenges. Currently, quantitative proteomics software is tightly integrated with the protein identification workflow. This prevents automatically quantifying other m/z features in a targeted manner, including those associated with cross-linked peptides. Here we present a new release of MaxQuant that permits starting the quantification process from an m/z feature list. Comparing the automated quantification to a carefully manually curated test set of cross-linked peptides, obtained by cross-linking C3 and C3b with BS3 and isotope-labeled BS3-d4, led to a number of observations: 1) The fully automated process using MaxQuant can quantify cross-links in our reference dataset with a 68% recall rate and 88% accuracy. 2) Hidden quantification errors can be converted into exposed failures by label-swap replicas, which makes label-swap replicas an essential part of QCLMS. 3) Cross-links that failed during automated quantification can be recovered by semi-automated re-quantification. The integrated workflow of MaxQuant and semi-automated assessment maximizes the number of quantified cross-links. In contrast, work on larger data sets or by less experienced users will benefit from full automation in MaxQuant.

Abbreviations
BS3: Bis[sulfosuccinimidyl] suberate
CLMS: Cross-linking/mass spectrometry
MS1: the initial mass-to-charge-ratio (m/z) spectrum collected for all components in a sample
QCLMS: Quantitative cross-linking/mass spectrometry
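A hypothetical illustration of how recall and accuracy against a curated reference set might be tallied; the cross-link names, the numbers, and the exact metric definitions below are assumptions for the sketch, not taken from the paper.

```python
# Curated reference cross-links (manually verified quantifications).
reference = {"XL1", "XL2", "XL3", "XL4", "XL5"}

# Cross-links the automated pipeline managed to quantify.
automated = {"XL1", "XL2", "XL3", "XL6"}

# Of the recovered ones, those whose quantified ratio matched the curation.
correct = {"XL1", "XL2"}

recovered = automated & reference
recall = len(recovered) / len(reference)     # fraction of reference recovered
accuracy = len(correct) / len(recovered)     # fraction of recovered that are right
print(recall, accuracy)
```

The label-swap replica mentioned in the abstract catches a different failure mode: a systematic quantification error that is consistent within one run only becomes visible when the isotope labels are exchanged between samples.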


2021 ◽  
Author(s):  
Maximilian T Strauss ◽  
Isabell Bludau ◽  
Wen-Feng Zeng ◽  
Eugenia Voytik ◽  
Constantin Ammar ◽  
...  

In common with other omics technologies, mass spectrometry (MS)-based proteomics produces ever-increasing amounts of raw data, making their efficient analysis a principal challenge. There is a plethora of different computational tools that process the raw MS data and derive peptide and protein identification and quantification. During the last decade, there has been dramatic progress in computer science and software engineering, including collaboration tools that have transformed research and industry. To leverage these advances, we developed AlphaPept, a Python-based open-source framework for efficient processing of large high-resolution MS data sets. Using Numba for just-in-time machine code compilation on CPU and GPU, we achieve hundred-fold speed improvements while maintaining clear syntax and rapid development speed. AlphaPept uses the Python scientific stack of highly optimized packages, reducing the code base to domain-specific tasks while providing access to the latest advances in machine learning. We provide an easy on-ramp for community validation and contributions through the concept of literate programming, implemented in Jupyter Notebooks of the different modules. A framework for continuous integration, testing, and benchmarking enforces solid software engineering principles. Large datasets can be processed rapidly, as shown by the analysis of hundreds of cellular proteomes in minutes per file, many-fold faster than the data acquisition. The AlphaPept framework can be used to build automated processing pipelines using efficient HDF5-based file formats, web-serving functionality, and compatibility with downstream analysis tools. Easy access is provided for end-users by one-click installation of the graphical user interface, for advanced users via a modular Python library, and for developers via a fully open GitHub repository.
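As a toy illustration of the Numba pattern described above (not AlphaPept's actual code), the sketch below JIT-compiles a small intensity-weighted centroid kernel, the kind of tight numeric loop that benefits from machine-code compilation, and falls back to plain Python when Numba is not installed:

```python
import numpy as np

try:
    from numba import njit  # JIT-compile to machine code when available
except ImportError:         # fall back so the sketch still runs without Numba
    def njit(func):
        return func

@njit
def centroid_mz(mz, intensity):
    # Intensity-weighted centroid of a peak: a typical inner-loop kernel
    # executed millions of times across the spectra of a large run.
    total = 0.0
    weighted = 0.0
    for i in range(mz.shape[0]):
        total += intensity[i]
        weighted += mz[i] * intensity[i]
    return weighted / total

mz = np.array([500.00, 500.01, 500.02])
inten = np.array([1.0, 3.0, 1.0])
print(centroid_mz(mz, inten))  # ~500.01
```

The explicit loop is deliberate: Numba compiles it to code competitive with C, whereas the same loop in pure Python would dominate the runtime; the function names and data here are illustrative only.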


Author(s):  
John A. Hunt

Spectrum-imaging is a useful technique for comparing different processing methods on very large data sets that are identical for each method. This paper is concerned with comparing methods of electron energy-loss spectroscopy (EELS) quantitative analysis on the Al-Li system. The spectrum-image analyzed here was obtained from an Al-10at%Li foil aged to produce δ' precipitates that can span the foil thickness. Two 1024-channel EELS spectra offset in energy by 1 eV were recorded and stored at each pixel in the 80x80 spectrum-image (25 Mbytes). An energy range of 39–89 eV (20 channels/eV) is represented. During processing the spectra are either subtracted to create an artifact-corrected difference spectrum, or the energy offset is numerically removed and the spectra are added to create a normal spectrum. The spectrum-images are processed into 2D floating-point images using methods and software described in [1].
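The two per-pixel processing modes can be sketched with synthetic data as follows; only the 1 eV offset, the 20 channels/eV dispersion, the 1024-channel length, and the 39 eV onset follow the text, while the Gaussian peak is invented:

```python
import numpy as np

channels_per_ev = 20
offset = 1 * channels_per_ev   # the 1 eV energy offset in channels

# Synthetic pair of 1024-channel spectra from one pixel: spectrum B is
# spectrum A shifted by exactly 1 eV on the energy axis.
energy = np.arange(1024) / channels_per_ev + 39.0
spec_a = np.exp(-0.5 * ((energy - 60.0) / 3.0) ** 2)
spec_b = np.exp(-0.5 * ((energy + 1.0 - 60.0) / 3.0) ** 2)

# Mode 1: subtract to form an artifact-corrected difference spectrum
# (detector channel artifacts, common to both, cancel in the subtraction).
difference = spec_a - spec_b

# Mode 2: numerically remove the offset, then add, for a normal spectrum
# with twice the counts on the overlapping energy range.
normal = spec_a[offset:] + spec_b[:-offset]
```

In the real data the per-channel gain variations are what the subtraction cancels; with ideal synthetic spectra the aligned channels simply agree exactly.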


Author(s):  
Mark Ellisman ◽  
Maryann Martone ◽  
Gabriel Soto ◽  
Eliezer Masliah ◽  
David Hessler ◽  
...  

Structurally-oriented biologists examine cells, tissues, organelles and macromolecules in order to gain insight into cellular and molecular physiology by relating structure to function. The understanding of these structures can be greatly enhanced by the use of techniques for the visualization and quantitative analysis of three-dimensional structure. Three projects from current research activities will be presented in order to illustrate both the present capabilities of computer-aided techniques as well as their limitations and future possibilities. The first project concerns the three-dimensional reconstruction of the neuritic plaques found in the brains of patients with Alzheimer's disease. We have developed a software package, "Synu", for investigation of 3D data sets, which has been used in conjunction with laser confocal light microscopy to study the structure of the neuritic plaque. Tissue sections of autopsy samples from patients with Alzheimer's disease were double-labeled for tau, a cytoskeletal marker for abnormal neurites, and synaptophysin, a marker of presynaptic terminals.


Author(s):  
Douglas L. Dorset

The quantitative use of electron diffraction intensity data for the determination of crystal structures represents the pioneering achievement in the electron crystallography of organic molecules, an effort largely begun by B. K. Vainshtein and his co-workers. However, despite numerous representative structure analyses yielding results consistent with X-ray determination, this entire effort was viewed with considerable mistrust by many crystallographers. This was no doubt due to the rather high crystallographic R-factors reported for some structures and, more importantly, the failure to convince many skeptics that the measured intensity data were adequate for ab initio structure determinations. We have recently demonstrated the utility of these data sets for structure analyses by direct phase determination based on the probabilistic estimate of three- and four-phase structure invariant sums. Examples include the structure of diketopiperazine using Vainshtein's 3D data, a similar 3D analysis of the room-temperature structure of thiourea, and a zonal determination of the urea structure, the latter also based on data collected by the Moscow group.
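For reference, the three-phase (triplet) structure invariant underlying such probabilistic direct phasing is commonly written, for normalized structure factors E and phases φ, as the Cochran relationship (with σ_n = Σ_j Z_j^n over the atomic numbers Z_j of the unit cell):

```latex
\Phi_{3} \;=\; \varphi_{\mathbf{h}} + \varphi_{\mathbf{k}} + \varphi_{-\mathbf{h}-\mathbf{k}} \;\approx\; 0,
\qquad
P(\Phi_{3}) \;\propto\; \exp\!\left(\kappa \cos \Phi_{3}\right),
\quad
\kappa \;=\; \frac{2\,\sigma_{3}}{\sigma_{2}^{3/2}}\,
\bigl|E_{\mathbf{h}}\,E_{\mathbf{k}}\,E_{-\mathbf{h}-\mathbf{k}}\bigr|
```

Triplets with large |E| products are therefore the most reliable phase estimates, which is why adequate intensity measurement, the point of contention noted above, matters so much for ab initio determination.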


Author(s):  
W. Shain ◽  
H. Ancin ◽  
H.C. Craighead ◽  
M. Isaacson ◽  
L. Kam ◽  
...  

Neural prostheses have the potential to restore nervous system functions lost to trauma or disease. Nanofabrication extends this approach to implants for stimulating and recording from single or small groups of neurons in the spinal cord and brain; however, tissue compatibility is a major limitation to their practical application. We are using a cell culture method for quantitatively measuring cell attachment to surfaces designed for nanofabricated neural prostheses. Silicon wafer test surfaces composed of 50-μm bars separated by aliphatic regions were fabricated using methods similar to a procedure described by Kleinfeld et al. Test surfaces contained either a single or double positive charge/residue. Cyanine dyes (DiIC18(3)) stained the background and cell membranes (Fig 1); however, identification of individual cells at higher densities was difficult (Fig 2). Nuclear staining with acriflavine allowed discrimination of individual cells and permitted automated counting of nuclei using 3-D data sets from the confocal microscope (Fig 3). For cell attachment assays, LRM55 astroglial cells and astrocytes in primary cell culture were plated at increasing cell densities on test substrates, incubated for 24 hr, fixed, stained, mounted on coverslips, and imaged with a 10x objective.
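A minimal sketch of the automated nucleus-counting step, assuming a threshold-plus-connected-components approach on a synthetic 3-D confocal stack; the intensities, threshold, and stack geometry are invented, and the actual software used in the study is not reproduced here.

```python
import numpy as np
from scipy import ndimage

# Hypothetical 3-D confocal stack (z, y, x): stained nuclei appear as
# bright voxel blobs on a dark background.
stack = np.zeros((4, 16, 16))
stack[1, 2:5, 2:5] = 200.0      # nucleus 1
stack[2, 9:12, 9:12] = 180.0    # nucleus 2
stack[1:3, 12:14, 3:5] = 220.0  # nucleus 3, spanning two optical sections

# Threshold, then label connected components in 3-D; each component
# corresponds to one nucleus, giving an automated count.
mask = stack > 100.0
labels, n_nuclei = ndimage.label(mask)
print(n_nuclei)  # 3
```

Counting in 3-D rather than per-section is what resolves the high-density ambiguity noted above: two nuclei overlapping in one optical section separate once the full stack is labeled.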


Author(s):  
Thomas W. Shattuck ◽  
James R. Anderson ◽  
Neil W. Tindale ◽  
Peter R. Buseck

Individual particle analysis involves the study of tens of thousands of particles using automated scanning electron microscopy and elemental analysis by energy-dispersive x-ray emission spectroscopy (EDS). EDS produces large data sets that must be analyzed using multivariate statistical techniques. A complete study uses cluster analysis, discriminant analysis, and factor or principal components analysis (PCA). The three techniques are used in the study of particles sampled during the FeLine cruise to the mid-Pacific Ocean in the summer of 1990. The mid-Pacific aerosol provides information on long-range particle transport, iron deposition, sea salt ageing, and halogen chemistry. Aerosol particle data sets suffer from a number of difficulties for pattern recognition using cluster analysis. There is a great disparity in the number of observations per cluster and the range of the variables in each cluster. The variables are not normally distributed, they are subject to considerable experimental error, and many values are zero because of finite detection limits. Many of the clusters show considerable overlap because of natural variability, agglomeration, and chemical reactivity.
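The pre-processing issues noted above (non-normal variables, very different ranges, zeros from finite detection limits) are commonly handled before clustering by a log transform with an offset followed by per-variable standardization; the sketch below shows that pattern on invented counts and is not the paper's own procedure.

```python
import numpy as np

# Hypothetical per-particle element counts (rows: particles; columns:
# three elements with very different abundances; zeros are below the
# detection limit).
counts = np.array([
    [  0.0, 520.0, 3.0],
    [  2.0, 480.0, 0.0],
    [150.0,  10.0, 0.0],
    [140.0,  12.0, 5.0],
])

# log1p compresses the skewed dynamic range while handling exact zeros;
# standardization then puts every variable on the same scale.
logged = np.log1p(counts)
z = (logged - logged.mean(axis=0)) / logged.std(axis=0)

# Each column now has mean 0 and unit variance, so no single abundant
# element dominates the Euclidean distances used by cluster analysis.
print(z.mean(axis=0), z.std(axis=0))
```

Disparities in cluster sizes and overlapping clusters still require care in the clustering step itself, but scaling at least keeps the distance metric from being an artifact of units.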

