Data mining in mass spectrometry-based proteomics studies

Science & Technology Development Journal - Engineering and Technology ◽

10.32508/stdjet.v2i4.483 ◽

2020 ◽

Vol 2 (4) ◽

pp. 258-276

Author(s):

Vu Anh Le ◽

Cam Quyen Thi Phan ◽

Thuy Huong Nguyen

Keyword(s):

Mass Spectrometry ◽

Data Mining ◽

Biomedical Research ◽

Protein Level ◽

Protein Identification ◽

Biomarker Discovery ◽

Data Sets ◽

Biological Processes ◽

Data Mining Techniques ◽

Analytical Technique

The post-genomic era consists of experimental and computational efforts to meet the challenge of clarifying and understanding the function of genes and their products. Proteomic studies play a key role in this endeavour by complementing other functional genomics approaches. They encompass the large-scale analysis of complex mixtures, including the identification and quantification of proteins expressed under different conditions and the determination of their properties, modifications and functions. Understanding how biological processes are regulated at the protein level is crucial to understanding the molecular basis of diseases and often sheds light on their prevention, diagnosis and treatment. High-throughput technologies are widely used in proteomics to analyze thousands of proteins. Specifically, mass spectrometry (MS) is an analytical technique for characterizing biological samples and is increasingly used in protein studies because of its targeted, nontargeted, and high-performance capabilities. However, as large data sets are created, computational methods such as data mining techniques are required to analyze and interpret the relevant data. More specifically, applying data mining techniques to large proteomic data sets can assist in many interpretations of the data: it can reveal protein-protein interactions, improve protein identification, evaluate the experimental methods used, and facilitate diagnosis and biomarker discovery. With the rapid advances in mass spectrometry devices and experimental methodologies, MS-based proteomics has become a reliable and necessary tool for elucidating biological processes at the protein level. Over the past decade, we have witnessed a great expansion of our knowledge of human diseases with the adoption of MS-based proteomic technologies, which has led to many interesting discoveries. Here, we review recent advances of data mining in MS-based proteomics in biomedical research.
Recent research in many fields shows that proteomics goes beyond the simple classification of proteins in biological systems and finally reaches its initial potential – as an essential tool to aid related disciplines, notably biomedical research. From here, there is great potential for data mining in MS-based proteomics to move beyond basic research, into clinical research and diagnostics.

Bibliomining for Library Decision-Making

Encyclopedia of Information Science and Technology, Second Edition ◽

10.4018/978-1-60566-026-4.ch058 ◽

2011 ◽

pp. 341-345

Author(s):

Scott Nicholson ◽

Jeffrey Stanton

Keyword(s):

Data Mining ◽

Digital Libraries ◽

Large Data ◽

Data Sets ◽

Data Mining Techniques ◽

Governmental Organizations ◽

Non Governmental Organizations ◽

The People ◽

The World ◽

Use Of The Internet

Most people think of a library as the little brick building in the heart of their community or the big brick building in the center of a campus. These notions greatly oversimplify the world of libraries, however. Most large commercial organizations have dedicated in-house library operations, as do schools, non-governmental organizations, as well as local, state, and federal governments. With the increasing use of the Internet and the World Wide Web, digital libraries have burgeoned, and these serve a huge variety of different user audiences. With this expanded view of libraries, two key insights arise. First, libraries are typically embedded within larger institutions. Corporate libraries serve their corporations, academic libraries serve their universities, and public libraries serve taxpaying communities who elect overseeing representatives. Second, libraries play a pivotal role within their institutions as repositories and providers of information resources. In the provider role, libraries represent in microcosm the intellectual and learning activities of the people who comprise the institution. This fact provides the basis for the strategic importance of library data mining: By ascertaining what users are seeking, bibliomining can reveal insights that have meaning in the context of the library’s host institution. Use of data mining to examine library data might be aptly termed bibliomining. With widespread adoption of computerized catalogs and search facilities over the past quarter century, library and information scientists have often used bibliometric methods (e.g., the discovery of patterns in authorship and citation within a field) to explore patterns in bibliographic information. During the same period, various researchers have developed and tested data mining techniques—advanced statistical and visualization methods to locate non-trivial patterns in large data sets. 
Bibliomining refers to the use of these bibliometric and data mining techniques to explore the enormous quantities of data generated by the typical automated library.

Data Mining in Protein Identification by Tandem Mass Spectrometry

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch074 ◽

2011 ◽

pp. 472-478

Author(s):

Haipeng Wang

Keyword(s):

Mass Spectrometry ◽

Data Mining ◽

Tandem Mass Spectrometry ◽

Large Scale ◽

Protein Identification ◽

Peptide Fragmentation ◽

Tandem Mass ◽

Fragmentation Pattern ◽

Post Translational Modification ◽

Challenges And Opportunities

Protein identification (sequencing) by tandem mass spectrometry is a fundamental technique for proteomics, which studies the structures and functions of proteins on a large scale and acts as a complement to genomics. Analysis and interpretation of the vast amounts of spectral data generated in proteomics experiments present unprecedented challenges and opportunities for data mining in areas such as data preprocessing, peptide-spectrum matching, results validation, peptide fragmentation pattern discovery and modeling, and post-translational modification (PTM) analysis. This article introduces the basic concepts and terms of protein identification and briefly reviews the relevant state-of-the-art data mining applications. It also outlines challenges and potential future hot spots in this field.
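
As a rough illustration of the peptide-spectrum matching step mentioned in this abstract, the sketch below scores candidate peptides by counting theoretical b- and y-ion fragments that fall near an observed peak. It is not the method of the cited article: the shared-peak-count score, the function names, the 0.02 Da tolerance, and the reduced residue-mass table are all assumptions chosen for illustration.

```python
# Monoisotopic residue masses in Da; only a small subset, for illustration.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496, "E": 129.04259,
}
WATER = 18.01056   # mass of H2O added to y-ions
PROTON = 1.00728   # charge carrier for singly charged ions


def fragment_ions(peptide):
    """Return singly charged b-ion then y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal prefix
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal suffix
    return b_ions + y_ions


def shared_peak_count(peptide, observed_mz, tol=0.02):
    """Count theoretical fragments with an observed peak within tol Da."""
    return sum(
        any(abs(theo - obs) <= tol for obs in observed_mz)
        for theo in fragment_ions(peptide)
    )


def best_match(candidates, observed_mz, tol=0.02):
    """Pick the candidate peptide with the highest shared peak count."""
    return max(candidates, key=lambda p: shared_peak_count(p, observed_mz, tol))
```

Real search engines use richer scoring models and handle charge states, modifications, and noise; this sketch only shows the basic idea of comparing theoretical and observed fragment masses.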

Management and Analysis of Mass Spectrometry Proteomics Data on the Grid

Handbook of Research on Computational Grid Technologies for Life Sciences, Biomedicine, and Healthcare ◽

10.4018/978-1-60566-374-6.ch011 ◽

2011 ◽

pp. 206-227

Author(s):

Mario Cannataro ◽

Pietro Hiram Guzzi ◽

Giuseppe Tradigo ◽

Pierangelo Veltri

Keyword(s):

Mass Spectrometry ◽

Protein Identification ◽

Biomarker Discovery ◽

Integrated Management ◽

Molecular Signature ◽

Proteomics Data ◽

Computational Proteomics ◽

Manual Inspection ◽

A Cell ◽

Grid Based

Recent advances in high-throughput technologies for analysing biological samples have enabled researchers to collect huge amounts of data. In particular, mass spectrometry-based proteomics uses mass spectrometry to investigate the proteins expressed in an organism or a cell. Manual inspection of spectra is infeasible, so a set of algorithms, tools and platforms is needed to manage and analyze them. Computational Proteomics concerns the computational methods for analyzing spectra data in qualitative proteomics (i.e. peptide/protein identification in tandem mass spectrometry) and quantitative proteomics (i.e. protein expression in samples), as well as in biomarker discovery (i.e. the identification of a molecular signature of a disease directly from spectra). This chapter presents the main standards, tools, and technologies for building scalable, reusable, and portable applications in this field. The chapter surveys available solutions for computational proteomics and includes a detailed description of MS-Analyzer, a Grid-based software platform for the integrated management and analysis of spectra data. MS-Analyzer provides efficient spectra management through a specialized spectra database, and supports the semantic composition of pre-processing and data mining services to analyze spectra on the Grid.

Determination of trace elements in bovine semen samples by inductively coupled plasma mass spectrometry and data mining techniques for identification of bovine class

Journal of Dairy Science ◽

10.3168/jds.2012-5515 ◽

2012 ◽

Vol 95 (12) ◽

pp. 7066-7073 ◽

Cited By ~ 15

Author(s):

G.F.M. Aguiar ◽

B.L. Batista ◽

J.L. Rodrigues ◽

L.R.S. Silva ◽

A.D. Campiglia ◽

...

Keyword(s):

Mass Spectrometry ◽

Data Mining ◽

Trace Elements ◽

Inductively Coupled Plasma ◽

Data Mining Techniques ◽

Bovine Semen ◽

Inductively Coupled ◽

Plasma Mass Spectrometry

Student Performance Predictions Using Knowledge Discovery Database and Data Mining, DPU Students Records as Sample

Academic Journal of Nawroz University ◽

10.25007/ajnu.v10n3a875 ◽

2021 ◽

Vol 10 (3) ◽

pp. 121-127

Author(s):

Bareen Haval ◽

Karwan Jameel Abdulrahman ◽

Araz Rajab

Keyword(s):

Data Mining ◽

Decision Tree ◽

Student Performance ◽

Educational Data Mining ◽

Data Sets ◽

Decision Tree Classifier ◽

Data Mining Techniques ◽

Academic History ◽

Tree Classifier ◽

Using Data

This article presents the results of applying educational data mining techniques to the academic performance of students. Three classification models (Decision Tree, Random Forest and Deep Learning) were developed to analyze data sets and predict student performance. The predictive performance of the three classifiers was calculated and compared. The academic history and data of the students from the Office of the Registrar were used to train the models. Our analysis aims to evaluate student results using various variables such as the student's grade. Data from 221 students with 9 different attributes were used. The results of this study provide a better understanding of student success assessments and stress the importance of data mining in education. The main purpose of this study is to show how student success can be forecast using data mining techniques in order to improve academic programs. The results indicate that the Decision Tree classifier outperforms the other two classifiers, achieving a total prediction accuracy of 97%.

Data Mining Techniques for Identification and Classification of Various Diseases in Plants

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b1110.1292s19 ◽

2019 ◽

Vol 9 (2S) ◽

pp. 676-680

Keyword(s):

Neural Network ◽

Data Mining ◽

Nearest Neighbors ◽

Crop Productivity ◽

Vital Role ◽

Support Vector ◽

Data Sets ◽

K Nearest Neighbors ◽

Data Mining Techniques

Data mining is currently used in a variety of applications and plays a vital role in the research community. This paper describes data mining techniques for the preprocessing and classification of various diseases in plants. Because different plants have different diseases, each has different data sets and different objectives for knowledge discovery. Data mining techniques applied to plants help in the segmentation and classification of diseased plants, avoid oral inspection, and help to increase crop productivity. This paper presents various classification techniques, such as K-Nearest Neighbors, Support Vector Machine, Principal Component Analysis, and Neural Network. Among these techniques, the neural network is effective for disease detection in plants.

Data Privacy Preservation and Security Approaches for Sensitive Data in Big Data

10.3233/apc210221 ◽

2021 ◽

Author(s):

Rohit Ravindra Nikam ◽

Rekha Shahapurkar

Keyword(s):

Data Mining ◽

Data Analytics ◽

Data Privacy ◽

Privacy Preservation ◽

Large Data ◽

Research Area ◽

Data Sets ◽

Sensitive Information ◽

Sensitive Data ◽

Data Mining Techniques

Data mining is a technique by which the necessary data is extracted from large data sets. Privacy protection in data mining is about hiding sensitive information or identity without breaching security or losing data usability. Sensitive data contains confidential information about individuals, businesses, and governments, who must agree before their private data is shared or published. Preserving privacy in data mining has become a critical research area. Various evaluation metrics, such as performance in terms of time efficiency, data utility, and degree of complexity or resistance to data mining techniques, are used to assess privacy-preserving data mining techniques. Social media and smart phones produce tons of data every minute. The voluminous data produced by these different sources can be processed and analyzed to support decision making, but data analytics is vulnerable to breaches of privacy. One data analytics framework is the recommendation system, commonly used by e-commerce sites such as Amazon and Flipkart to recommend items to customers based on their purchasing habits, which leads to customers being characterized. This paper presents various privacy preservation techniques used by existing researchers, such as data anonymization, data randomization, generalization, and data permutation. We also analyze the gaps between various processes and privacy preservation methods and illustrate how to overcome such issues with new, innovative methods. Finally, our research summarizes the outcomes of the entire literature.
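
To make the generalization technique named in this abstract concrete, here is a minimal sketch that coarsens two quasi-identifiers and then checks whether every combination is shared by at least k records. The k-anonymity check, the field names (`age`, `zip`, `purchase`), and the bucket sizes are hypothetical choices for illustration and are not taken from the cited paper.

```python
from collections import Counter


def generalize(record, age_bucket=10, zip_digits=3):
    """Coarsen quasi-identifiers: bucket the age and mask trailing ZIP digits."""
    low = (record["age"] // age_bucket) * age_bucket
    masked = "*" * (len(record["zip"]) - zip_digits)
    return {
        "age": f"{low}-{low + age_bucket - 1}",
        "zip": record["zip"][:zip_digits] + masked,
        "purchase": record["purchase"],  # the sensitive value is left unchanged
    }


def is_k_anonymous(records, k, quasi_identifiers=("age", "zip")):
    """True if every quasi-identifier combination occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical purchase records, as a recommendation system might hold them.
records = [
    {"age": 23, "zip": "12345", "purchase": "book"},
    {"age": 27, "zip": "12399", "purchase": "phone"},
]
generalized = [generalize(r) for r in records]
```

Wider buckets make records harder to single out but reduce data utility, which is the trade-off the abstract's evaluation metrics are meant to capture.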

Experimental and Data Analysis Considerations for Three-Dimensional Mass Spectrometry Imaging in Biomedical Research

Molecular Imaging and Biology ◽

10.1007/s11307-020-01541-5 ◽

2020 ◽

Author(s):

D. R. N. Vos ◽

S. R. Ellis ◽

B. Balluff ◽

R. M. A. Heeren

Keyword(s):

Mass Spectrometry ◽

Data Analysis ◽

Biomedical Research ◽

Mass Spectrometry Imaging ◽

Three Dimensional ◽

Sample Volume ◽

Biological Processes ◽

Full Sample ◽

Single Slice ◽

Molecular Complexity

Mass spectrometry imaging (MSI) enables the visualization of molecular distributions on complex surfaces. It has been extensively used in the field of biomedical research to investigate healthy and diseased tissues. Most of the MSI studies are conducted in a 2D fashion where only a single slice of the full sample volume is investigated. However, biological processes occur within a tissue volume and would ideally be investigated as a whole to gain a more comprehensive understanding of the spatial and molecular complexity of biological samples such as tissues and cells. Mass spectrometry imaging has therefore been expanded to the 3D realm whereby molecular distributions within a 3D sample can be visualized. The benefit of investigating volumetric data has led to a quick rise in the application of single-sample 3D-MSI investigations. Several experimental and data analysis aspects need to be considered to perform successful 3D-MSI studies. In this review, we discuss these aspects as well as ongoing developments that enable 3D-MSI to be routinely applied to multi-sample studies.

Toward Reproducible Results from Targeted Metabolomic Studies: Perspectives for Data Pre-processing and a Basis for Analytic Pipeline Development

Current Topics in Medicinal Chemistry ◽

10.2174/1568026618666180711144323 ◽

2018 ◽

Vol 18 (11) ◽

pp. 883-895 ◽

Cited By ~ 6

Author(s):

Thomas Gross ◽

Mark Mapstone ◽

Ricardo Miramontes ◽

Robert Padilla ◽

Amrita K. Cheema ◽

...

Keyword(s):

Mass Spectrometry ◽

Best Practices ◽

Predictive Models ◽

Biomarker Discovery ◽

High Dimensional ◽

Data Sets ◽

Targeted Metabolomics ◽

Phenotypic Information ◽

Metabolomic Data ◽

Commercial Kit

Contemporary metabolomics experiments generate a rich array of complex high-dimensional data. Consequently, there have been concurrent efforts to develop methodological standards and analytical workflows to streamline the generation of meaningful biochemical and clinical inferences from raw data generated using an analytical platform like mass spectrometry. While such considerations have been frequently addressed in untargeted metabolomics (i.e., the broad survey of all distinguishable metabolites within a sample of interest), this methodological scrutiny has seldom been applied to data generated using commercial, targeted metabolomics kits. We suggest that this may, in part, account for past and more recent incomplete replications of previously specified biomarker panels. Herein, we identify common impediments challenging the analysis of raw, targeted metabolomic abundance data from a commercial kit and review methods to remedy these issues. In doing so, we propose an analytical pipeline suitable for the pre-processing of data for downstream biomarker discovery. Operational and statistical considerations for integrating targeted data sets across experimental sites and analytical batches are discussed, as are best practices for developing predictive models relating pre-processed metabolomic data to associated phenotypic information.

Knowledge Structure and Data Mining Techniques

Knowledge Management ◽

10.4018/978-1-59904-933-5.ch072 ◽

2011 ◽

pp. 874-882

Author(s):

Rick L. Wilson ◽

Peter A. Rosen ◽

Mohammad Saad Al-Ahmadi

Keyword(s):

Data Mining ◽

Neural Networks ◽

Inductive Learning ◽

Knowledge Structure ◽

Statistical Techniques ◽

Data Sets ◽

Data Mining Technique ◽

Data Mining Techniques ◽

Mining Technique ◽

Learning Techniques

Considerable research has been done in the recent past that compares the performance of different data mining techniques on various data sets (e.g., Lim, Low, & Shih, 2000). The goal of these studies is to determine which data mining technique performs best under which circumstances. Results are often conflicting; for instance, some articles find that neural networks (NN) outperform both traditional statistical techniques and inductive learning techniques, but the opposite is found with other datasets (Sen & Gibbs, 1994; Sung, Chang, & Lee, 1999; Spangler, May, & Vargas, 1999). Most of these studies use publicly available datasets in their analysis, and because these datasets are not artificially created, it is difficult to control for possible data characteristics in the analysis. Another drawback of these datasets is that they are usually very small.
