CancerDiscover: A configurable pipeline for cancer prediction and biomarker identification using machine learning framework

Mapping Intimacies ◽

10.1101/182998 ◽

2017 ◽

Author(s):

Akram Mohammed ◽

Greyson Biegert ◽

Jiri Adamec ◽

Tomáš Helikar

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Open Source ◽

High Throughput ◽

High Throughput Screening ◽

Learning Algorithms ◽

Supplementary Information ◽

Supplementary File ◽

Biomarker Identification ◽

Selection Algorithms

Abstract Motivation: Use of various high-throughput screening techniques has resulted in an abundance of data, whose complete utility is limited by the tools available for processing and analysis. Machine learning holds great potential for deciphering these data in the context of cancer classification and biomarker identification. However, current machine learning tools require manual processing of raw data from various sequencing platforms, which is both tedious and time-consuming. The current classification tools lack flexibility in choosing the best feature selection algorithms from a range of algorithms and, most importantly, lack the ability to compare various learning algorithms. Results: We developed CancerDiscover, an open-source software pipeline that allows users to efficiently and automatically integrate, preprocess, and normalize large high-throughput datasets and to select the best-performing features from multiple feature selection algorithms. The pipeline lets users apply various learning algorithms and generates multiple classification models and evaluation reports that distinguish cancer from normal samples, as well as different types and subtypes of cancer. Availability and Implementation: The open source pipeline is freely available for download at https://github.com/HelikarLab/[email protected] Supplementary Information: Please refer to the CancerDiscover README (Supplementary File 1) for detailed instructions on installation and operation of the pipeline. For a list of available feature selection methods, see Supplementary File 2.


A Hybrid Intelligent System Framework for the Prediction of Heart Disease Using Machine Learning Algorithms

Mobile Information Systems ◽

10.1155/2018/3860146 ◽

2018 ◽

Vol 2018 ◽

pp. 1-21 ◽

Cited By ~ 44

Author(s):

Amin Ul Haq ◽

Jian Ping Li ◽

Muhammad Hammad Memon ◽

Shah Nazir ◽

Ruinan Sun

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Performance Evaluation ◽

Heart Disease ◽

Execution Time ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Healthy People ◽

Validation Method ◽

Selection Algorithms

Heart disease is one of the most critical human diseases in the world and severely affects human life. In heart disease, the heart is unable to push the required amount of blood to other parts of the body. Accurate and timely diagnosis of heart disease is important for heart failure prevention and treatment. Diagnosis of heart disease through traditional medical history has been considered unreliable in many respects. To distinguish healthy people from people with heart disease, noninvasive methods such as machine learning are reliable and efficient. In the proposed study, we developed a machine-learning-based diagnosis system for heart disease prediction using a heart disease dataset. We used seven popular machine learning algorithms, three feature selection algorithms, the cross-validation method, and seven classifier performance evaluation metrics such as classification accuracy, specificity, sensitivity, Matthews' correlation coefficient, and execution time. The proposed system can easily identify and classify people with heart disease from healthy people. Additionally, receiver operating characteristic curves and the area under the curve were computed for each classifier. We have discussed all of the classifiers, feature selection algorithms, preprocessing methods, validation method, and classifier performance evaluation metrics used in this paper. The performance of the proposed system has been validated on the full feature set and on a reduced set of features. Feature reduction has an impact on classifier performance in terms of accuracy and execution time. The proposed machine-learning-based decision support system will assist doctors in diagnosing heart patients efficiently.
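
The evaluation metrics named in this abstract can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the authors' code; the function name `binary_metrics` and the example counts are hypothetical, and it assumes both classes are present so that no denominator is zero.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    # Matthews correlation coefficient; defined as 0 when a marginal total is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
print(metrics)
```

Unlike accuracy, the Matthews correlation coefficient uses all four counts symmetrically, which is one reason it is often reported alongside sensitivity and specificity when classes are imbalanced.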


Improving the estimation of alpine grassland fractional vegetation cover using optimized algorithms and multi-dimensional features

Plant Methods ◽

10.1186/s13007-021-00796-5 ◽

2021 ◽

Vol 17 (1) ◽

Author(s):

Xingchen Lin ◽

Jianjun Chen ◽

Peiqing Lou ◽

Shuhua Yi ◽

Yu Qin ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Vegetation Cover ◽

Learning Algorithms ◽

Vegetation Indices ◽

Machine Learning Algorithms ◽

Alpine Grassland ◽

Quantitative Monitoring ◽

Selection Algorithms ◽

Selection Of

Abstract Background Fractional vegetation cover (FVC) is an important basic parameter for the quantitative monitoring of the alpine grassland ecosystem on the Qinghai-Tibetan Plateau. Based on unmanned aerial vehicle (UAV) acquisition of measured data and matching it with satellite remote sensing images at the pixel scale, the proper selection of driving data and inversion algorithms can be determined and is crucial for generating high-precision alpine grassland FVC products. Methods This study presents estimations of alpine grassland FVC using optimized algorithms and multi-dimensional features. The multi-dimensional feature set (using original spectral bands, 22 vegetation indices, and topographical factors) was constructed from many sources of information, then the optimal feature subset was determined based on different feature selection algorithms as the driving data for optimized machine learning algorithms. Finally, the inversion accuracy, sensitivity to sample size, and computational efficiency of the four machine learning algorithms were evaluated. Results (1) The random forest (RF) algorithm (R²: 0.861, RMSE: 9.5%) performed the best for FVC inversion among the four machine learning algorithms driven by the four typical vegetation indices. (2) Compared with the four typical vegetation indices, using multi-dimensional feature sets as driving data obviously improved the FVC inversion accuracy of the four machine learning algorithms (R² of the RF algorithm increased to 0.890). (3) Among the three variable selection algorithms (Boruta, sequential forward selection [SFS], and permutation importance-recursive feature elimination [PI-RFE]), the constructed PI-RFE feature selection algorithm had the best dimensionality reduction effect on the multi-dimensional feature set. (4) The hyper-parameter optimization of the machine learning algorithms and feature selection of the multi-dimensional feature set further improved FVC inversion accuracy (R²: 0.917 and RMSE: 7.9% in the optimized RF algorithm). Conclusion This study provides a highly precise, optimized algorithm with an optimal multi-dimensional feature set for FVC inversion, which is vital for the quantitative monitoring of the ecological environment of alpine grassland.
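
The accuracy figures quoted in this abstract are a coefficient of determination and a root-mean-square error. As a rough sketch of how such figures are commonly computed, the block below assumes the usual definition of R² as one minus the ratio of residual to total sum of squares; the abstract does not say which definition the authors used. The function name `r2_rmse` and the sample values are hypothetical.

```python
import math


def r2_rmse(observed, predicted):
    """Return (R^2, RMSE) for paired observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    # Residual sum of squares: squared prediction errors.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse


# Hypothetical FVC values in percent, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
r2, rmse = r2_rmse(observed, predicted)
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

RMSE is expressed in the same units as the target variable (here, percent cover), whereas R² is unitless, so the two are usually reported together.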


Feature Selection Algorithms as One of the Python Data Analytical Tools

Future Internet ◽

10.3390/fi12030054 ◽

2020 ◽

Vol 12 (3) ◽

pp. 54

Author(s):

Nikita Pilnenskiy ◽

Ivan Smetannikov

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Open Source ◽

Information Technologies ◽

Current Trend ◽

Modern State ◽

Machine Learning Applications ◽

Selection Algorithms ◽

Python Programming ◽

Analytical Tools

With the rapidly growing popularity of the Python programming language for machine learning applications, the gap between the needs of machine learning engineers and existing Python tools is increasing. This is especially noticeable in more classical machine learning fields such as feature selection, because community attention over the last decade has mainly shifted to neural networks. This paper has two main purposes. First, we give an overview of existing open-source Python and Python-compatible feature selection libraries, show their problems, if any, and demonstrate the gap between these libraries and the modern state of the feature selection field. Then, we present a new open-source, scikit-learn compatible ITMO FS (Information Technologies, Mechanics and Optics University feature selection) library that is currently under development, explain how its architecture covers modern views on feature selection, provide some code examples showing how to use it with Python, and compare its performance with other Python feature selection libraries.


Critical Review of Processing and Classification Techniques for Images and Spectra in Microplastic Research

Applied Spectroscopy ◽

10.1177/0003702820929064 ◽

2020 ◽

Vol 74 (9) ◽

pp. 989-1010 ◽

Cited By ~ 7

Author(s):

Win Cowger ◽

Andrew Gray ◽

Silke H. Christiansen ◽

Hannah DeFrond ◽

Ashok D. Deshpande ◽

...

Keyword(s):

Machine Learning ◽

Image Analysis ◽

High Throughput ◽

High Throughput Screening ◽

Supplementary Information ◽

Gas Chromatography Mass Spectrometry ◽

Spectral Processing ◽

Double Blind ◽

Pyrolysis Gas ◽

Machine Learning Classification

Microplastic research is a rapidly developing field, with urgent needs for high throughput and automated analysis techniques. We conducted a review covering image analysis from optical microscopy, scanning electron microscopy, fluorescence microscopy, and spectral analysis from Fourier transform infrared (FT-IR) spectroscopy, Raman spectroscopy, pyrolysis gas chromatography–mass spectrometry, and energy dispersive X-ray spectroscopy. These techniques were commonly used to collect, process, and interpret data from microplastic samples. This review outlines and critiques current approaches for analysis steps in image processing (color, thresholding, particle quantification), spectral processing (background and baseline subtraction, smoothing and noise reduction, data transformation), image classification (reference libraries, morphology, color, and fluorescence intensity), and spectral classification (reference libraries, matching procedures, and best practices for developing in-house reference tools). We highlighted opportunities to advance microplastic data analysis and interpretation by (i) quantifying colors, shapes, sizes, and surface topologies with image analysis software, (ii) identifying threshold values of particle characteristics in images that distinguish plastic particles from other particles, (iii) advancing spectral processing and classification routines, (iv) creating and sharing robust spectral libraries, (v) conducting double blind and negative controls, (vi) sharing raw data and analysis code, and (vii) leveraging readily available data to develop machine learning classification models. We identified analytical needs that we could fill and developed supplementary information for a reference library of plastic images and spectra, a tutorial for basic image analysis, and code to download images from peer-reviewed literature. Our major findings were that research on microplastics was progressing toward the use of multiple analytical methods and increasingly incorporating chemical classification. We suggest that new and repurposed methods need to be developed for high throughput screening using a diversity of approaches and highlight machine learning as one potential avenue toward this capability.


Sentiment Analysis of Movie Reviews: A Study of Machine Learning Algorithms with Various Feature Selection Methods

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v5i9.113121 ◽

2017 ◽

Vol 5 (9) ◽

Cited By ~ 1

Author(s):

Rajwinder Kaur

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Sentiment Analysis ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Selection Methods


Accelerating organic solar cell material's discovery: high-throughput screening and big data

Energy & Environmental Science ◽

10.1039/d1ee00559f ◽

2021 ◽

Author(s):

Xabier Rodríguez-Martínez ◽

Enrique Pascual-San-José ◽

Mariano Campoy-Quiles

Keyword(s):

Machine Learning ◽

Big Data ◽

High Throughput ◽

Organic Solar Cells ◽

High Throughput Screening ◽

Organic Solar Cell ◽

State Of The Art ◽

Review Article ◽

Machine Learning Algorithms ◽

Device Optimization

This review article presents the state-of-the-art in high-throughput computational and experimental screening routines with application in organic solar cells, including materials discovery, device optimization and machine-learning algorithms.


Feature Selection with Fast Correlation-Based Filter for Breast Cancer Prediction and Classification Using Machine Learning Algorithms

2018 International Symposium on Advanced Electrical and Communication Technologies (ISAECT) ◽

10.1109/isaect.2018.8618688 ◽

2018 ◽

Author(s):

Youness Khourdifi ◽

Mohamed Bahaj

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Feature Selection ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Cancer Prediction


Comparative study on total nitrogen prediction in wastewater treatment plant and effect of various feature selection methods on machine learning algorithms performance

Journal of Water Process Engineering ◽

10.1016/j.jwpe.2021.102033 ◽

2021 ◽

Vol 41 ◽

pp. 102033

Author(s):

Faramarz Bagherzadeh ◽

Mohamad-Javad Mehrani ◽

Milad Basirifard ◽

Javad Roostaei

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Wastewater Treatment ◽

Comparative Study ◽

Total Nitrogen ◽

Wastewater Treatment Plant ◽

Learning Algorithms ◽

Treatment Plant ◽

Machine Learning Algorithms ◽

Selection Methods


A deep neural network model for packing density predictions and its application in the study of 1.5 million organic molecules

Chemical Science ◽

10.1039/c9sc02677k ◽

2019 ◽

Vol 10 (36) ◽

pp. 8374-8383 ◽

Cited By ~ 1

Author(s):

Mohammad Atif Faiz Afzal ◽

Aditya Sonpal ◽

Mojtaba Haghighatlari ◽

Andrew J. Schultz ◽

Johannes Hachmann

Keyword(s):

Neural Network ◽

Machine Learning ◽

Refractive Index ◽

High Throughput ◽

Neural Network Model ◽

High Throughput Screening ◽

Deep Neural Network ◽

Organic Molecules ◽

High Refractive Index ◽

Computational Pipeline

Computational pipeline for the accelerated discovery of organic materials with high refractive index via high-throughput screening and machine learning.


Accurate estimation of isoelectric point of protein and peptide based on amino acid sequences

Bioinformatics ◽

10.1093/bioinformatics/btv674 ◽

2015 ◽

Vol 32 (6) ◽

pp. 821-827 ◽

Cited By ~ 19

Author(s):

Enrique Audain ◽

Yassel Ramos ◽

Henning Hermjakob ◽

Darren R. Flower ◽

Yasset Perez-Riverol

Keyword(s):

Machine Learning ◽

Isoelectric Point ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Basis Set ◽

Superior Performance ◽

Supplementary Information ◽

Training Dataset ◽

Accurate Estimation ◽

Prediction Methods

Abstract Motivation: In any macromolecular polyprotic system (for example protein, DNA or RNA), the isoelectric point, commonly referred to as the pI, can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge, and thus the electrophoretic mobility, of the ampholyte sums to zero. Different modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Protein separation by isoelectric point is a critical part of 2-D gel electrophoresis, a key precursor of proteomics, where discrete spots can be digested in-gel, and proteins subsequently identified by analytical mass spectrometry. Peptide fractionation according to pI is also widely used in current proteomics sample preparation procedures prior to LC-MS/MS analysis. Therefore, accurate theoretical prediction of pI would expedite such analysis. While such pI calculation is widely used, it remains largely untested, motivating our efforts to benchmark pI prediction methods. Results: Using data from the database PIP-DB and one publicly available dataset as our reference gold standard, we have undertaken the benchmarking of pI calculation methods. We find that methods vary in their accuracy and are highly sensitive to the choice of basis set. The machine-learning algorithms, especially the SVM-based algorithm, showed superior performance when studying peptide mixtures. In general, learning-based pI prediction methods (such as Cofactor, SVM and Branca) require a large training dataset, and their resulting performance will strongly depend on the quality of that data. In contrast with iterative methods, machine-learning algorithms have the advantage of being able to add new features to improve the accuracy of prediction. Contact: [email protected] Availability and Implementation: The software and data are freely available at https://github.com/ypriverol/pIR. Supplementary information: Supplementary data are available at Bioinformatics online.
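
The definition of pI given in this abstract (the pH at which net charge sums to zero) suggests a simple iterative estimate: compute the net charge from Henderson-Hasselbalch terms and bisect on pH. The sketch below illustrates that general idea only; it is not the pIR software or any method benchmarked in the paper. The pKa values are illustrative assumptions, and, as the abstract itself notes, results are sensitive to the pKa set chosen. The names `net_charge` and `isoelectric_point` and the example peptides are hypothetical.

```python
# Illustrative pKa values (an assumption for this sketch; published sets differ).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH from Henderson-Hasselbalch terms."""
    # Free N-terminus contributes positive charge, free C-terminus negative.
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for residue in sequence:
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence, low=0.0, high=14.0, tol=1e-4):
    """Bisect on pH; net charge decreases monotonically as pH rises."""
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid  # negatively charged or neutral: pI lies at lower pH
    return (low + high) / 2.0


# Hypothetical peptides: one rich in basic residues, one rich in acidic residues.
pi_basic = isoelectric_point("AKKRG")
pi_acidic = isoelectric_point("ADEEG")
print(f"basic peptide pI ~ {pi_basic:.2f}, acidic peptide pI ~ {pi_acidic:.2f}")
```

Bisection works here because each charge term is a monotonic function of pH, so the net charge crosses zero exactly once between pH 0 and pH 14.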
