CancerDiscover: A configurable pipeline for cancer prediction and biomarker identification using machine learning framework

2017 ◽  
Author(s):  
Akram Mohammed ◽  
Greyson Biegert ◽  
Jiri Adamec ◽  
Tomáš Helikar

Abstract Motivation: Use of various high-throughput screening techniques has resulted in an abundance of data, whose complete utility is limited by the tools available for processing and analysis. Machine learning holds great potential for deciphering these data in the context of cancer classification and biomarker identification. However, current machine learning tools require manual processing of raw data from various sequencing platforms, which is both tedious and time-consuming. Current classification tools also lack flexibility in choosing the best feature selection algorithm from a range of algorithms and, most importantly, cannot compare various learning algorithms. Results: We developed CancerDiscover, an open-source software pipeline that allows users to efficiently and automatically integrate large high-throughput datasets, preprocess and normalize them, and select the best-performing features from multiple feature selection algorithms. The pipeline lets users apply various learning algorithms and generates multiple classification models and evaluation reports that distinguish cancer from normal samples, as well as different types and subtypes of cancer. Availability and Implementation: The open source pipeline is freely available for download at https://github.com/HelikarLab/[email protected] Supplementary Information: Please refer to the CancerDiscover README (Supplementary File 1) for detailed instructions on installation and operation of the pipeline. For a list of available feature selection methods, see Supplementary File 2.

2018 ◽  
Vol 2018 ◽  
pp. 1-21 ◽  
Author(s):  
Amin Ul Haq ◽  
Jian Ping Li ◽  
Muhammad Hammad Memon ◽  
Shah Nazir ◽  
Ruinan Sun

Heart disease is one of the most critical human diseases in the world and severely affects human life. In heart disease, the heart is unable to push the required amount of blood to other parts of the body. Accurate and timely diagnosis of heart disease is important for heart failure prevention and treatment. Diagnosis of heart disease through traditional medical history has been considered unreliable in many respects. To distinguish healthy people from people with heart disease, noninvasive methods such as machine learning are reliable and efficient. In the proposed study, we developed a machine-learning-based diagnosis system for heart disease prediction using a heart disease dataset. We used seven popular machine learning algorithms, three feature selection algorithms, the cross-validation method, and seven classifier performance evaluation metrics such as classification accuracy, specificity, sensitivity, Matthews’ correlation coefficient, and execution time. The proposed system can easily identify and classify people with heart disease from healthy people. Additionally, receiver operating characteristic curves and the area under the curve for each classifier were computed. We discuss all of the classifiers, feature selection algorithms, preprocessing methods, validation method, and classifier performance evaluation metrics used in this paper. The performance of the proposed system has been validated on the full feature set and on a reduced set of features. Feature reduction affects classifier performance in terms of accuracy and execution time. The proposed machine-learning-based decision support system will assist doctors in diagnosing heart patients efficiently.
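
The evaluation metrics named in this abstract can be illustrated with a minimal sketch. It assumes a binary task (1 = heart disease, 0 = healthy) and uses made-up labels; the function names and data are hypothetical and are not taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity, specificity, and Matthews correlation."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    # Each ratio falls back to 0.0 when its denominator is empty.
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Sensitivity and specificity are reported separately because accuracy alone can hide a classifier that misses most patients in an imbalanced dataset; the Matthews correlation coefficient summarizes all four confusion-matrix cells in one value.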


Plant Methods ◽  
2021 ◽  
Vol 17 (1) ◽  
Author(s):  
Xingchen Lin ◽  
Jianjun Chen ◽  
Peiqing Lou ◽  
Shuhua Yi ◽  
Yu Qin ◽  
...  

Abstract Background: Fractional vegetation cover (FVC) is an important basic parameter for the quantitative monitoring of the alpine grassland ecosystem on the Qinghai-Tibetan Plateau. Matching measured data acquired by unmanned aerial vehicle (UAV) with satellite remote sensing images at the pixel scale makes it possible to determine the proper driving data and inversion algorithms, a selection that is crucial for generating high-precision alpine grassland FVC products. Methods: This study presents estimations of alpine grassland FVC using optimized algorithms and multi-dimensional features. The multi-dimensional feature set (using original spectral bands, 22 vegetation indices, and topographical factors) was constructed from many sources of information, then the optimal feature subset was determined based on different feature selection algorithms as the driving data for optimized machine learning algorithms. Finally, the inversion accuracy, sensitivity to sample size, and computational efficiency of the four machine learning algorithms were evaluated. Results: (1) The random forest (RF) algorithm (R²: 0.861, RMSE: 9.5%) performed the best for FVC inversion among the four machine learning algorithms driven by the four typical vegetation indices. (2) Compared with the four typical vegetation indices, using multi-dimensional feature sets as driving data clearly improved the FVC inversion accuracy of the four machine learning algorithms (R² of the RF algorithm increased to 0.890). (3) Among the three variable selection algorithms (Boruta, sequential forward selection [SFS], and permutation importance-recursive feature elimination [PI-RFE]), the constructed PI-RFE feature selection algorithm had the best dimensionality reduction effect on the multi-dimensional feature set. (4) The hyper-parameter optimization of the machine learning algorithms and feature selection of the multi-dimensional feature set further improved FVC inversion accuracy (R²: 0.917 and RMSE: 7.9% in the optimized RF algorithm). Conclusion: This study provides a highly precise, optimized algorithm with an optimal multi-dimensional feature set for FVC inversion, which is vital for the quantitative monitoring of the ecological environment of alpine grassland.
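
The two accuracy measures reported in this abstract can be sketched as follows. This is a minimal illustration that assumes R² denotes the coefficient of determination (the abstract does not say whether it is that or a squared correlation); the FVC values are invented, not data from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical FVC values in percent, for illustration only.
observed_fvc = [10.0, 20.0, 30.0, 40.0]
predicted_fvc = [12.0, 18.0, 33.0, 39.0]

fvc_rmse = rmse(observed_fvc, predicted_fvc)
fvc_r2 = r_squared(observed_fvc, predicted_fvc)
print(f"RMSE = {fvc_rmse:.3f} %, R2 = {fvc_r2:.3f}")
```

RMSE is expressed in the same unit as FVC (percent cover), which is why the abstract quotes it as a percentage, while R² is unitless.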


2020 ◽  
Vol 12 (3) ◽  
pp. 54
Author(s):  
Nikita Pilnenskiy ◽  
Ivan Smetannikov

With the current trend of rapidly growing popularity of the Python programming language for machine learning applications, the gap between machine learning engineer needs and existing Python tools increases. Especially, it is noticeable for more classical machine learning fields, namely, feature selection, as the community attention in the last decade has mainly shifted to neural networks. This paper has two main purposes. First, we perform an overview of existing open-source Python and Python-compatible feature selection libraries, show their problems, if any, and demonstrate the gap between these libraries and the modern state of feature selection field. Then, we present new open-source scikit-learn compatible ITMO FS (Information Technologies, Mechanics and Optics University feature selection) library that is currently under development, explain how its architecture covers modern views on feature selection, and provide some code examples on how to use it with Python and its performance compared with other Python feature selection libraries.


2020 ◽  
Vol 74 (9) ◽  
pp. 989-1010 ◽  
Author(s):  
Win Cowger ◽  
Andrew Gray ◽  
Silke H. Christiansen ◽  
Hannah DeFrond ◽  
Ashok D. Deshpande ◽  
...  

Microplastic research is a rapidly developing field, with urgent needs for high throughput and automated analysis techniques. We conducted a review covering image analysis from optical microscopy, scanning electron microscopy, fluorescence microscopy, and spectral analysis from Fourier transform infrared (FT-IR) spectroscopy, Raman spectroscopy, pyrolysis gas chromatography–mass spectrometry, and energy dispersive X-ray spectroscopy. These techniques were commonly used to collect, process, and interpret data from microplastic samples. This review outlines and critiques current approaches for analysis steps in image processing (color, thresholding, particle quantification), spectral processing (background and baseline subtraction, smoothing and noise reduction, data transformation), image classification (reference libraries, morphology, color, and fluorescence intensity), and spectral classification (reference libraries, matching procedures, and best practices for developing in-house reference tools). We highlighted opportunities to advance microplastic data analysis and interpretation by (i) quantifying colors, shapes, sizes, and surface topologies with image analysis software, (ii) identifying threshold values of particle characteristics in images that distinguish plastic particles from other particles, (iii) advancing spectral processing and classification routines, (iv) creating and sharing robust spectral libraries, (v) conducting double blind and negative controls, (vi) sharing raw data and analysis code, and (vii) leveraging readily available data to develop machine learning classification models. We identified analytical needs that we could fill and developed supplementary information for a reference library of plastic images and spectra, a tutorial for basic image analysis, and code to download images from peer-reviewed literature. Our major findings were that research on microplastics was progressing toward the use of multiple analytical methods and increasingly incorporating chemical classification. We suggest that new and repurposed methods need to be developed for high throughput screening using a diversity of approaches and highlight machine learning as one potential avenue toward this capability.
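
The spectral processing and reference-library matching steps described above can be sketched in a few lines. This is a simplified illustration, not the review's procedure: it assumes a constant baseline, uses cosine similarity as the matching score, and relies on a tiny synthetic library whose names and intensities are invented.

```python
import math


def subtract_baseline(spectrum):
    """Shift intensities so the minimum is zero (crude constant baseline)."""
    floor = min(spectrum)
    return [value - floor for value in spectrum]


def cosine_similarity(a, b):
    """Cosine of the angle between two intensity vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, library):
    """Return the library entry most similar to the query spectrum."""
    cleaned = subtract_baseline(query)
    scores = {
        name: cosine_similarity(cleaned, subtract_baseline(reference))
        for name, reference in library.items()
    }
    name = max(scores, key=scores.get)
    return name, scores[name]


# Synthetic five-point "spectra", for illustration only.
reference_library = {
    "polyethylene_like": [0.1, 0.9, 0.2, 0.1, 0.8],
    "polystyrene_like": [0.7, 0.1, 0.6, 0.9, 0.1],
}
# Query resembling the first entry, with an added constant background.
query_spectrum = [1.1, 1.85, 1.2, 1.15, 1.8]

match_name, match_score = best_match(query_spectrum, reference_library)
print(match_name, round(match_score, 3))
```

Cosine similarity is sensitive to a constant offset, so the baseline step changes the score here; real spectra typically need a sloped or polynomial baseline and smoothing before matching.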


Author(s):  
Xabier Rodríguez-Martínez ◽  
Enrique Pascual-San-José ◽  
Mariano Campoy-Quiles

This review article presents the state-of-the-art in high-throughput computational and experimental screening routines with application in organic solar cells, including materials discovery, device optimization and machine-learning algorithms.


2019 ◽  
Vol 10 (36) ◽  
pp. 8374-8383 ◽  
Author(s):  
Mohammad Atif Faiz Afzal ◽  
Aditya Sonpal ◽  
Mojtaba Haghighatlari ◽  
Andrew J. Schultz ◽  
Johannes Hachmann

Computational pipeline for the accelerated discovery of organic materials with high refractive index via high-throughput screening and machine learning.


2015 ◽  
Vol 32 (6) ◽  
pp. 821-827 ◽  
Author(s):  
Enrique Audain ◽  
Yassel Ramos ◽  
Henning Hermjakob ◽  
Darren R. Flower ◽  
Yasset Perez-Riverol

Abstract Motivation: In any macromolecular polyprotic system (for example protein, DNA or RNA), the isoelectric point, commonly referred to as the pI, can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge, and thus the electrophoretic mobility, of the ampholyte sums to zero. Different modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Protein separation by isoelectric point is a critical part of 2-D gel electrophoresis, a key precursor of proteomics, where discrete spots can be digested in-gel, and proteins subsequently identified by analytical mass spectrometry. Peptide fractionation according to pI is also widely used in current proteomics sample preparation procedures prior to LC-MS/MS analysis. Therefore, accurate theoretical prediction of pI would expedite such analysis. While such pI calculation is widely used, it remains largely untested, motivating our efforts to benchmark pI prediction methods. Results: Using data from the database PIP-DB and one publicly available dataset as our reference gold standard, we have undertaken the benchmarking of pI calculation methods. We find that methods vary in their accuracy and are highly sensitive to the choice of basis set. The machine-learning algorithms, especially the SVM-based algorithm, showed superior performance when studying peptide mixtures. In general, learning-based pI prediction methods (such as Cofactor, SVM and Branca) require a large training dataset, and their resulting performance will strongly depend on the quality of that data. In contrast to iterative methods, machine-learning algorithms have the advantage of being able to add new features to improve the accuracy of prediction.
Contact: [email protected] Availability and Implementation: The software and data are freely available at https://github.com/ypriverol/pIR. Supplementary information: Supplementary data are available at Bioinformatics online.
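
The iterative approach to pI calculation mentioned in this abstract can be sketched as a bisection search for the pH at which the Henderson-Hasselbalch net charge is zero. This is a minimal illustration, not the pIR package's implementation; the pKa values below are illustrative assumptions rather than one of the sets benchmarked in the paper.

```python
# Illustrative pKa values (assumed for this sketch only).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    # Free termini: the amino group is positive, the carboxyl negative.
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for residue in sequence:
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence, low=0.0, high=14.0, tolerance=1e-4):
    """Bisect on pH until the net charge changes sign within tolerance."""
    # Net charge decreases monotonically with pH, so bisection converges.
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Hypothetical peptides: one rich in basic, one in acidic residues.
pi_basic = isoelectric_point("KKRK")
pi_acidic = isoelectric_point("DDED")
print(f"pI(KKRK) = {pi_basic:.2f}, pI(DDED) = {pi_acidic:.2f}")
```

Swapping in a different pKa table changes the predicted pI, which is the sensitivity to the choice of basis set that the abstract reports for such methods.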

