The Training Set Selection Methods of microRNA Precursors Prediction Based on Machine Learning Approaches

A Hybrid Alchemical Free Energy and Machine Learning Methodology for the Calculation of Absolute Hydration Free Energies of Small Molecules

DOI 10.26434/chemrxiv.12380612, 2020

Author(s): Jenke Scheen, Wilson Wu, Antonia S. J. S. Mey, Paolo Tosco, Mark Mackey, ...

Keyword(s): Machine Learning, Free Energy, Free Energy Calculations, Learning Approaches, Free Energies, Training Set, Set Size, Correction Terms, Alchemical Free Energy, Alchemical Free Energy Calculations

A methodology that combines alchemical free energy calculations (FEP) with machine learning (ML) has been developed to compute accurate absolute hydration free energies. The hybrid FEP/ML methodology was trained on a subset of the FreeSolv database, and retrospectively shown to outperform most submissions from the SAMPL4 competition. Compared to pure machine-learning approaches, FEP/ML yields more precise estimates of free energies of hydration, and requires a fraction of the training set size to outperform standalone FEP calculations. The ML-derived correction terms are further shown to be transferable to a range of related FEP simulation protocols. The approach may be used to inexpensively improve the accuracy of FEP calculations, and to flag molecules which will benefit the most from bespoke forcefield parameterisation efforts.
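The hybrid idea described above can be illustrated as learning an offset between FEP estimates and experimental values, then adding that offset back to new FEP results. The sketch below is not the authors' pipeline: the data are synthetic, and the descriptor matrix `X`, the ridge penalty `lam`, and the function names `fit_correction` and `apply_correction` are assumptions made purely for illustration.

```python
import numpy as np

def _with_bias(X):
    """Append a constant column so the model can learn a global offset."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_correction(X, dg_fep, dg_exp, lam=1.0):
    """Ridge regression (closed form) predicting the offset dg_exp - dg_fep."""
    Xb = _with_bias(X)
    offset = dg_exp - dg_fep
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ offset)

def apply_correction(X, dg_fep, w):
    """Hybrid estimate: FEP value plus the learned correction term."""
    return dg_fep + _with_bias(X) @ w

# Synthetic stand-in data (assumption): FEP carries a descriptor-dependent error.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
dg_exp = rng.normal(-5.0, 2.0, size=n)
dg_fep = dg_exp + 0.8 * X[:, 0] - 0.5 + rng.normal(0.0, 0.1, size=n)

# Fit the correction on one subset and evaluate on held-out molecules.
train, test = slice(0, 150), slice(150, n)
w = fit_correction(X[train], dg_fep[train], dg_exp[train])
dg_hybrid = apply_correction(X[test], dg_fep[test], w)

mae_fep = np.mean(np.abs(dg_fep[test] - dg_exp[test]))
mae_hybrid = np.mean(np.abs(dg_hybrid - dg_exp[test]))
print(f"MAE FEP only: {mae_fep:.2f}  MAE FEP + correction: {mae_hybrid:.2f}")
```

Because the model only has to learn the residual error rather than the full hydration free energy, a small training set can be enough, which is consistent with the abstract's remark about training set size.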


To use or not to use: Feature selection for sentiment analysis of highly imbalanced data

Natural Language Engineering, DOI 10.1017/s1351324917000298, 2017, Vol 24 (1), pp. 3-37, Cited By ~ 5

Author(s): Sandra Kübler, Can Liu, Zeeshan Ali Sayyed

Keyword(s): Machine Learning, Feature Selection, Sentiment Analysis, Information Gain, Binary Classification, Small Subset, Large Set, Learning Approaches, Selection Methods, Data Set

We investigate feature selection methods for machine learning approaches in sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. Machine learning approaches to such tasks commonly use word or part-of-speech n-grams. This results in a large set of features, of which only a small subset may be good indicators of sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 percentage points.
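Multi-class information gain scores a feature by how much knowing it reduces the entropy of the full label distribution, without first splitting the task into binary sub-problems. The following is a minimal sketch of that textbook definition for a binary (present/absent) feature; the toy ratings and the feature vectors are invented for illustration and are not taken from the Epicurious data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    """Information gain of a binary feature for labels with any number of classes."""
    n = len(labels)
    with_f = [y for f, y in zip(feature_present, labels) if f]
    without_f = [y for f, y in zip(feature_present, labels) if not f]
    # Weighted entropy of the labels after splitting on the feature.
    conditional = sum(len(part) / n * entropy(part)
                      for part in (with_f, without_f) if part)
    return entropy(labels) - conditional

# Toy example (assumption): four rating classes, two candidate n-gram features.
ratings = [1, 1, 2, 2, 3, 3, 4, 4]
informative = [0, 0, 0, 0, 1, 1, 1, 1]   # appears only in high ratings
uninformative = [1, 0, 1, 0, 1, 0, 1, 0]  # appears equally in every class

ig_informative = information_gain(informative, ratings)
ig_uninformative = information_gain(uninformative, ratings)
print(ig_informative, ig_uninformative)
```

Ranking all n-gram features by this score and keeping the top few is one common way to reduce a large feature set to a small informative subset.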


New Entropy Based Distance for Training Set Selection in Debt Portfolio Valuation

International Journal of Information Technology and Web Engineering, DOI 10.4018/jitwe.2012040105, 2012, Vol 7 (2), pp. 60-69

Author(s): Tomasz Kajdanowicz, Slawomir Plamowski, Przemyslaw Kazienko

Keyword(s): Machine Learning, Distance Measure, Prediction Performance, Training Set, Learning Tasks, Real Domain, Valuation Process, Proper Training, Training Set Selection, Portfolio Valuation

Choosing a proper training set for machine learning tasks is of great importance in complex domain problems. This paper presents and thoroughly discusses a new distance measure for training set selection. The distance between two datasets is computed using the variance of entropy in groups obtained after clustering. The approach is validated using real-domain datasets from the debt portfolio valuation process. Finally, prediction performance is examined.


A Hybrid Vision-Map Method for Urban Road Detection

Journal of Advanced Transportation, DOI 10.1155/2017/7090549, 2017, Vol 2017, pp. 1-21, Cited By ~ 6

Author(s): Carlos Fernández, David Fernández-Llorca, Miguel A. Sotelo

Keyword(s): Machine Learning, Urban Environments, Machine Learning Techniques, Learning Approaches, Classification Problems, Road Detection, Training Set, Digital Maps, The Road, Learning Techniques

A hybrid vision-map system is presented to solve the road detection problem in urban scenarios. The standardized use of machine learning techniques in classification problems has been merged with digital navigation map information to increase system robustness. The objective of this paper is to create a new environment perception method to detect the road in urban environments, fusing stereo vision with digital maps by detecting road appearance and road limits such as lane markings or curbs. Deep learning approaches make the system tightly coupled to the training set. Even though our approach is based on machine learning techniques, the features are calculated from different sources (GPS, map, curbs, etc.), making our system less dependent on the training set.


Diagnostic Performance of 2D and 3D T2WI-Based Radiomics Features With Machine Learning Algorithms to Distinguish Solid Solitary Pulmonary Lesion

Frontiers in Oncology, DOI 10.3389/fonc.2021.683587, 2021, Vol 11

Author(s): Qi Wan, Jiaxuan Zhou, Xiaoying Xia, Jianfeng Hu, Peng Wang, ...

Keyword(s): Machine Learning, Feature Selection, Diagnostic Performance, Feature Selection Method, Machine Learning Algorithms, Support Vector, Learning Approaches, Selection Methods, Linear Discriminant, 2D And 3D

Objective: To evaluate the performance of 2D and 3D radiomics features with different machine learning approaches to classify solid solitary pulmonary lesions (SPLs) based on magnetic resonance (MR) T2-weighted imaging (T2WI). Materials and Methods: A total of 132 patients with pathologically confirmed SPLs were examined and randomly divided into training (n = 92) and test (n = 40) datasets. A total of 1692 3D and 1231 2D radiomics features per patient were extracted. Both radiomics features and clinical data were evaluated. A total of 1260 classification models, comprising 3 normalization methods, 2 dimension reduction algorithms, 3 feature selection methods, and 10 classifiers with 7 different feature numbers (confined to 3–9), were compared. Ten-fold cross-validation on the training dataset was applied to choose the candidate final model. The area under the receiver operating characteristic curve (AUC), the precision-recall plot, and the Matthews correlation coefficient (MCC) were used to evaluate the performance of the machine learning approaches. Results: The 3D features were significantly superior to the 2D features, showing many more machine learning combinations with AUC greater than 0.7 in both validation and test groups (129 vs. 11). The feature selection methods analysis of variance (ANOVA) and recursive feature elimination (RFE), and the classifiers logistic regression (LR), linear discriminant analysis (LDA), support vector machine (SVM), and Gaussian process (GP), had relatively better performance. The best performance of 3D radiomics features in the test dataset (AUC = 0.824, AUC-PR = 0.927, MCC = 0.514) was higher than that of 2D features (AUC = 0.740, AUC-PR = 0.846, MCC = 0.404). The joint 3D and 2D features (AUC = 0.813, AUC-PR = 0.926, MCC = 0.563) showed results similar to those of the 3D features. Incorporating clinical features with 3D and 2D radiomics features slightly improved the AUC to 0.836 (AUC-PR = 0.918, MCC = 0.620) and 0.780 (AUC-PR = 0.900, MCC = 0.574), respectively. Conclusions: After algorithm optimization, 2D feature-based radiomics models yield favorable results in differentiating malignant and benign SPLs, but 3D features are still preferred because more machine learning algorithmic combinations with better performance are available. The feature selection methods ANOVA and RFE, and the classifiers LR, LDA, SVM, and GP, are more likely to demonstrate better diagnostic performance for 3D features in the current study.
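The figure of 1260 models follows from multiplying the option counts given in the abstract (3 × 2 × 3 × 10 × 7). The short sketch below enumerates that grid and also shows the standard confusion-matrix formula for the Matthews correlation coefficient. The abstract does not name the individual normalization or dimension reduction methods, so integer placeholders stand in for them; the `mcc` helper is an illustrative implementation, not code from the study.

```python
from itertools import product
from math import sqrt

# Placeholder option lists: only the counts come from the abstract.
normalizations = range(3)
dimension_reducers = range(2)
feature_selectors = range(3)
classifiers = range(10)
feature_counts = range(3, 10)  # 3 to 9 inclusive, i.e. 7 values

pipelines = list(product(normalizations, dimension_reducers,
                         feature_selectors, classifiers, feature_counts))
n_models = len(pipelines)
print(n_models)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 to 1 and stays informative when classes are imbalanced, which is one reason it is often reported alongside AUC.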


Predicting clinically significant motor function improvement after contemporary task-oriented interventions using machine learning approaches

Journal of NeuroEngineering and Rehabilitation, DOI 10.1186/s12984-020-00758-3, 2020, Vol 17 (1)

Author(s): Hiren Kumar Thakkar, Wan-wen Liao, Ching-yi Wu, Yu-Wei Hsieh, Tsong-Hai Lee

Keyword(s): Machine Learning, Motor Function, Chronic Stroke, Learning Approaches, Stroke Patients, Training Set, Data Set, Clinically Significant, Task Oriented, Significant Motor

Background: Accurate prediction of motor recovery after stroke is critical for treatment decisions and planning. Machine learning has been proposed as a promising technique for outcome prediction because of its high accuracy and ability to process large volumes of data. It has been used to predict acute stroke recovery; however, whether machine learning would be effective for predicting rehabilitation outcomes in chronic stroke patients for common contemporary task-oriented interventions remains largely unexplored. This study aimed to determine the accuracy and performance of machine learning in predicting clinically significant motor function improvements after contemporary task-oriented intervention in chronic stroke patients and to identify important predictors for building machine learning prediction models. Methods: This study was a secondary analysis of data using two common machine learning approaches, k-nearest neighbor (KNN) and artificial neural network (ANN). Chronic stroke patients (N = 239) who received 30 h of task-oriented training, including constraint-induced movement therapy, bilateral arm training, robot-assisted therapy, and mirror therapy, were included. The Fugl-Meyer assessment scale (FMA) was the main outcome. Potential predictors included age, gender, side of lesion, time since stroke, baseline functional status, motor function, and quality of life. We divided the data set into a training set and a test set and used cross-validation to construct machine learning models based on the training set. After the models were built, we used the test data set to evaluate the accuracy and prediction performance of the models. Results: Three important predictors were identified: time since stroke, baseline functional independence measure (FIM), and baseline FMA scores. Models for predicting motor function improvements were accurate. The prediction accuracy of the KNN model was 85.42% and the area under the receiver operating characteristic curve (AUC-ROC) was 0.89. The prediction accuracy of the ANN model was 81.25% and the AUC-ROC was 0.77. Conclusions: Incorporating machine learning into clinical outcome prediction using three key predictors (time since stroke, baseline functional ability, and baseline motor ability) may help clinicians and therapists identify patients who are most likely to benefit from contemporary task-oriented interventions. The KNN and ANN models may be useful for predicting clinically significant motor recovery in chronic stroke.
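The train/test workflow described in the Methods can be sketched with a small k-nearest-neighbor classifier. Everything below is illustrative: the three predictor columns are random numbers standing in for time since stroke, baseline FIM, and baseline FMA, the labels are synthetic, and the 80/20 split and `k=5` are assumptions, since the abstract does not state the split ratio or hyperparameters. Only the cohort size of 239 is taken from the abstract.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Classify each test row by majority vote among its k nearest training rows."""
    preds = []
    for x in X_test:
        dist = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
        nearest = y_train[np.argsort(dist)[:k]]      # labels of the k closest rows
        preds.append(np.bincount(nearest).argmax())  # majority class
    return np.array(preds)

rng = np.random.default_rng(42)
n = 239                                  # cohort size from the abstract
X = rng.normal(size=(n, 3))              # synthetic stand-ins for the 3 predictors
y = (X.sum(axis=1) > 0).astype(int)      # synthetic "improved / not improved" label

# Hold out a test set (assumed 80/20 split) and fit on the rest.
idx = rng.permutation(n)
split = int(0.8 * n)
train, test = idx[:split], idx[split:]

pred = knn_predict(X[train], y[train], X[test], k=5)
accuracy = (pred == y[test]).mean()
print(f"test accuracy: {accuracy:.2%}")
```

In practice the predictors would be standardized before computing distances, and `k` would be chosen by cross-validation on the training set, as the Methods section describes.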


Using Machine Learning via Deep Learning Algorithms to Diagnose the Lung Disease Based on Chest Imaging: A Survey

International Journal of Interactive Mobile Technologies (iJIM), DOI 10.3991/ijim.v15i16.24191, 2021, Vol 15 (16), pp. 95

Author(s): Shaymaa Taha Ahmed, Suhad Malallah Kadhem

Keyword(s): Machine Learning, Deep Learning, Lung Disease, Lung Diseases, Learning Algorithms, Learning Approaches, Chest Imaging, Training Set, X Ray, State Of Art

Chest imaging diagnostics is crucial in medicine because of the many serious lung diseases, such as cancers and nodules, and particularly because of the current Covid-19 pandemic. Machine learning approaches yield prominent results in the task of diagnosis. Recently, deep learning methods have been utilized and recommended by many studies in this domain. This research aims to critically examine the newest lung disease detection procedures using deep learning algorithms that use X-ray and CT scan datasets. The most recent studies in this area (2015-2021) are reviewed and summarized to provide an overview of the most appropriate methods that should be used or developed in future work, what limitations should be considered, and to what extent these techniques help physicians identify the disease with better accuracy. According to the literature, the main limitations have been the lack of various standard datasets, the huge training set, the high dimensionality of data, and the independence of features. Although researchers use many different deep learning architectures, convolutional neural networks (CNNs) remain the state-of-the-art technique for image datasets.


A Combined DFT/Machine Learning Framework for Materials Discovery: Application to Spinels and Assessment of Search Completeness and Efficiency

DOI 10.26434/chemrxiv.13070549.v1, 2020, Cited By ~ 1

Author(s): Elif Ertekin, Joshua A. Schiller

Keyword(s): Machine Learning, Physical Modeling, Density Functional, Search Space, Supervised Machine Learning, Search Efficiency, Learning Approaches, Training Set, Learning Framework, Single Feature

It is challenging to evaluate machine learning approaches developed for accelerating materials search and discovery in a realistic way. Machine learning approaches to materials stability prediction are typically assessed by their ability to reproduce results from direct physical modeling, whereas ideally both machine learning and direct physical modeling should be assessed by their ability to reproduce reality. Additionally, traditional evaluation metrics do not directly reflect the experience of an experimental search for unknown compounds in a large candidate phase space, and often result in overly optimistic assessments. Here, we (i) present a framework that combines density functional theory and traditional supervised machine learning methods (ML/DFT), and (ii) introduce two concepts to evaluate it: search completeness (the fraction of discoverable compounds found relative to the fraction of search space explored) and search efficiency (the rate of discovery relative to the fraction of search space explored). The ML/DFT framework is an iterative approach to predicting stable chemistries of a fixed crystal structure (here, spinels) that uses DFT to generate a training set of unstable compounds. The training set of stable compounds is given by experimentally known spinels. The method is carried out using random forest, LASSO, and ridge regression to predict as-yet undiscovered spinel chemistries. TreeSHAP analysis is used to determine the features that contribute most to stability/instability classification. While no single feature dominates, several emerge that align with chemical intuition. To estimate the efficacy of ML/DFT compared to pure DFT, we introduce a Bayesian description of the DFT distribution of energies for stable and unstable spinels. The Bayesian model enables quantifying the search completeness and search efficiency of DFT, which are then compared to those of ML/DFT. ML/DFT achieves search completeness and efficiency on par with pure DFT, despite requiring fewer DFT simulations (∼300 vs. 14,200). More importantly, by quantitatively assessing ML approaches in ways that better reflect how they would be used in materials discovery experiments, we obtain key insights into the challenges that such methods need to overcome: the small number of stable compounds to be found in a search space orders of magnitude larger places stringent demands on model accuracy to achieve good search efficiency. Finally, we report the top candidates of our spinel search, which may be of interest for synthesis experiments.
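The two evaluation concepts can be made concrete with a small calculation. The sketch below is one plausible reading of the abstract's wording, not the paper's exact definitions: completeness is taken as the share of all stable compounds found so far, efficiency as the number of stable compounds found per candidate examined, and both are reported against the fraction of the search space explored. The function name `search_curve` and the toy candidate list are hypothetical.

```python
def search_curve(stable_in_search_order):
    """Return (fraction_explored, completeness, efficiency) after each candidate.

    stable_in_search_order: booleans, True where the examined candidate is stable,
    listed in the order the search visits them.
    """
    total = len(stable_in_search_order)
    n_stable = sum(stable_in_search_order)
    found = 0
    curve = []
    for examined, hit in enumerate(stable_in_search_order, start=1):
        found += hit
        completeness = found / n_stable if n_stable else 0.0  # share of stable compounds found
        efficiency = found / examined                         # discoveries per candidate examined
        curve.append((examined / total, completeness, efficiency))
    return curve

# Toy search (assumption): 10 candidates, 2 stable, a good ranking visits them early.
ranked = [True, False, True, False, False, False, False, False, False, False]
curve = search_curve(ranked)
explored, completeness, efficiency = curve[2]  # state after examining 3 candidates
print(explored, completeness, round(efficiency, 3))
```

With very few stable compounds in a large space, even a small false-positive rate pushes the efficiency down quickly, which mirrors the abstract's point about stringent demands on model accuracy.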



Computational methods for training set selection and error assessment applied to catalyst design: guidelines for deciding which reactions to run first and which to run next

Reaction Chemistry & Engineering, DOI 10.1039/d1re00013f, 2021

Author(s): Andrew F. Zahrt, Brennan T. Rose, William T. Darrow, Jeremy J. Henle, Scott E. Denmark

Keyword(s): In Silico, Subset Selection, Design Guidelines, Catalyst Design, Error Assessment, Selection Methods, Training Set, Training Set Selection, Catalyst Selection, Selection Of

Different subset selection methods are examined to guide catalyst selection in optimization campaigns. Error assessment methods are used to quantitatively inform selection of new catalyst candidates from in silico libraries of catalyst structures.
