Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms

Complexity, 2021, Vol 2021, pp. 1-15
Author(s): Sahar K. Hussin, Salah M. Abdelmageid, Adel Alkhalil, Yasser M. Omar, Mahmoud I. Marie, ...

Virtual screening is the most critical process in drug discovery, and it relies on machine learning to facilitate the screening process. It enables the discovery of molecules that bind to a specific protein to form a drug. Despite its benefits, virtual screening generates enormous amounts of data and suffers from drawbacks such as high dimensionality and class imbalance. This paper tackles data imbalance and aims to improve virtual screening accuracy, especially for the minority class. When a dataset is classified without considering its imbalanced nature, most classification methods tend to achieve high predictive accuracy for the majority category but significantly poorer accuracy for the minority category. The paper proposes a K-means algorithm coupled with the Synthetic Minority Oversampling Technique (SMOTE) to overcome the problem of imbalanced datasets. The proposed algorithm is named KSMOTE. Using KSMOTE, minority data can be identified with high accuracy and detected with high precision. A large set of experiments was implemented on Apache Spark using numeric PaDEL and fingerprint descriptors. The proposed solution was compared with both a no-sampling method and SMOTE on the same datasets. Experimental results showed that the proposed solution outperformed the other methods.
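
The oversampling idea behind SMOTE can be illustrated with a minimal sketch. It is not the authors' KSMOTE: the abstract does not describe how the K-means step is combined with SMOTE, so the clustering stage is omitted here. The function name `smote_like_oversample`, the parameters `k` and `seed`, and the toy minority points are all assumptions made for illustration.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

# Hypothetical 2-D minority samples; real descriptors would have many more dimensions.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
```

Each synthetic point lies on a line segment between two existing minority samples, so the minority class grows without exact duplicates being added.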

2020
Author(s): Uzair Bhatti

BACKGROUND: In the era of health informatics, the exponential growth of information generated by health information systems and healthcare organizations demands expert and intelligent recommendation systems. Such systems have become some of the most valuable tools because they reduce problems such as information overload when selecting and suggesting doctors, hospitals, medicine, diagnoses, etc. according to patients’ interests. OBJECTIVE: Hybrid filtering is one of the most popular approaches to recommendation, but its major limitations are selectivity and data integrity issues. Most existing recommendation systems and risk prediction algorithms focus on a single domain, whereas cross-domain hybrid filtering can alleviate selectivity and data integrity problems to a greater extent. METHODS: We propose a novel algorithm for recommendation and predictive modeling that uses the KNN algorithm with machine learning algorithms and artificial intelligence (AI). We identify the factors that directly affect diseases and propose an approach for predicting the correct diagnosis of different diseases. We have constructed a series of models with good reliability for predicting different surgery complications and identified several novel clinical associations. We propose a novel algorithm, pr-KNN, that uses KNN for the prediction and recommendation of diseases. RESULTS: We also compared the performance of our algorithm with that of other machine learning algorithms and found that ours performed better, with predictive accuracy improving by 3.61%. CONCLUSIONS: The potential to directly integrate these predictive tools into EHRs may enable personalized medicine and decision-making at the point of care for patient counseling and as a teaching tool. CLINICALTRIAL: dataset for the trials of patient attached
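
For readers unfamiliar with KNN, the basic voting rule can be sketched as follows. This is plain k-nearest-neighbour classification, not the pr-KNN algorithm, whose details the abstract does not give. The function `knn_predict`, the feature vectors, and the labels "A" and "B" are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (features, label) pairs standing in for patient records.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]
label_near_a = knn_predict(train, (1.1, 1.0))
label_near_b = knn_predict(train, (4.0, 4.0))
```

The same neighbour lookup could serve recommendation as well as prediction, by returning the items associated with the nearest records instead of a single voted label.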


2021
Author(s): Xiaotong Zhu, Jinhui Jeanne Huang

Remote sensing offers wide coverage, speed, and low cost for long-term dynamic monitoring of the water environment. With the rise of artificial intelligence, machine learning has enabled remote sensing inversion of seawater quality to achieve higher prediction accuracy. However, because the water quality parameters differ in their physicochemical properties, the performance of the algorithms varies considerably. To improve the predictive accuracy for seawater quality parameters, we propose a technical framework for identifying the optimal machine learning algorithm using Sentinel-2 satellite data and in-situ seawater sample data. In this study, we selected three algorithms, i.e. support vector regression (SVR), XGBoost and deep learning (DL), and four seawater quality parameters, i.e. dissolved oxygen (DO), total dissolved solids (TDS), turbidity (TUR) and chlorophyll-a (Chla). The results show that SVR is the more precise algorithm for DO inversion (R² = 0.81). XGBoost has the best accuracy for Chla and TUR inversion (R² = 0.75 and 0.78, respectively), while DL performs better for TDS (R² = 0.789). Overall, this research provides theoretical support for high-precision remote sensing inversion of offshore seawater quality parameters based on machine learning.
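
The R² scores quoted above can be read with the following sketch in mind. It assumes R² here means the usual coefficient of determination, which the abstract does not state explicitly. The function name `r_squared` and the sample values are illustrative only.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical in-situ measurements and model outputs.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # predictions match exactly
baseline = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always predicting the mean
```

Under this definition a model that always predicts the observed mean scores 0 and a perfect model scores 1, so values such as 0.75 to 0.81 indicate that most of the variance in the measured parameter is explained.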


2019
Author(s): Donald Salami, Carla Alexandra Sousa, Maria do Rosário Oliveira Martins, César Capinha

ABSTRACT: The geographical spread of dengue is a global public health concern. This spread is largely mediated by the importation of dengue from endemic to non-endemic areas via the increasing connectivity of the global air transport network. The dynamic nature and intrinsic heterogeneity of the air transport network make it challenging to predict dengue importation. Here, we explore the capabilities of state-of-the-art machine learning algorithms to predict dengue importation. We trained four machine learning classification algorithms using 6 years of historical dengue importation data for 21 countries in Europe, together with connectivity indices mediating importation and air transport network centrality measures. Predictive performance of the classifiers was evaluated using the area under the receiver operating characteristic curve, sensitivity, and specificity. Finally, we applied practical model-agnostic methods to provide an in-depth explanation of our optimal model’s predictions on a global and local scale. Our best performing model achieved high predictive accuracy, with an area under the receiver operating characteristic curve of 0.94 and a maximized sensitivity of 0.88. The predictor variables identified as most important were the source country’s dengue incidence rate, population size, and volume of air passengers. Network centrality measures, describing the positioning of European countries within the air travel network, were also influential in the predictions. We demonstrated the high predictive performance of a machine learning model in predicting dengue importation and the utility of model-agnostic methods in offering a comprehensive understanding of the reasons behind the predictions. Similar approaches can be utilized in the development of an operational early warning surveillance system for dengue importation.
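
The three evaluation measures named above can be computed as in the sketch below. The labels, scores, and the 0.45 decision threshold are made-up values, not data from the study, and the function names are assumptions. The AUC is computed with the pairwise-ranking definition rather than by integrating a plotted curve.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical importation labels (1 = importation occurred) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
y_pred = [1 if s >= 0.45 else 0 for s in scores]  # 0.45 is an arbitrary threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why a study may report a "maximized" sensitivity alongside a single AUC.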


2020
Author(s): Pedro Ballester

Interest in docking technologies has grown in parallel with the ever-increasing number and diversity of 3D models for macromolecular therapeutic targets. Structure-Based Virtual Screening (SBVS) aims at leveraging these experimental structures to discover the necessary starting points for the drug discovery process. It is now established that Machine Learning (ML) can strongly enhance the predictive accuracy of scoring functions for SBVS by exploiting large datasets from targets, molecules and their associations. However, with greater choice, the question of which ML-based scoring function is the most suitable for prospective use on a given target has gained importance. Here we analyse two approaches to selecting an existing scoring function for the target, along with a third approach that consists of generating a scoring function tailored to the target. These analyses required discussing the limitations of popular SBVS benchmarks, the alternatives for benchmarking scoring functions for SBVS, and how to generate or use them with freely available software.


2020, Vol 2020, pp. 1-12
Author(s): Peter Appiahene, Yaw Marfo Missah, Ussiph Najim

The financial crisis that hit Ghana from 2015 to 2018 has raised various issues with respect to the efficiency of banks and the safety of depositors in the banking industry. As part of measures to improve the banking sector and restore customers’ confidence, efficiency and performance analysis in the banking industry has become a hot issue, because stakeholders have to detect the underlying causes of inefficiencies within the industry. Nonparametric methods such as Data Envelopment Analysis (DEA) have been suggested in the literature as a good measure of banks’ efficiency and performance. Machine learning algorithms have also been viewed as a good tool for estimating various nonparametric and nonlinear problems. This paper presents DEA combined with three machine learning approaches for evaluating bank efficiency and performance, using 444 Ghanaian bank branches as Decision Making Units (DMUs). The results were compared with the corresponding efficiency ratings obtained from the DEA. Finally, the prediction accuracies of the three machine learning models were compared. The results suggested that the decision tree (DT) with its C5.0 algorithm provided the best predictive model. It had 100% accuracy in predicting the holdout sample of 134 (30% of banks) and a P value of 0.00. The DT was followed closely by the random forest algorithm, with a predictive accuracy of 98.5% and a P value of 0.00, and finally by the neural network (86.6% accuracy), with a P value of 0.66. The study concluded that banks in Ghana can use the results of this study to predict their respective efficiencies. All experiments were performed within a simulation environment and conducted in RStudio using R code.


2019, Vol 37 (15_suppl), pp. 2581-2581
Author(s): Paul Johannet, Nicolas Coudray, George Jour, Douglas MacArthur Donnelly, Shirin Bajaj, ...

2581 Background: There is growing interest in optimizing patient selection for treatment with immune checkpoint inhibitors (ICIs). We postulate that phenotypic features present in metastatic melanoma tissue reflect the biology of tumor cells, immune cells, and stromal tissue, and hence can provide predictive information about tumor behavior. Here, we test the hypothesis that machine learning algorithms can be trained to predict the likelihood of response and/or toxicity to ICIs. Methods: We examined 124 stage III/IV melanoma patients who received anti-CTLA-4 (n = 81), anti-PD-1 (n = 25), or combination (n = 18) therapy as first line. The tissue analyzed was resected before treatment with ICIs. In total, 340 H&E slides were digitized and annotated for three regions of interest: tumor, lymphocytes, and stroma. The slides were then partitioned into training (n = 285), validation (n = 26), and test (n = 29) sets. Slides were tiled (299x299 pixels) at 20X magnification. We trained a deep convolutional neural network (DCNN) to automatically segment the images into each of the three regions and then deconstruct images into their component features to detect non-obvious patterns with objectivity and reproducibility. We then trained the DCNN for two classifications: 1) complete/partial response versus progression of disease (POD), and 2) severe versus no immune-related adverse events (irAEs). Predictive accuracy was estimated by area under the curve (AUC) of receiver operating characteristics (ROC). Results: The DCNN identified tumor within LN with AUC 0.987 and within ST with AUC 0.943. Prediction of POD based on ST-only always performed better than prediction based on LN-only (AUC 0.84 compared to 0.61, respectively). The DCNN had an average AUC 0.69 when analyzing only tumor regions from both LN and ST data sets and AUC 0.68 when analyzing tumor and lymphocyte regions. Severe irAEs were predicted with limited accuracy (AUC 0.53). 
Conclusions: Our results support the potential application of machine learning to pre-treatment histologic slides to predict response to ICIs. They also reveal its limited value in predicting toxicity. We are currently investigating whether the predictive capability of the algorithm can be further improved by incorporating additional immunologic biomarkers.


2021
Author(s): Young Chul Youn, Jung-Min Pyun, Hye Ryoun Kim, Sungmin Kang, Nayoung Ryoo, ...

Abstract Background: The Multimer Detection System-Oligomeric amyloid-β (MDS-OAβ) level is a valuable blood-based biomarker for Alzheimer’s disease (AD). We used machine learning algorithms trained using multi-center datasets to examine whether blood MDS-OAβ values can predict AD-associated changes in the brain. Methods: A logistic regression model using TensorFlow (ver. 2.3.0) was applied to data obtained from 163 participants (amyloid positron emission tomography [PET]-positive and -negative findings in 102 and 61 participants, respectively). Algorithms with various combinations of features (MDS-OAβ levels, age, gender, and anticoagulant type) were tested 50 times on each dataset. Results: The predictive accuracy, sensitivity, and specificity values of blood MDS-OAβ levels for amyloid PET positivity were 78.16±4.97%, 83.87±9.40%, and 70.00±13.13%, respectively. Conclusions: The findings from this multi-center machine learning-based study suggest that MDS-OAβ values may be used to predict amyloid PET-positivity.
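
Figures of the form "mean±spread" from repeated runs can be summarized as in this sketch. It assumes the ± values are sample standard deviations across runs, which the abstract does not state. The five accuracies are invented placeholders, not the study's 50 results, and `summarize` is a hypothetical helper.

```python
import statistics

def summarize(values):
    """Return the mean and sample standard deviation of repeated-run metrics."""
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical accuracies (in %) from five repeated train/test runs.
run_accuracies = [76.0, 80.0, 78.0, 82.0, 74.0]
mean_acc, sd_acc = summarize(run_accuracies)
report = f"{mean_acc:.2f}±{sd_acc:.2f}%"  # "78.00±3.16%"
```

Reporting the spread alongside the mean shows how much the metric varies between runs, which matters for a dataset of 163 participants.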


2016
Author(s): Andreas Keller, Richard C. Gerkin, Yuanfang Guan, Amit Dhurandhar, Gabor Turu, ...

Abstract: Despite 25 years of progress in understanding the molecular mechanisms of olfaction, it is still not possible to predict whether a given molecule will have a perceived odor, or what olfactory percept it will produce. To address this stimulus-percept problem for olfaction, we organized the crowd-sourced DREAM Olfaction Prediction Challenge. Working from a large olfactory psychophysical dataset, teams developed machine learning algorithms to predict sensory attributes of molecules based on their chemoinformatic features. The resulting models predicted odor intensity and pleasantness with high accuracy, and also successfully predicted eight semantic descriptors (“garlic”, “fish”, “sweet”, “fruit”, “burnt”, “spices”, “flower”, “sour”). Regularized linear models performed nearly as well as random-forest-based approaches, with a predictive accuracy that closely approaches a key theoretical limit. The models presented here make it possible to predict the perceptual qualities of virtually any molecule with an impressive degree of accuracy and to reverse-engineer the smell of a molecule. One Sentence Summary: Results of a crowdsourcing competition show that it is possible to accurately predict and reverse-engineer the smell of a molecule.

