Classification and Design of HIV-1 Integrase Inhibitors Based on Machine Learning

Computational and Mathematical Methods in Medicine ◽

10.1155/2021/5559338 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Junlin Zhou ◽

Juan Hao ◽

Lianxin Peng ◽

Huaichuan Duan ◽

Qing Luo ◽

...

Keyword(s):

Machine Learning ◽

Molecular Descriptors ◽

Specific Activity ◽

Training Set ◽

Molecular Fingerprint ◽

Activity Data ◽

Prediction Ability ◽

Test Set ◽

Molecular Fingerprints ◽

HIV-1

A key enzyme in the human immunodeficiency virus type 1 (HIV-1) life cycle, integrase (IN) aids the integration of viral DNA into the host DNA, which has made it an ideal target for the development of anti-HIV drugs. A total of 1785 potential HIV-1 IN inhibitors were collected from the ChEMBL, Binding Database, DrugBank, and PubMed databases, as well as from 40 references. The database was divided into a training set and a test set by random sampling. Exploring the correlation between molecular descriptors and inhibitory activity showed that both the classification and the specific activity data of inhibitors can be predicted more accurately by combining molecular descriptors with molecular fingerprints. The molecular fingerprint descriptors provide additional substructure information that improves prediction ability. Based on the training set, two machine learning methods, recursive partitioning (RP) and naive Bayes (NB), were used to build classifiers of HIV-1 IN inhibitors. On the test set, the RP technique correctly predicted 82.5% of inhibitors and 86.3% of noninhibitors. The NB model correctly predicted 88.3% of inhibitors and 87.2% of noninhibitors, with a correlation coefficient of 85.2%. The results show that the prediction performance of the NB model is slightly better than that of RP, and the key molecular segments were also obtained. Additionally, CoMFA and CoMSIA models with good activity prediction ability were both constructed by exploring the structure-activity relationship, which is helpful for the design and optimization of HIV-1 IN inhibitors.
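The per-class rates quoted in this abstract (the share of inhibitors and of noninhibitors predicted correctly) correspond to sensitivity and specificity. The sketch below shows how such figures could be derived from confusion-matrix counts. It assumes the reported "correlation coefficient" is a Matthews correlation coefficient (MCC), which the abstract does not confirm, and the function name `classification_summary` is hypothetical.

```python
import math


def classification_summary(tp, fn, tn, fp):
    """Return (sensitivity, specificity, mcc) from confusion-matrix counts.

    tp/fn: inhibitors predicted correctly / incorrectly.
    tn/fp: noninhibitors predicted correctly / incorrectly.
    Treating the abstract's "correlation coefficient" as MCC is an assumption.
    """
    sensitivity = tp / (tp + fn)  # fraction of inhibitors recovered
    specificity = tn / (tn + fp)  # fraction of noninhibitors recovered
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts for illustration only; not data from the paper.
sens, spec, mcc = classification_summary(tp=40, fn=10, tn=45, fp=5)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} MCC={mcc:.3f}")
```

Unlike accuracy, MCC stays informative when the two classes are of unequal size, which is one reason it is often reported alongside per-class rates.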


Exploratory analysis on prediction of loan privilege for customers using random forest

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.21.12399 ◽

2018 ◽

Vol 7 (2.21) ◽

pp. 339 ◽

Cited By ~ 1

Author(s):

K Ulaga Priya ◽

S Pushpa ◽

K Kalaivani ◽

A Sartiha

Keyword(s):

Machine Learning ◽

Random Forest ◽

Data Model ◽

Model Evaluation ◽

Banking Industry ◽

Performance Parameters ◽

Training Set ◽

Test Set ◽

Learning Technique ◽

Analytical Processing

In the banking industry, loan processing involves the tedious task of identifying customers who are likely to default. Manual prediction of default customers can result in bad loans. Banks hold a huge volume of behavioral data but are unable to judge from it which customers will default. Modern techniques such as machine learning support analytical processing through supervised and unsupervised learning. A data model for predicting default customers using the random forest technique is proposed. The data model is evaluated on the training set and, based on the performance parameters, the final prediction is made on the test set. The results provide evidence that the random forest technique can help banks predict loan defaulters with high accuracy.


Evaluation of Tree-Based Ensemble Machine Learning Models in Predicting Stock Price Direction of Movement

Information ◽

10.3390/info11060332 ◽

2020 ◽

Vol 11 (6) ◽

pp. 332

Author(s):

Ernest Kwame Ampomah ◽

Zhiguang Qin ◽

Gabriel Nyame

Keyword(s):

Machine Learning ◽

Stock Market ◽

Stock Price ◽

Superior Performance ◽

Operating Characteristics ◽

Training Set ◽

Data Set ◽

Test Set ◽

Ensemble Machine Learning ◽

Better Than

Forecasting the direction and trend of stock prices is an important task that helps investors make prudent financial decisions in the stock market. Investment in the stock market carries substantial risk, and minimizing prediction error reduces that risk. Machine learning (ML) models typically perform better than statistical and econometric models. Ensemble ML models have also been shown in the literature to outperform single ML models. In this work, we compare the effectiveness of tree-based ensemble ML models (Random Forest (RF), XGBoost Classifier (XG), Bagging Classifier (BC), AdaBoost Classifier (Ada), Extra Trees Classifier (ET), and Voting Classifier (VC)) in forecasting the direction of stock price movement. Eight stock data sets from three stock exchanges (NYSE, NASDAQ, and NSE) were randomly collected and used for the study. Each data set is split into a training set and a test set. Ten-fold cross-validation accuracy is used to evaluate the ML models on the training set. In addition, the ML models are evaluated on the test set using accuracy, precision, recall, F1-score, specificity, and area under the receiver operating characteristic curve (AUC-ROC). Kendall's W test of concordance is used to rank the performance of the tree-based ML algorithms. For the training set, the AdaBoost model performed better than the rest of the models. For the test set, the accuracy, precision, F1-score, and AUC metrics generated results significant enough to rank the models, and the Extra Trees classifier outperformed the other models in all the rankings.
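The ten-fold cross-validation mentioned in this abstract partitions the training set into ten folds and holds each one out in turn. A minimal standard-library sketch of that partitioning follows. The function name `k_fold_indices` and the shuffling step are assumptions: the abstract does not say whether folds were shuffled, and for time-ordered stock data an unshuffled, chronological split may be more appropriate.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold CV.

    Shuffling with a fixed seed is an illustrative choice, not something
    the abstract specifies.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Each sample appears in exactly one validation fold.
for train, validation in k_fold_indices(n_samples=25, k=10):
    print(len(train), len(validation))
```

The model's cross-validation accuracy is then the mean of the accuracies obtained on the ten held-out folds.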


An application of machine learning to assist medication order review by pharmacists in a health care center

10.1101/19013029 ◽

2019 ◽

Author(s):

Maxime Thibault ◽

Denis Lebel

Keyword(s):

Machine Learning ◽

Neonatal Intensive Care ◽

Care Center ◽

Variable Order ◽

Medication Order ◽

Training Set ◽

Test Set ◽

Health Care Center ◽

Clinical Benefits ◽

Fold Cross Validation

The objective of this study was to determine whether it is feasible to use machine learning to evaluate how contextually appropriate a medication order is for a patient, in order to assist order review by pharmacists. A neural network was constructed using as input the sequence of word2vec embeddings of the 30 previous orders, as well as the currently active medications, pharmacological classes and ordering department, to predict the next order. The model was trained with data from 2013 to 2017, optimized using 5-fold cross-validation, and tested on orders from 2018. A survey was developed to obtain pharmacist ratings on a sample of 20 orders, which were compared with predictions. The training set included 1 022 272 orders. The test set included 95 310 orders. Baseline training set top-1, top-10 and top-30 accuracies using a dummy classifier were 4.5%, 23.6% and 44.1%, respectively. Final test set accuracies were 44.4%, 69.9% and 80.4%, respectively. The populations in which the model performed best were obstetrics and gynecology patients and newborn babies (either in or out of neonatal intensive care). Pharmacists agreed poorly on their ratings of sampled orders, with a Fleiss kappa of 0.283. The breakdown of metrics by population showed better performance in patients following less variable order patterns, indicating potential usefulness in triaging routine orders to less extensive pharmacist review. We conclude that machine learning has potential for helping pharmacists review medication orders. Future studies should aim at evaluating the clinical benefits of using such a model in practice.
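The top-1, top-10 and top-30 accuracies reported in this abstract count a prediction as correct when the true next order appears among the model's k highest-ranked candidates. The sketch below illustrates that metric; the function name `top_k_accuracy` and the toy medication names are hypothetical and are not taken from the study.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of cases whose true label is among the top-k predictions.

    ranked_predictions: one list per case, ordered from most to least likely.
    true_labels: the label actually observed for each case.
    """
    if len(ranked_predictions) != len(true_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Toy example with made-up order names.
ranked = [
    ["acetaminophen", "ibuprofen", "morphine"],
    ["ibuprofen", "acetaminophen", "morphine"],
    ["morphine", "ibuprofen", "acetaminophen"],
]
truth = ["acetaminophen", "acetaminophen", "acetaminophen"]
for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, truth, k))
```

Top-k accuracy can only stay level or rise as k grows, which is consistent with the increasing sequence of accuracies in the abstract.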


In silico prediction of chemical neurotoxicity using machine learning

Toxicology Research ◽

10.1093/toxres/tfaa016 ◽

2020 ◽

Vol 9 (3) ◽

pp. 164-172

Author(s):

Changsheng Jiang ◽

Piaopiao Zhao ◽

Weihua Li ◽

Yun Tang ◽

Guixia Liu

Keyword(s):

Machine Learning ◽

Regression Models ◽

Cross Validation ◽

Prediction Models ◽

Drug Withdrawal ◽

Molecular Descriptors ◽

Computational Prediction ◽

Machine Learning Algorithms ◽

Training Set ◽

Data Set

Neurotoxicity is one of the main causes of drug withdrawal, and the biological experimental methods for detecting neurotoxicity are time-consuming and laborious. In addition, the existing computational prediction models of neurotoxicity still have some shortcomings. In response to these shortcomings, we collected a large neurotoxicity data set and used PyBioMed molecular descriptors and eight machine learning algorithms to construct regression prediction models of chemical neurotoxicity. Cross-validation and test set validation of the models showed that the extra-trees regressor model had the best predictive performance for neurotoxicity ($q_{\mathrm{test}}^2$ = 0.784). In addition, we determined the applicability domain of the models by calculating the standard deviation distance and the lever distance of the training set. By calculating the contribution of the molecular descriptors to the models, we also found that some molecular descriptors are closely related to neurotoxicity. Considering the accuracy of the regression models, we recommend using the extra-trees regressor model to predict chemical autonomic neurotoxicity.


Prediction of Genotype Positivity in Patients with Hypertrophic Cardiomyopathy Using Machine Learning

Circulation: Genomic and Precision Medicine ◽

10.1161/circgen.120.003259 ◽

2021 ◽

Author(s):

Lusha W. Liang ◽

Michael A. Fifer ◽

Kohei Hasegawa ◽

Mathew S. Maurer ◽

Muredach P. Reilly ◽

...

Keyword(s):

Machine Learning ◽

Genetic Testing ◽

Hypertrophic Cardiomyopathy ◽

Predictive Value ◽

External Validation ◽

Scoring Systems ◽

Training Set ◽

Test Set ◽

Net Reclassification Improvement ◽

Mayo Score

Background - Genetic testing can determine family screening strategies and has prognostic and diagnostic value in hypertrophic cardiomyopathy (HCM). However, it can also pose a significant psychosocial burden. Conventional scoring systems offer modest ability to predict genotype positivity. The aim of our study was to develop a novel prediction model for genotype positivity in patients with HCM by applying machine learning (ML) algorithms. Methods - We constructed three ML models using readily available clinical and cardiac imaging data of 102 patients from Columbia University with HCM who had undergone genetic testing (the training set). We validated model performance on 76 patients with HCM from Massachusetts General Hospital (the test set). Within the test set, we compared the area under the receiver operating characteristic curves (AUCs) for the ML models against the AUCs generated by the Toronto HCM Genotype Score ("the Toronto score") and Mayo HCM Genotype Predictor ("the Mayo score") using the DeLong test and net reclassification improvement (NRI). Results - Overall, 63 of the 178 patients (35%) were genotype positive. The random forest ML model developed in the training set demonstrated an AUC of 0.92 (95% CI 0.85-0.99) in predicting genotype positivity in the test set, significantly outperforming the Toronto score (AUC 0.77, 95% CI 0.65-0.90, p=0.004, NRI: p<0.001) and the Mayo score (AUC 0.79, 95% CI 0.67-0.92, p=0.01, NRI: p=0.001). The gradient boosted decision tree ML model also achieved significant NRI over the Toronto score (p<0.001) and the Mayo score (p=0.03), with an AUC of 0.87 (95% CI 0.75-0.99). Compared to the Toronto and Mayo scores, all three ML models had higher sensitivity, positive predictive value, and negative predictive value. Conclusions - Our ML models demonstrated a superior ability to predict genotype positivity in patients with HCM compared to conventional scoring systems in an external validation test set.
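The AUC values compared in this abstract can be read as the probability that a randomly chosen genotype-positive patient receives a higher score than a randomly chosen genotype-negative patient. The sketch below computes AUC directly from that pairwise definition. The function name `roc_auc` and the toy scores are hypothetical, and the sketch does not reproduce the DeLong test or the confidence intervals used in the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for positive cases, 0 for negative cases.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy scores for illustration only; not data from the study.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

This quadratic-time form is fine for cohorts of a few hundred patients; rank-based implementations are preferable for large data sets.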


Histopathological Images and Multi-Omics Integration Predict Molecular Characteristics and Survival in Lung Adenocarcinoma

Frontiers in Cell and Developmental Biology ◽

10.3389/fcell.2021.720110 ◽

2021 ◽

Vol 9 ◽

Author(s):

Linyan Chen ◽

Hao Zeng ◽

Yu Xiang ◽

Yeqian Huang ◽

Yuling Luo ◽

...

Keyword(s):

Machine Learning ◽

Lung Adenocarcinoma ◽

Image Features ◽

Molecular Characteristics ◽

Learning Models ◽

Training Set ◽

Test Set ◽

Genetic Aberrations ◽

Histopathological Images ◽

Machine Learning Models

Histopathological images and omics profiles play important roles in the prognosis of cancer patients. Here, we extracted quantitative features from histopathological images to predict molecular characteristics and prognosis, and integrated image features with mutation, transcriptomics, and proteomics data for prognosis prediction in lung adenocarcinoma (LUAD). Patients from The Cancer Genome Atlas (TCGA) were divided into a training set (n = 235) and a test set (n = 235). We developed machine learning models in the training set and estimated their predictive performance in the test set. In the test set, the machine learning models could predict genetic aberrations, namely ALK (AUC = 0.879), BRAF (AUC = 0.847), EGFR (AUC = 0.855), and ROS1 (AUC = 0.848), and transcriptional subtypes, namely proximal-inflammatory (AUC = 0.897), proximal-proliferative (AUC = 0.861), and terminal respiratory unit (AUC = 0.894), from histopathological images. Moreover, we obtained tissue microarrays from 316 LUAD patients, including four external validation sets. The prognostic model using image features was predictive of overall survival in the test set and the four validation sets, with 5-year AUCs from 0.717 to 0.825. High-risk and low-risk groups stratified by the model showed different survival in the test set (HR = 4.94, p < 0.0001) and three validation sets (HR = 1.64–2.20, p < 0.05). Combining image features with a single omics type gave greater prognostic power in the test set, for example the histopathology + transcriptomics model (5-year AUC = 0.840; HR = 7.34, p < 0.0001). Finally, the model integrating image features with multi-omics achieved the best performance (5-year AUC = 0.908; HR = 19.98, p < 0.0001). Our results indicate that machine learning models based on histopathological image features can predict genetic aberrations, transcriptional subtypes, and survival outcomes of LUAD patients. The integration of histopathological images and multi-omics may provide better survival prediction for LUAD.


Machine learning for identifying resistance features of Klebsiella pneumoniae using whole-genome sequence single nucleotide polymorphisms

Journal of Medical Microbiology ◽

10.1099/jmm.0.001474 ◽

2021 ◽

Vol 70 (11) ◽

Author(s):

Wenjia Liu ◽

Nanjiao Ying ◽

Qiusi Mo ◽

Shanshan Li ◽

Mengjie Shao ◽

...

Keyword(s):

Machine Learning ◽

Drug Resistance ◽

Resistance Genes ◽

Type Species ◽

Whole Genome ◽

Training Set ◽

Test Set ◽

Content Type ◽

Machine Learning Methods ◽

Link Type

Introduction. Klebsiella pneumoniae, a gram-negative bacterium, is a common pathogen causing nosocomial infection. The drug-resistance rate of K. pneumoniae is increasing year by year, posing a severe threat to public health worldwide. K. pneumoniae has been listed as one of the pathogens causing the global crisis of antimicrobial resistance in nosocomial infections. We need to explore the drug resistance of K. pneumoniae for clinical diagnosis. Single nucleotide polymorphisms (SNPs) occur at high density in whole-genome sequencing (WGS) data, carry rich genetic information, and can affect the structure or expression of proteins. SNPs can be used to explore mutation sites associated with bacterial resistance. Hypothesis/Gap Statement. Machine learning methods can detect genetic features associated with the drug resistance of K. pneumoniae from whole-genome SNP data. Aims. This work used the Fast Feature Selection (FFS) and Codon Mutation Detection (CMD) machine learning methods to detect genetic features related to drug resistance of K. pneumoniae from whole-genome SNP data. Methods. WGS data on resistance of K. pneumoniae strains to four antibiotics (tetracycline, gentamicin, imipenem, amikacin) were downloaded from the European Nucleotide Archive (ENA). Sequence alignments were performed with MUMmer 3 to complete SNP calling using the K. pneumoniae HS11286 chromosome as the reference genome. The FFS algorithm was applied to feature selection of the SNP dataset. The training set was constructed based on mutation sites with mutation frequency >0.995. Based on the original SNP training set, 70% of SNPs were randomly selected from each dataset as the test set to verify the accuracy of the training results. Finally, the resistance genes were obtained by the CMD algorithm and Venny. Results. The number of strains resistant to tetracycline, gentamicin, imipenem and amikacin was 931, 1048, 789 and 203, respectively. Machine learning algorithms were applied to the SNP training set and test set, and 28 and 23 resistance genes were predicted, respectively. The 28 resistance genes from the training set included 22 of the genes from the test set, which verified the accuracy of gene prediction. Among them, some genes (KPHS_35310, KPHS_18220, KPHS_35880, etc.) corresponded to known resistance genes (Eef2, lpxK, MdtC, etc.). Logistic regression classifiers were established based on the identified SNPs in the training set. The areas under the curve (AUCs) for the four antibiotics were 0.939, 0.950, 0.912 and 0.935, showing a strong ability to predict bacterial resistance. Conclusion. Machine learning methods can effectively be used to predict resistance genes and associated SNPs. The FFS and CMD algorithms have wide applicability. They can be used for the drug-resistance analysis of any microorganism with genomic variation and phenotypic data. This work lays a foundation for resistance research in clinical applications.


Abstract 349: Prognostication for Out-of-Hospital Cardiogenic Cardiac Arrest Patients Using Advanced Machine Learning Technique

Circulation ◽

10.1161/circ.138.suppl_2.349 ◽

2018 ◽

Vol 138 (Suppl_2) ◽

Author(s):

Tomohisa Seki ◽

Tomoyoshi Tamura ◽

Masaru Suzuki

Keyword(s):

Machine Learning ◽

Cardiac Arrest ◽

Study Data ◽

Machine Learning Techniques ◽

Operating Characteristics ◽

Training Set ◽

Machine Learning Technique ◽

Test Set ◽

Medical Institutions ◽

Learning Technique

Introduction and Objective: Early prognostication for cardiogenic out-of-hospital cardiac arrest (OHCA) patients remains challenging. Recently, advanced machine learning techniques have been employed for clinical diagnosis and prognostication for various conditions. Therefore, in this study, we attempted to establish a prognostication model for cardiogenic OHCA using an advanced machine learning technique. Methods and Results: Data from a prospective multi-center cohort study of OHCA patients transported by ambulance to 67 medical institutions in the Kanto area of Japan between January 2012 and March 2013 were used in this study. Data for cardiogenic OHCA patients aged ≥18 years were retrieved, and patients were grouped according to the time of the ambulance call (training set: between January 1, 2012 and December 12, 2012; test set: between January 1, 2013 and March 31, 2013). From among 421 variables observed during the period between the ambulance call and initial in-hospital treatment of cardiogenic OHCA, either 38 prehospital factors or 56 prehospital and initial in-hospital factors were used for prognostication. Prognostication models for 1-year survival were established with the random forest method, an advanced machine learning method that aggregates a series of decision trees for classification and regression. After 10-fold internal cross-validation in the training set, the prognostication models were validated using the test set. The area under the receiver operating characteristic curve (AUC) was used to evaluate the prediction performance of the models. Prognostication models for 1-year survival trained with 38 variables or 56 variables showed AUC values of 0.93±0.01 and 0.95±0.01, respectively. Conclusions: Prognostication models trained with an advanced machine learning technique showed favorable prediction capability for 1-year survival of cardiogenic OHCA. These results indicate that an advanced machine learning technique can be applied to establish an early prognostication model for cardiogenic OHCA.


Quantitative Structure-Activity Relationship Study for HIV-1 LEDGF/p75 Inhibitors

Current Computer-Aided Drug Design ◽

10.2174/1573409915666190919153959 ◽

2020 ◽

Vol 16 (5) ◽

pp. 654-666 ◽

Cited By ~ 1

Author(s):

Yang Li ◽

Yujia Tian ◽

Yao Xi ◽

Zijian Qin ◽

Aixia Yan

Keyword(s):

Quantitative Structure Activity Relationship ◽

Structure Activity Relationship ◽

Activity Relationship ◽

Support Vector ◽

Quantitative Structure ◽

Training Set ◽

Test Set ◽

Structure Activity ◽

Consensus Models ◽

HIV-1

Background: HIV-1 integrase (IN) is an important target for the development of new anti-AIDS drugs. HIV-1 LEDGF/p75 inhibitors, which block the interaction between integrase and LEDGF/p75, have been validated for reduction in HIV-1 viral replicative capacity. Methods: In this work, computational Quantitative Structure-Activity Relationship (QSAR) models were developed for predicting the bioactivity of HIV-1 integrase LEDGF/p75 inhibitors. We collected 190 inhibitors and their bioactivities and divided the inhibitors into nine scaffolds using T-distributed Stochastic Neighbor Embedding (TSNE). These 190 inhibitors were split into a training set and a test set either according to the result of a Kohonen's self-organizing map (SOM) or randomly. Multiple Linear Regression (MLR) models, support vector machine (SVM) models and two consensus models were built on the training sets using 20 selected CORINA Symphony descriptors. Results: All the models showed a good prediction of pIC50. The correlation coefficients of all the models were greater than 0.7 on the test set. Consensus Model C1, which performed better than the other models, achieved a correlation coefficient (r) of 0.909 on the training set and 0.804 on the test set. Conclusion: The selected molecular descriptors show that hydrogen bond acceptors, atom charges and electronegativities (especially π atom) were important in predicting the activity of HIV-1 integrase LEDGF/p75-IN inhibitors.


Improved personalized survival prediction of patients with diffuse large B-cell Lymphoma using gene expression profiling

BMC Cancer ◽

10.1186/s12885-020-07492-y ◽

2020 ◽

Vol 20 (1) ◽

Author(s):

Adrián Mosquera Orgueira ◽

José Ángel Díaz Arias ◽

Miguel Cid López ◽

Andrés Peleteiro Raíndo ◽

Beatriz Antelo Rodríguez ◽

...

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Expression Profiling ◽

Cell Lymphoma ◽

Clinical Information ◽

B Cell Lymphoma ◽

Training Set ◽

Test Set ◽

Large B Cell Lymphoma ◽

Large B Cell

Background: Thirty to forty percent of patients with Diffuse Large B-cell Lymphoma (DLBCL) have an adverse clinical evolution. The increased understanding of DLBCL biology has shed light on the clinical evolution of this pathology, leading to the discovery of prognostic factors based on gene expression data, genomic rearrangements and mutational subgroups. Nevertheless, additional efforts are needed in order to enable survival predictions at the patient level. In this study we investigated new machine learning-based models of survival using transcriptomic and clinical data. Methods: Gene expression profiling (GEP) data from 2 different publicly available retrospective DLBCL cohorts were analyzed. Cox regression and unsupervised clustering were performed in order to identify probes associated with overall survival in the larger cohort. Random forests were created to model survival using combinations of GEP data, COO classification and clinical information. Cross-validation was used to compare model results in the training set, and Harrell's concordance index (c-index) was used to assess each model's predictive ability. Results were validated in an independent test set. Results: Two hundred thirty-three and sixty-four patients were included in the training and test sets, respectively. Initially we derived and validated a 4-gene expression clusterization that was independently associated with lower survival in 20% of patients. This pattern included the following genes: TNFRSF9, BIRC3, BCL2L1 and G3BP2. Thereafter, we applied machine learning models to predict survival. A set of 102 genes was highly predictive of disease outcome, outperforming available clinical information and COO classification. The final best model integrated clinical information, COO classification, 4-gene-based clusterization and the expression levels of 50 individual genes (training set c-index, 0.8404; test set c-index, 0.7942). Conclusion: Our results indicate that DLBCL survival models based on the application of machine learning algorithms to gene expression and clinical data can largely outperform other important prognostic variables such as disease stage and COO. Head-to-head comparisons with other risk stratification models are needed to compare its usefulness.
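The concordance index (c-index) reported in this abstract measures how often a model ranks two patients' risks in the same order as their observed survival times. The sketch below follows the usual definition for right-censored data, in which a pair is comparable only when the patient with the shorter follow-up had an observed event. The function name `concordance_index` and the toy values are hypothetical, and the study's exact implementation is not described in the abstract.

```python
def concordance_index(times, events, risk_scores):
    """Concordance index for right-censored survival data.

    times: follow-up time for each patient.
    events: 1 if the event was observed, 0 if the patient was censored.
    risk_scores: higher values mean higher predicted risk (shorter survival).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy values for illustration only; not data from the study.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [0.9, 0.1, 0.5, 0.7]))  # 0.6
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the training and test c-indices quoted above.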
