AntiCP 2.0: an updated model for predicting anticancer peptides

Briefings in Bioinformatics ◽

10.1093/bib/bbaa153 ◽

2020 ◽

Cited By ~ 3

Author(s):

Piyush Agrawal ◽

Dhruv Bhagat ◽

Manish Mahalwal ◽

Neelam Sharma ◽

Gajendra P S Raghava

Keyword(s):

Machine Learning ◽

Training Dataset ◽

Validation Dataset ◽

Composition Analysis ◽

Operating Characteristics ◽

Motif Analysis ◽

Anticancer Peptides ◽

C Terminus ◽

Therapeutic Peptides ◽

Model Training

Abstract The increasing use of therapeutic peptides for treating cancer has received considerable attention from the scientific community in recent years. The present study describes an in silico model developed for predicting and designing anticancer peptides (ACPs). Residue composition analysis of ACPs shows a preference for A, F, K, L and W. Positional preference analysis revealed that residues A, F and K are favored at the N-terminus and residues L and K are preferred at the C-terminus. Motif analysis revealed the presence of motifs such as LAKLA, AKLAK, FAKL and LAKL in ACPs. Machine learning models were developed using various input features and different machine learning classifiers on two datasets, the main dataset and the alternate dataset. On the main dataset, a dipeptide composition based ETree classifier achieved the maximum Matthews correlation coefficient (MCC) of 0.51 and an area under the receiver operating characteristic curve (AUROC) of 0.83 on the training dataset. On the alternate dataset, an amino acid composition based ETree classifier performed best, achieving the highest MCC of 0.80 and an AUROC of 0.97 on the training dataset. Five-fold cross-validation was used for model training and testing, and performance was also evaluated on the validation dataset. The best models were implemented in the webserver AntiCP 2.0, which is freely available at https://webs.iiitd.edu.in/raghava/anticp2/. The webserver is compatible with multiple screens such as iPhone, iPad, laptop and Android phones. A standalone version of the software is available on GitHub, and a Docker-based container has also been developed.
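
The two feature types named in this abstract, amino acid composition and dipeptide composition, can be sketched in a few lines. This is a minimal illustration, not the AntiCP 2.0 implementation: the function names and the example peptide are hypothetical, and the sketch assumes sequences contain only the 20 standard one-letter residue codes.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(peptide):
    """Fraction of each standard residue in the peptide (20 features)."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    counts = Counter(peptide)
    return {aa: counts[aa] / len(peptide) for aa in AMINO_ACIDS}


def dipeptide_composition(peptide):
    """Fraction of each ordered residue pair among adjacent pairs (400 features)."""
    if len(peptide) < 2:
        raise ValueError("peptide must have at least two residues")
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    counts = Counter(pairs)
    return {a + b: counts[a + b] / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)}


# Hypothetical peptide, chosen only to illustrate the feature vectors.
aac = amino_acid_composition("FAKLAKLA")
dpc = dipeptide_composition("FAKLAKLA")
```

Either dictionary can be flattened into a fixed-length vector (20 or 400 values) and passed to a classifier.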


AntiCP 2.0: An updated model for predicting anticancer peptides

10.1101/2020.03.23.003780 ◽

2020 ◽

Author(s):

Piyush Agrawal ◽

Dhruv Bhagat ◽

Manish Mahalwal ◽

Neelam Sharma ◽

Gajendra P. S. Raghava

Keyword(s):

Prediction Models ◽

Training Dataset ◽

Validation Dataset ◽

Composition Analysis ◽

Motif Analysis ◽

Anticancer Peptides ◽

C Terminus ◽

Therapeutic Peptides ◽

Validation Technique ◽

Fold Cross Validation

Abstract The increasing use of therapeutic peptides for treating cancer has received considerable attention from the scientific community in recent years. The present study describes an in silico model developed for predicting and designing anticancer peptides (ACPs). Residue composition analysis of ACPs revealed a preference for A, F, K, L and W. Positional preference analysis revealed that residues A, F and K are preferred at the N-terminus and residues L and K are preferred at the C-terminus. Motif analysis revealed the presence of motifs such as LAKLA, AKLAK, FAKL and LAKL in ACPs. Prediction models were developed using various input features and different machine learning classifiers on two datasets, the main dataset and the alternate dataset. On the main dataset, an ETree classifier based model developed using dipeptide composition achieved the maximum MCC of 0.51 and an AUROC of 0.83 on the training dataset. On the alternate dataset, an ETree classifier based model developed using amino acid composition performed best, achieving the highest MCC of 0.80 and an AUROC of 0.97 on the training dataset. Models were trained and tested using five-fold cross-validation, and their performance was also evaluated on the validation dataset. The best models were implemented in the webserver AntiCP 2.0, freely available at https://webs.iiitd.edu.in/raghava/anticp2. The webserver is compatible with multiple screens such as iPhone, iPad, laptop, and Android phones. A standalone version of the software is provided as a GitHub package as well as a Docker container.
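
For readers unfamiliar with the MCC values quoted here, the coefficient can be computed directly from the four cells of a binary confusion matrix. The sketch below uses the standard definition; the counts in the usage line are made up for illustration and are not taken from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Illustrative counts only (not results from the study).
example_mcc = matthews_corrcoef(tp=90, tn=80, fp=20, fn=10)
```

MCC ranges from -1 (total disagreement) through 0 (chance level) to +1 (perfect prediction), so it remains informative when classes are imbalanced.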


KDClassifier: Urinary Proteomic Spectra Analysis Based on Machine Learning for Classification of Kidney Diseases

10.1101/2020.12.01.20242198 ◽

2020 ◽

Author(s):

Wanjun Zhao ◽

Yong Zhang ◽

Xinming Li ◽

Yonghong Mao ◽

Changwei Wu ◽

...

Keyword(s):

Machine Learning ◽

Mass Spectrum ◽

Kidney Disease ◽

Kidney Diseases ◽

Training Dataset ◽

Validation Dataset ◽

Support Vector ◽

Urinary Proteomics ◽

Diagnosis Model

Abstract Background: By extracting spectrum features from urinary proteomics using an advanced mass spectrometer and machine learning algorithms, more accurate results can be achieved for disease classification. We attempted to establish a novel diagnosis model of kidney diseases by combining machine learning with an extreme gradient boosting (XGBoost) algorithm with complete mass spectrum information from urinary proteomics. Methods: We enrolled 134 patients (including those with IgA nephropathy, membranous nephropathy, and diabetic kidney disease) and 68 healthy participants as controls. For training and validation of the diagnostic model, we used a total of 610,102 mass spectra from their urinary proteomics, produced using high-resolution mass spectrometry. We divided the mass spectrum data into a training dataset (80%) and a validation dataset (20%). The training dataset was used directly to create a diagnosis model using XGBoost, random forest (RF), a support vector machine (SVM), and artificial neural networks (ANNs). Diagnostic accuracy was evaluated using a confusion matrix. We also constructed receiver operating characteristic, Lorenz, and gain curves to evaluate the diagnosis model. Results: Compared with RF, the SVM, and ANNs, the modified XGBoost model, called the Kidney Disease Classifier (KDClassifier), showed the best performance. The accuracy of the diagnostic XGBoost model was 96.03% (CI = 95.17%-96.77%; Kappa = 0.943; McNemar's test, P value = 0.00027). The area under the curve of the XGBoost model was 0.952 (CI = 0.9307-0.9733). The Kolmogorov-Smirnov (KS) value of the Lorenz curve was 0.8514. The Lorenz and gain curves showed the strong robustness of the developed model. Conclusions: This study presents the first XGBoost diagnosis model, the KDClassifier, combined with complete mass spectrum information from urinary proteomics for distinguishing different kidney diseases. KDClassifier achieves high accuracy and robustness, providing a potential tool for the classification of all types of kidney diseases.
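
The abstract reports both accuracy and a kappa statistic derived from a confusion matrix. A minimal sketch of how the two relate is given below, assuming the reported value is Cohen's kappa; the matrix in the usage line is invented for illustration.

```python
def accuracy_and_kappa(confusion):
    """Accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i][j] is the number of samples with true class i
    predicted as class j. Kappa is undefined (division by zero)
    when the chance-agreement term equals 1.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Illustrative three-class matrix only (not data from the study).
acc, kappa = accuracy_and_kappa([[8, 1, 1], [1, 8, 1], [1, 1, 8]])
```

Kappa discounts the agreement that the class frequencies alone would produce, which is why it is lower than raw accuracy.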


Discovery of Highly Polymorphic Organic Materials: A New Machine Learning Approach

10.26434/chemrxiv.9524219 ◽

2019 ◽

Author(s):

Zied Hosni ◽

Annalisa Riccardi ◽

Stephanie Yerdelen ◽

Alan R. G. Martin ◽

Deborah Bowering ◽

...

Keyword(s):

Machine Learning ◽

Structure Prediction ◽

External Validation ◽

New Drugs ◽

Training Dataset ◽

Validation Dataset ◽

Machine Learning Classification ◽

Novel Approach ◽

Physical Form ◽

Machine Learning Approach

Polymorphism is the capacity of a molecule to adopt different conformations or molecular packing arrangements in the solid state. This is a key property to control during pharmaceutical manufacturing because it can impact a range of properties including stability and solubility. In this study, a novel approach based on machine learning classification methods is used to predict the likelihood that an organic compound crystallises in multiple forms. A training dataset of drug-like molecules was curated from the Cambridge Structural Database (CSD) and filtered according to entries in the Drug Bank database. The number of separate forms in the CSD for each molecule was recorded. A metaclassifier was trained on this dataset to predict the expected number of crystalline forms from the compound descriptors. This approach was used to estimate the number of crystallographic forms for an external validation dataset. These results suggest that this novel methodology can be used to predict the extent of polymorphism of new drugs or of molecules not yet experimentally screened. This promising method complements expensive ab initio methods for crystal structure prediction and, as an integral part of experimental physical form screening, may identify systems with unexplored potential.


A real-world demonstration of machine learning generalizability in the detection of intracranial hemorrhage on head computerized tomography

Scientific Reports ◽

10.1038/s41598-021-95533-2 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Hojjat Salehinejad ◽

Jumpei Kitamura ◽

Noah Ditkofsky ◽

Amy Lin ◽

Aditya Bharatha ◽

...

Keyword(s):

Machine Learning ◽

Medical Imaging ◽

Intracranial Hemorrhage ◽

Real World ◽

External Validation ◽

Model Performance ◽

Training Dataset ◽

Validation Dataset ◽

Great Promise ◽

Clinical Environments

Abstract Machine learning (ML) holds great promise in transforming healthcare. While published studies have shown the utility of ML models in interpreting medical imaging examinations, these are often evaluated under laboratory settings. The importance of real world evaluation is best illustrated by case studies that have documented successes and failures in the translation of these models into clinical environments. A key prerequisite for the clinical adoption of these technologies is demonstrating generalizable ML model performance under real world circumstances. The purpose of this study was to demonstrate that ML model generalizability is achievable in medical imaging with the detection of intracranial hemorrhage (ICH) on non-contrast computed tomography (CT) scans serving as the use case. An ML model was trained using 21,784 scans from the RSNA Intracranial Hemorrhage CT dataset while generalizability was evaluated using an external validation dataset obtained from our busy trauma and neurosurgical center. This real world external validation dataset consisted of every unenhanced head CT scan (n = 5965) performed in our emergency department in 2019 without exclusion. The model demonstrated an AUC of 98.4%, sensitivity of 98.8%, and specificity of 98.0%, on the test dataset. On external validation, the model demonstrated an AUC of 95.4%, sensitivity of 91.3%, and specificity of 94.1%. Evaluating the ML model using a real world external validation dataset that is temporally and geographically distinct from the training dataset indicates that ML generalizability is achievable in medical imaging applications.
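
Sensitivity and specificity, as reported for the test and external validation datasets, follow from comparing binary predictions with ground-truth labels. The sketch below assumes labels coded as 1 (hemorrhage present) and 0 (absent); the label lists are made up and the function name is hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true positive rate) and specificity (true negative rate).

    Assumes binary labels coded 1 = positive, 0 = negative, and at
    least one sample of each class in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative labels only (not data from the study).
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(truth, preds)
```

Computing the same two numbers separately on an internal test set and on an external dataset is what exposes a generalization gap such as the one described above.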


Predicting One-Year Outcome in First Episode Psychosis using Machine Learning

10.31234/osf.io/fvwgt ◽

2018 ◽

Author(s):

Samuel Leighton ◽

Rajeev Krishnadas ◽

Kelly Chung ◽

Alison Blair ◽

Susie Brown ◽

...

Keyword(s):

Machine Learning ◽

First Episode Psychosis ◽

Elastic Net ◽

Training Dataset ◽

Validation Dataset ◽

First Episode ◽

Episode Psychosis ◽

Symptom Remission ◽

One Year ◽

Independent Cohort

Background: Early illness course correlates with long-term outcome in psychosis. Accurate prediction could allow more focused intervention. Earlier intervention corresponds to significantly better symptomatic and functional outcomes. Our study objective is to use routinely collected baseline demographic and clinical characteristics to predict employment, education or training (EET) status, and symptom remission in patients with first episode psychosis (FEP) at one year. Methods and findings: 83 FEP patients were recruited from National Health Service (NHS) Glasgow between 2011 and 2014 to a 24-month prospective cohort study with regular assessment of demographic and psychometric measures. An external independent cohort of 79 FEP patients was recruited from NHS Glasgow and Edinburgh during a 12-month study between 2006 and 2009. Elastic net regularised logistic regression models were built to predict binary EET status, period and point remission outcomes at one year on 83 Glasgow patients (training dataset). Models were externally validated on an independent dataset of 79 patients from Glasgow and Edinburgh (validation dataset). Only baseline predictors shared across both cohorts were made available for model training and validation. After excluding participants with missing outcomes, models were built on the training dataset for EET status, period and point remission outcomes and externally validated on the validation dataset. Models predicted EET status, period and point remission with ROC area under curve (AUC) performances of 0.876 (95%CI: 0.864, 0.887), 0.630 (95%CI: 0.612, 0.647) and 0.652 (95%CI: 0.635, 0.670) respectively. Positive predictors of EET included baseline EET and living with spouse/children. Negative predictors included higher suspiciousness, hostility and delusions scores on the Positive and Negative Syndrome Scale (PANSS). Positive predictors for symptom remission included living with spouse/children, and affective symptoms on the PANSS. Negative predictors of remission included passive social withdrawal symptoms on PANSS. A key limitation of this study is the small sample size (n) relative to the number of predictors (p), whereby p approaches n. The use of elastic net regularised regression rather than ordinary least squares regression helped circumvent this difficulty. Further, we did not have information for biological and additional social variables, such as nicotine dependence, which observational studies have linked to outcomes in psychosis. Conclusions and Relevance: Using advanced statistical machine learning techniques we provide the first externally validated evidence, in a temporally and geographically independent cohort, for the ability to predict one-year EET status and symptom remission in individual FEP patients.


Evaluation of Different Machine Learning Models and Novel Deep Learning-based Algorithm for Landslide Susceptibility Mapping

10.21203/rs.3.rs-720898/v1 ◽

2021 ◽

Author(s):

Tingyu Zhang ◽

Huanyuan Wang ◽

Tianqing Chen ◽

Zenghui Sun ◽

Tao Wang ◽

...

Keyword(s):

Deep Learning ◽

Landslide Susceptibility ◽

Reference Model ◽

Susceptibility Mapping ◽

Landslide Susceptibility Mapping ◽

Training Dataset ◽

Validation Dataset ◽

Support Vector ◽

Operating Characteristics ◽

Predisposing Factor

Abstract Landslides cause incalculable losses and damage worldwide every year. However, existing approaches to landslide susceptibility mapping cannot fully meet the requirements of landslide prevention, and further exploration and innovation are needed. Therefore, the main aim of this study is to develop a novel deep learning model, landslide net (LSNet), to assess landslide susceptibility in Hanyin County, China; a support vector machine (SVM) model and a kernel logistic regression (KLR) model were employed as reference models. The inventory map was generated from 259 landslides; the training and validation datasets were prepared using 70% and the remaining 30% of the landslides, respectively. The variance inflation factor (VIF) was applied to optimize each landslide predisposing factor. Three benchmark indices were used to evaluate the results of susceptibility mapping, and the area under the receiver operating characteristic curve (AUROC) was used to compare the models. The results demonstrated that although the LSNet model has the slowest processing speed, it still significantly outperformed the benchmark models on the validation dataset, with the highest accuracy (0.950), precision (0.951), F1 (0.951) and AUROC (0.941), reflecting excellent predictive ability to some degree. The results of this study can improve the rapid response capability of landslide prevention in Hanyin County.


Improving Pre-eclampsia Risk Prediction by Modeling Individualized Pregnancy Trajectories Derived from Routinely Collected Electronic Medical Record Data

10.1101/2021.03.23.21254178 ◽

2021 ◽

Author(s):

Shilong Li ◽

Zichen Wang ◽

Luciana A. Vieira ◽

Amanda B. Zheutlin ◽

Boshu Ru ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

At Risk ◽

Complete Blood Count ◽

Early Recognition ◽

Training Dataset ◽

Validation Dataset ◽

Time Period ◽

Patients At Risk ◽

Mount Sinai

Abstract Preeclampsia (PE) is a heterogeneous and complex disease associated with rising morbidity and mortality in pregnant women and newborns in the US. Early recognition of patients at risk is a pressing clinical need to significantly reduce the risk of adverse pregnancy outcomes. We assessed whether information routinely collected and stored on women in their electronic medical records (EMR) could enhance the prediction of PE risk beyond what is achieved in standard of care assessments today. We developed a digital phenotyping algorithm to assemble and curate 108,557 pregnancies from EMRs across the Mount Sinai Health System (MSHS), accurately reconstructing pregnancy journeys and normalizing these journeys across different hospital EMR systems. We then applied machine learning approaches to a training dataset from Mount Sinai Hospital (MSH) (N = 60,879) to construct predictive models of PE across three major pregnancy time periods (ante-, intra-, and postpartum). The resulting models predicted PE with high accuracy across the different pregnancy periods, with areas under the receiver operating characteristic curves (AUC) of 0.92, 0.83 and 0.89 at 37 gestational weeks, intrapartum and postpartum, respectively. We observed comparable performance in two independent patient cohorts with diverse patient populations (MSH validation dataset N = 38,421 and Mount Sinai West dataset N = 9,257). While our machine learning approach identified known risk factors of PE (such as blood pressure, weight and maternal age), it also identified novel PE risk factors, such as complete blood count related characteristics for the antepartum time period and ibuprofen usage for the postpartum time period. Our model not only has utility for earlier identification of patients at risk for PE, but given the prediction accuracy substantially exceeds what is achieved today in clinical practice, our model provides a path for promoting personalized precision therapeutic strategies for patients at risk.
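
The AUC values quoted in this abstract have a simple probabilistic reading: the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal pairwise sketch of that definition follows; the labels and scores are invented, and the quadratic loop is for clarity rather than efficiency.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half.
    Assumes labels coded 1/0 with at least one sample of each class.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative labels and risk scores only (not data from the study).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because it depends only on the ranking of scores, AUC can be compared across cohorts without fixing a decision threshold.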


Land Subsidence Susceptibility Mapping in South Korea Using Machine Learning Algorithms

Sensors ◽

10.3390/s18082464 ◽

2018 ◽

Vol 18 (8) ◽

pp. 2464 ◽

Cited By ~ 64

Author(s):

Dieu Tien Bui ◽

Himan Shahabi ◽

Ataollah Shirzadi ◽

Kamran Chapi ◽

Biswajeet Pradhan ◽

...

Keyword(s):

Machine Learning ◽

South Korea ◽

Land Subsidence ◽

Slope Angle ◽

Machine Learning Algorithms ◽

The Other ◽

Training Dataset ◽

Validation Dataset ◽

Support Vector ◽

Susceptibility Map

In this study, land subsidence susceptibility was assessed for a study area in South Korea using four machine learning models: Bayesian Logistic Regression (BLR), Support Vector Machine (SVM), Logistic Model Tree (LMT) and Alternate Decision Tree (ADTree). Eight conditioning factors identified as the most important influences on land subsidence in the Jeong-am area were used in the modelling: slope angle, distance to drift, drift density, geology, distance to lineament, lineament density, land use and rock-mass rating (RMR). About 24 previous land subsidence occurrences were surveyed and divided into a training dataset (70% of data) and a validation dataset (30% of data) for the modelling process. Each studied model generated a land subsidence susceptibility map (LSSM). The maps were verified using several appropriate tools, including statistical indices, the area under the receiver operating characteristic curve (AUROC), and success rate (SR) and prediction rate (PR) curves. The results of this study indicated that the BLR model produced an LSSM with higher accuracy and reliability than the other applied models, even though the other models also gave reasonable results.


Quantifying identifiability to choose and audit ϵ in differentially private deep learning

Proceedings of the VLDB Endowment ◽

10.14778/3484224.3484231 ◽

2021 ◽

Vol 14 (13) ◽

pp. 3335-3347

Author(s):

Daniel Bernau ◽

Günther Eibl ◽

Philip W. Grassal ◽

Hannah Keller ◽

Florian Kerschbaum

Keyword(s):

Machine Learning ◽

Differential Privacy ◽

Training Data ◽

Training Dataset ◽

Privacy Leakage ◽

Societal Norms ◽

Machine Learning Model ◽

Model Training ◽

Parameter Values ◽

Learning Data

Differential privacy allows bounding the influence that training data records have on a machine learning model. To use differential privacy in machine learning, data scientists must choose privacy parameters (ϵ, δ). Choosing meaningful privacy parameters is key, since models trained with weak privacy parameters might result in excessive privacy leakage, while strong privacy parameters might overly degrade model utility. However, privacy parameter values are difficult to choose for two main reasons. First, the theoretical upper bound on privacy loss (ϵ, δ) might be loose, depending on the chosen sensitivity and data distribution of practical datasets. Second, legal requirements and societal norms for anonymization often refer to individual identifiability, to which (ϵ, δ) are only indirectly related. We transform (ϵ, δ) to a bound on the Bayesian posterior belief of the adversary assumed by differential privacy concerning the presence of any record in the training dataset. The bound holds for multidimensional queries under composition, and we show that it can be tight in practice. Furthermore, we derive an identifiability bound, which relates the adversary assumed in differential privacy to previous work on membership inference adversaries. We formulate an implementation of this differential privacy adversary that allows data scientists to audit model training and compute empirical identifiability scores and empirical (ϵ, δ).
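
To make the link between ϵ and an adversary's posterior belief concrete, the sketch below covers only the simplest textbook case and is not the bound derived in the paper. It assumes pure ϵ-differential privacy (δ = 0), an adversary deciding between two neighboring datasets with a uniform prior, and a single release. Under those assumptions the likelihood ratio is at most e^ϵ, so Bayes' rule caps the posterior at 1 / (1 + e^(-ϵ)). Both function names are hypothetical.

```python
import math


def posterior_belief_bound(epsilon):
    """Upper bound on the adversary's posterior belief.

    Assumes delta = 0, a uniform prior over two neighboring datasets,
    and one mechanism output; these are simplifying assumptions, not
    the general (epsilon, delta) setting under composition.
    """
    return 1.0 / (1.0 + math.exp(-epsilon))


def epsilon_for_belief(rho):
    """Inverse mapping: the epsilon whose bound equals rho (0.5 <= rho < 1)."""
    return math.log(rho / (1.0 - rho))


# Example: capping posterior belief at 90% corresponds to epsilon = ln 9.
eps_for_90 = epsilon_for_belief(0.9)
```

Read this way, ϵ = 0 leaves the adversary at the 50% prior, while larger ϵ lets the belief approach certainty, which is the kind of identifiability interpretation the abstract argues for.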


Forest Fire Susceptibility Prediction Based on Machine Learning Models with Resampling Algorithms on Remote Sensing Data

Remote Sensing ◽

10.3390/rs12223682 ◽

2020 ◽

Vol 12 (22) ◽

pp. 3682

Author(s):

Bahareh Kalantar ◽

Naonori Ueda ◽

Mohammed O. Idrees ◽

Saeid Janizadeh ◽

Kourosh Ahmadi ◽

...

Keyword(s):

Machine Learning ◽

Forest Fire ◽

Spatial Relationship ◽

Remote Sensing Data ◽

Multivariate Adaptive Regression Splines ◽

Training Dataset ◽

Support Vector ◽

Operating Characteristics ◽

Mazandaran Province ◽

Boosted Regression Tree

This study predicts forest fire susceptibility in the Chaloos Rood watershed in Iran using three machine learning (ML) models: multivariate adaptive regression splines (MARS), support vector machine (SVM), and boosted regression tree (BRT). The study uses 14 sets of fire predictors derived from vegetation indices, climatic variables, environmental factors, and topographical features. To assess the suitability of the models and estimate the variance and bias of estimation, the training dataset obtained from the Natural Resources Directorate of Mazandaran province was subjected to resampling using cross validation (CV), bootstrap, and optimism bootstrap techniques. Using the variance inflation factor (VIF), a weight indicating the strength of the spatial relationship of each predictor to fire occurrence was assigned to each contributing variable. Subsequently, the models were trained and validated using the area under the receiver operating characteristic (ROC) curve (AUC). Model validation based on the resampling techniques (none, 5-fold CV, 10-fold CV, bootstrap and optimism bootstrap) produced AUC values of 0.78, 0.88, 0.90, 0.86 and 0.83 for the MARS model; 0.82, 0.82, 0.89, 0.87 and 0.84 for the SVM; and 0.87, 0.90, 0.90, 0.90 and 0.91 for the BRT model. Among the individual models, 10-fold CV performed best for MARS and SVM, with AUC values of 0.90 and 0.89. Overall, the BRT outperformed the other models in all cases, with the highest AUC value of 0.91 using the optimism bootstrap resampling algorithm. Generally, the resampling process enhanced the prediction performance of all the models.
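
The 5-fold and 10-fold cross-validation schemes compared in this abstract differ only in how the training data are partitioned. A minimal sketch of the index bookkeeping is shown below; it is a generic k-fold splitter with a hypothetical function name, not the resampling code used in the study, and it omits the bootstrap variants.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Indices are shuffled once with a fixed seed, then dealt into k
    folds whose sizes differ by at most one. Each sample appears in
    exactly one test fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Illustrative sizes only: 10 samples split into 5 folds.
splits = list(k_fold_indices(10, 5))
```

A model would be fitted on each `train` list and scored on the matching `test` list, with the k scores averaged to give the cross-validated estimate.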
