Developing a machine learning model to identify protein–protein interaction hotspots to facilitate drug discovery

PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e10381
Author(s):  
Rohit Nandakumar ◽  
Valentin Dinu

Throughout the history of drug discovery, an enzyme-based approach to identifying new drug molecules has predominated. Recently, protein–protein interfaces whose disruption by small molecules could make them viable targets for certain diseases, such as cancer and the human immunodeficiency virus, have been identified. Existing studies computationally identify hotspots on these interfaces, with most models attaining accuracies of ~70%. Many studies do not effectively integrate information relating to amino acid chains and other structural information relating to the complex. Herein, (1) a machine learning model has been created and (2) its ability to integrate multiple features, such as those associated with amino-acid chains, has been evaluated to enhance the ability to predict protein–protein interface hotspots. Virtual drug screening analysis of a set of hotspots determined on the EphB2-ephrinB2 complex has also been performed. The model achieves an AUROC of 0.842, a sensitivity/recall of 0.833, and a specificity of 0.850. Virtual screening of a set of hotspots identified by the machine learning model developed in this study has identified potential medications to treat diseases caused by the overexpression of the EphB2-ephrinB2 complex, including prostate, gastric, colorectal and melanoma cancers, which are linked to EphB2 mutations. The efficacy of this model has been demonstrated through its ability to predict drug-disease associations previously identified in the literature, including cimetidine, idarubicin, and pralatrexate for these conditions. In addition, nadolol, a beta blocker, has been identified in this study as binding to the EphB2-ephrinB2 complex, and the possibility of this drug treating multiple cancers is still relatively unexplored.
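
The three metrics quoted in this abstract (AUROC, sensitivity/recall, specificity) can be illustrated with a minimal Python sketch. The labels, predictions and scores below are invented toy values, not data from the study, and the function names are hypothetical; the sketch only shows how such figures are typically computed for a binary hotspot classifier.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = hotspot)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outscores a
    random negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (assumed for illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.7]

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auroc(y_true, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUROC={area:.3f}")
```

Sensitivity and specificity depend on the chosen decision threshold, whereas AUROC summarizes ranking quality across all thresholds, which is presumably why the abstract reports all three.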

However, people often search for a restaurant using just the word “restaurant”, which means different things to different individuals. For an Asian user, it can mean a “Chinese restaurant” or a “Thai restaurant”. Correctly interpreting search requests according to people’s preferences is a challenge. Building a machine-learning model based on the activity history of a registered user can address this problem. The activity histories used in this research are users’ reviews and ratings. This project introduces a data processing pipeline that uses reviews from registered users to generate a machine-learning model for each registered user. It also defines an architecture that uses the generated models to support real-time personalized recommendations for restaurant searches and for the types of food that the recommended restaurants do well. Finally, to develop a good machine-learning model, the project considers different collaborative filtering methodologies for predicting restaurants from user ratings. Slope One, the k-Nearest Neighbors algorithm and multiclass SVM classification are among the methodologies to be considered.
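
Of the collaborative filtering methods named above, Slope One is the simplest to show in a few lines. The following is a minimal sketch of weighted Slope One on an invented ratings table; the user names, restaurant names and ratings are hypothetical, and nothing here is taken from the project's actual pipeline.

```python
def slope_one_predict(ratings, user, target):
    """Predict `user`'s rating of `target` with weighted Slope One.

    ratings: dict mapping user -> dict mapping item -> numeric rating.
    Returns None when no co-rated item pairs are available.
    """
    numerator = 0.0
    denominator = 0
    for item, user_rating in ratings[user].items():
        if item == target:
            continue
        # Rating differences (target - item) over users who rated both.
        diffs = [r[target] - r[item]
                 for r in ratings.values()
                 if target in r and item in r]
        if not diffs:
            continue
        deviation = sum(diffs) / len(diffs)
        # Weight each item's estimate by the number of co-raters.
        numerator += (deviation + user_rating) * len(diffs)
        denominator += len(diffs)
    return numerator / denominator if denominator else None


# Toy ratings (assumed): three users, three restaurants.
ratings = {
    "john": {"thai_place": 5, "sushi_bar": 3, "pizzeria": 2},
    "mark": {"thai_place": 3, "sushi_bar": 4},
    "lucy": {"sushi_bar": 2, "pizzeria": 5},
}

prediction = slope_one_predict(ratings, "lucy", "thai_place")
print(round(prediction, 3))
```

Slope One needs only pairwise average rating differences, so it is cheap to update as new reviews arrive, which may suit the real-time recommendation setting described here.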


2021 ◽  
Vol 15 (8) ◽  
pp. 878-888
Author(s):  
Yang Liu ◽  
Xia-hui Ouyang ◽  
Zhi-Xiong Xiao ◽  
Le Zhang ◽  
Yang Cao

Background: T lymphocytes achieve an immune response by recognizing antigen peptides (also known as T cell epitopes) through major histocompatibility complex (MHC) molecules. The immunogenicity of T cell epitopes depends on their source and their stability in combination with MHC molecules. The binding of the peptide to MHC is the most selective step, so predicting the binding affinity of the peptide to MHC is the principal step in predicting T cell epitopes. The identification of epitopes is of great significance in research on vaccine design and T cell immune responses. Objective: The traditional method for identifying epitopes is to synthesize peptides and test their binding activity experimentally, which is both time-consuming and expensive. In silico methods for predicting peptide-MHC binding have emerged to pre-select candidate peptides for experimental testing, which greatly saves time and cost. By summarizing and analyzing these methods, we hope to gain better insight and provide guidance for future directions. Methods: To date, a number of methods based on various principles have been developed to predict the binding ability of peptides to MHC. Some employ matrix models or machine learning models based on the sequence characteristics of peptides or MHC. Others utilize the three-dimensional structural information of peptides or MHC, for example by extracting structural information to construct a feature matrix or machine learning model, or by directly using protein structure prediction and molecular docking to predict the binding mode of peptides and MHC. Results: Although methods for predicting peptide-MHC binding based on a feature matrix or machine learning model can achieve high-throughput prediction, their accuracy depends heavily on the sequence characteristics of confirmed binding peptides. In addition, they cannot provide insights into the mechanism of antigen specificity. Therefore, such methods have certain limitations in practical applications. Methods for predicting peptide-MHC binding based on structural prediction or molecular docking are computationally intensive compared with those based on a feature matrix or machine learning model, and the challenge is how to predict a reliable structural model. Conclusion: This paper reviews the principles, advantages and disadvantages of peptide-MHC binding prediction methods and discusses future directions for achieving more accurate predictions.
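
To make the "matrix model" idea concrete, here is a minimal Python sketch of position-specific scoring: each peptide position contributes a per-residue score, and the sum ranks candidates. The three-position matrix and its values are invented for illustration (real MHC class I matrices usually cover longer peptides and are fitted to confirmed binders), and the function name is hypothetical.

```python
def score_peptide(peptide, matrix, default=0.0):
    """Sum per-position residue scores for a peptide.

    matrix: list of dicts, one per position, mapping residue -> score.
    Residues absent from a position's dict receive `default`.
    """
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must match matrix length")
    return sum(column.get(residue, default)
               for residue, column in zip(peptide, matrix))


# Toy 3-position scoring matrix (assumed values, not fitted to data).
toy_matrix = [
    {"L": 1.2, "M": 0.8},   # position 1 favors L or M
    {"A": 0.1},             # position 2 is nearly neutral
    {"V": 1.5, "L": 1.0},   # position 3 favors V or L
]

candidates = ["LAV", "MAL", "GAG"]
ranked = sorted(candidates,
                key=lambda p: score_peptide(p, toy_matrix),
                reverse=True)
print(ranked)
```

Because the scores come entirely from known binders, such a model can rank peptides quickly but says nothing about binding mechanism, which matches the limitation noted in the Results.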


2018 ◽  
Vol 59 (3) ◽  
pp. 1221-1229 ◽  
Author(s):  
Antonius P. A. Janssen ◽  
Sebastian H. Grimm ◽  
Ruud H. M. Wijdeven ◽  
Eelke B. Lenselink ◽  
Jacques Neefjes ◽  
...  

Author(s):  
Eelke B. Lenselink ◽  
Pieter F. W. Stouten

Accurate prediction of lipophilicity (logP) based on molecular structures is a well-established field. Predictions of logP are often used to drive forward drug discovery projects. Driven by the SAMPL7 challenge, in this manuscript we describe the steps that were taken to construct a novel machine learning model that can predict and generalize well. This model is based on the recently described Directed-Message Passing Neural Networks (D-MPNNs). Further enhancements included the inclusion of additional datasets from ChEMBL (RMSE improvement of 0.03) and the addition of helper tasks (RMSE improvement of 0.04). To the best of our knowledge, the concept of adding predictions from other models (Simulations Plus logP and [email protected], respectively) as helper tasks is novel and could be applied in a broader context. The final model that we constructed and used to participate in the challenge ranked 2nd out of 17 ranked submissions, with an RMSE of 0.66 and an MAE of 0.48 (submission: Chemprop). The model also works well on other datasets, especially when applied retrospectively to the SAMPL6 challenge, where it would have ranked number one out of all submissions (RMSE of 0.35). Although our model works well, we conclude with suggestions that are expected to improve it even further.
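
The two error metrics used to rank submissions, RMSE and MAE, can be sketched as follows. The measured and predicted logP values are invented toy numbers, not SAMPL7 data, and the function names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between paired values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error between paired values."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Toy logP values (assumed for illustration only).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.0, 4.5]

print(f"RMSE={rmse(measured, predicted):.3f} MAE={mae(measured, predicted):.3f}")
```

RMSE penalizes large individual errors more heavily than MAE, so it is never smaller than MAE on the same data, consistent with the reported pair of 0.66 and 0.48.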


Blood ◽  
2019 ◽  
Vol 134 (Supplement_1) ◽  
pp. 1659-1659
Author(s):  
Srdan Verstovsek ◽  
Valerio De Stefano ◽  
Florian H. Heidel ◽  
Mike Zuurman ◽  
Michael Zaiac ◽  
...  

Introduction: Thromboembolic events (TEs) are one of the most prevalent complications in patients (pts) with polycythemia vera (PV). This real-world evidence study of the US OPTUM database evaluated the incidence of TEs in hydroxyurea (HU)-treated PV pts who either switched to ruxolitinib (RUX) after initial treatment (Tx) with HU (HU-RUX group) or continued HU Tx without switching (HU-alone group). Machine learning was then used to build a precise and scientifically robust model to predict the occurrence of TEs in PV pts with/without a history of TEs and HU failure (defined by either European LeukemiaNet [ELN] hematologic criteria or TEs). Methods: The OPTUM database comprises claims data and electronic medical records from 90 million pts (2007-2017, median stay in the database=7 years), including 69,464 PV pts. To avoid any selection bias during comparison, only pts treated prior to the RUX market launch were included in the HU-alone group (HU-RUX, n=81; HU-alone, n=195). Because Tx duration was unavailable, the time difference between the first and the last prescription was used as a proxy, and overall Tx duration was matched in both groups. TEs were assessed before Tx initiation in both groups. For HU-RUX pts, TEs were also assessed while on HU (median duration 27 months) and on RUX (median duration 14 months). For HU-alone pts, TEs were assessed during the first 27 months of Tx (every pt included in the analysis was treated for longer than this due to duration matching) and during the remaining period of Tx (median duration 14 months). TEs were identified by either a restrictive definition (a list of ICD codes containing keywords from the RESPONSE study was automatically generated and manually curated) or a less restrictive one (the list of ICD codes was manually expanded to include any TEs matching those from the GEMFIN study). PV pts who were exclusively treated with HU for ≥6 months were selected (n=2057) for modeling. 
Outcomes to be predicted were TEs in the 12 months following the end of the 6-month HU Tx period, and HU failure within 3 months of Tx. A logistic regression model was used for prediction. The baseline features extracted from the database included median lab parameters (3-6 months after HU initiation), history of thrombosis prior to primary diagnosis of PV, sociological features (age, gender), comorbidities, and concomitant medications (from inpatient/outpatient tables). Performance assessment methods included Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) in early stages and confusion matrix in later stages; the findings were converted to clinically interpretable decision-tree classification algorithms. Results: Based on the extensive definition, the annual incidence of TEs in the HU-RUX and HU-alone groups, respectively, was 9% and 7% before HU initiation, which increased to 17% and 13% on HU Tx. The small difference in baseline incidence may reflect residual differences between the two groups. After a median duration of 14 months, the incidence of TEs decreased to 15% in pts who switched to RUX vs an increase to 20% in pts who continued HU Tx. A similar trend was observed using the less restrictive definition (Figure 1). This definition resulted in a substantially increased incidence of TEs and a decreased predictive power of the machine-learning model. Using modeling, decision trees were developed to predict the occurrence of TEs in PV pts with/without a history of TEs. Lymphocyte percentage (<17%) and red cell distribution width (RDW; <15%) were predictors in pts without a history of TEs, whereas lymphocyte percentage (>13%) and platelet count (>393×10³/µL) were predictors in pts with a history of TEs (Figure 2). Based on the decision tree developed to predict HU failure, phlebotomy-dependent pts with >15% RDW had a higher risk of HU failure within 3 months of Tx (Figure 3). 
Conclusions: A reduction in the incidence of TEs was observed in pts switching to RUX vs those who continued HU Tx. Based on the findings from this machine-learning model in PV pts, phlebotomy dependency and RDW were indicated as predictors of HU Tx failure within 3 months, whereas lymphocyte percentage + platelet count and lymphocyte percentage + RDW were predictors of incidence of TEs in pts with and without a history of TEs, respectively. Non-adjustment of the results for antiplatelet/anticoagulant Tx was a study limitation. Further validation of this machine-learning model is planned in other European databases. Disclosures: Verstovsek: Celgene: Consultancy, Research Funding; Gilead: Research Funding; Promedior: Research Funding; CTI BioPharma Corp: Research Funding; Genentech: Research Funding; Protagonist Therapeutics: Research Funding; Constellation: Consultancy; Pragmatist: Consultancy; Incyte: Research Funding; Roche: Research Funding; NS Pharma: Research Funding; Blueprint Medicines Corp: Research Funding; Novartis: Consultancy, Research Funding; Sierra Oncology: Research Funding; Pharma Essentia: Research Funding; Astrazeneca: Research Funding; Ital Pharma: Research Funding. De Stefano: Celgene: Consultancy, Honoraria, Speakers Bureau; Janssen: Consultancy, Honoraria, Speakers Bureau; Amgen: Consultancy, Honoraria, Speakers Bureau; Novartis: Consultancy, Honoraria, Research Funding, Speakers Bureau; Alexion: Consultancy, Honoraria, Speakers Bureau. Heidel: Novartis: Consultancy, Honoraria, Research Funding; Celgene: Consultancy; CTI: Consultancy. Zuurman: Novartis Pharma B.V.: Employment. Zaiac: Novartis: Employment, Equity Ownership. Bigan: Novartis: Consultancy. Ruhl: Novartis: Consultancy. Meier: Novartis: Consultancy. Kiladjian: Celgene: Consultancy; Novartis: Honoraria, Research Funding; AOP Orphan: Honoraria, Research Funding.

