Performance of machine-learning scoring functions in structure-based virtual screening

Interest in docking technologies has grown in parallel with the ever-increasing number and diversity of 3D models for macromolecular therapeutic targets. Structure-Based Virtual Screening (SBVS) aims to leverage these experimental structures to discover the necessary starting points for the drug discovery process. It is now established that Machine Learning (ML) can strongly enhance the predictive accuracy of scoring functions for SBVS by exploiting large datasets of targets, molecules and their associations. However, with greater choice, the question of which ML-based scoring function is most suitable for prospective use on a given target has gained importance. Here we analyse two approaches to selecting an existing scoring function for the target, along with a third approach that consists of generating a scoring function tailored to the target. These analyses required discussing the limitations of popular SBVS benchmarks, the alternatives for benchmarking scoring functions for SBVS, and how to generate or use them with freely available software.

Selecting Machine-Learning Scoring Functions for Structure-Based Virtual Screening

10.26434/chemrxiv.12967160.v1 ◽

2020 ◽

Author(s):

Pedro Ballester

Keyword(s):

Machine Learning ◽

Drug Discovery ◽

Virtual Screening ◽

Predictive Accuracy ◽

Scoring Function ◽

3D Models ◽

Large Datasets ◽

Scoring Functions ◽

Discovery Process ◽

Drug Discovery Process

Beware of the generic machine learning-based scoring functions in structure-based virtual screening

Briefings in Bioinformatics ◽

10.1093/bib/bbaa070 ◽

2020 ◽

Cited By ~ 1

Author(s):

Chao Shen ◽

Ye Hu ◽

Zhe Wang ◽

Xujun Zhang ◽

Jinping Pang ◽

...

Keyword(s):

Machine Learning ◽

Virtual Screening ◽

Scoring Functions ◽

Training Set ◽

The Real ◽

Systematic Assessment ◽

Training Sets

Machine learning-based scoring functions (MLSFs) have attracted extensive attention recently and are expected to be potential rescoring tools for structure-based virtual screening (SBVS). However, a major concern is whether MLSFs trained for generic use, rather than for a given target, are consistently applicable to VS. In this study, a systematic assessment was carried out to re-evaluate the effectiveness of 14 reported MLSFs in VS. Overall, most of these MLSFs could hardly achieve satisfactory results for any dataset, and they could not even outperform the baseline of classical SFs such as Glide SP. An exception was observed for RFscore-VS trained on the Directory of Useful Decoys-Enhanced dataset, which was superior for most targets. However, in most cases, it showed rather limited performance on targets that were dissimilar to the proteins in the corresponding training sets. We also used the top three docking poses rather than the top one for rescoring and retrained the models with updated versions of the training set, but only minor improvements were observed. Taken together, generic MLSFs may generalize too poorly to be applicable to real VS campaigns. Therefore, caution is warranted when using this type of method for VS.
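The abstract above does not say which metric was used to compare scoring functions. As an illustration only, the sketch below computes the enrichment factor, one common way to measure how well a rescoring method ranks known actives ahead of decoys. The function name, the convention that a higher score is better, and the example numbers are assumptions, not taken from the study.

```python
import math


def enrichment_factor(scores, is_active, fraction=0.01):
    """Enrichment factor at the top `fraction` of a ranked library.

    EF = (hit rate among the top-ranked compounds) / (hit rate in the whole library).
    Assumes a higher score means a better predicted binder.
    """
    if len(scores) != len(is_active) or not scores:
        raise ValueError("scores and is_active must be non-empty and equally long")
    n_total = len(scores)
    n_actives = sum(1 for flag in is_active if flag)
    if n_actives == 0:
        raise ValueError("at least one active is required")
    # Number of compounds in the selected top fraction (at least one).
    n_top = max(1, math.ceil(n_total * fraction))
    # Rank by descending score; ties keep their input order.
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(1 for _, flag in ranked[:n_top] if flag)
    return (hits_top / n_top) / (n_actives / n_total)


# Made-up example: 10 compounds, 2 actives, one of them ranked first.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
example_labels = [True, False, False, True, False, False, False, False, False, False]
ef_top20 = enrichment_factor(example_scores, example_labels, fraction=0.2)
```

A value above 1 means the top of the ranking is richer in actives than a random selection would be. Comparing a generic and a target-specific scoring function on the same target then amounts to computing this value for each set of scores.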

Machine‐learning scoring functions for structure‐based virtual screening

Wiley Interdisciplinary Reviews Computational Molecular Science ◽

10.1002/wcms.1478 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 3

Author(s):

Hongjian Li ◽

Kam‐Heung Sze ◽

Gang Lu ◽

Pedro J. Ballester

Keyword(s):

Machine Learning ◽

Virtual Screening ◽

Scoring Functions

Selecting machine-learning scoring functions for structure-based virtual screening

Drug Discovery Today Technologies ◽

10.1016/j.ddtec.2020.09.001 ◽

2019 ◽

Vol 32-33 ◽

pp. 81-87 ◽

Cited By ~ 1

Author(s):

Pedro J. Ballester

Keyword(s):

Machine Learning ◽

Virtual Screening ◽

Scoring Functions

Convex-PLR – Revisiting affinity predictions and virtual screening using physics-informed machine learning

10.1101/2021.09.13.460049 ◽

2021 ◽

Author(s):

Maria Kadukova ◽

Vladimir Chupin ◽

Sergei Grudinin

Keyword(s):

Machine Learning ◽

Virtual Screening ◽

Scoring Function ◽

Structural Data ◽

Classification Problem ◽

Substantial Improvement ◽

Screening Tests ◽

Conformational Sampling ◽

Scoring Functions ◽

New Model

Virtual screening is an essential part of the modern drug design pipeline, which significantly accelerates the discovery of new drug candidates. Structure-based virtual screening involves ligand conformational sampling, which is often followed by re-scoring of docking poses. A great variety of scoring functions have been designed for this purpose. The advent of structural and affinity databases and the progress in machine-learning methods have recently boosted scoring function performance. Nonetheless, the most successful scoring functions are typically designed for specific tasks or systems. All-purpose scoring functions still perform poorly on virtual screening tests, compared to the precision with which they are able to predict co-crystal binding poses. Another limitation is the low interpretability of the heuristics being used. We analyzed scoring functions’ performance in the CASF benchmarks and discovered that the vast majority of them have a strong bias towards predicting larger binding interfaces. This motivated us to develop a physical model with additional entropic terms with the aim of penalizing such a preference. We parameterized the new model using affinity and structural data, solving a classification problem followed by regression. The new model, called Convex-PLR, demonstrated high-quality results on multiple tests and a substantial improvement over its predecessor Convex-PL. Convex-PLR can be used for molecular docking together with VinaCPL, our version of AutoDock Vina, with Convex-PL integrated as a scoring function. Convex-PLR, Convex-PL, and VinaCPL are available at https://team.inria.fr/nano-d/convex-pl/.

Random Forest Refinement of Pairwise Potentials for Protein-ligand Decoy Detection

10.26434/chemrxiv.8047820.v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jun Pei ◽

Zheng Zheng ◽

Hyunji Kim ◽

Lin Song ◽

Sarah Walworth ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Probability Function ◽

Pair Potential ◽

Scoring Function ◽

Stable Structure ◽

Scoring Functions ◽

Atom Pair ◽

Data Set ◽

Atom Pairs

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relative importance of each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance of each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept, and the resultant RF models were tested on CASF-2013. In a comparison of the performance of our RF models against 29 scoring functions, we found that our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificially designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, which was obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which shares the same peak positions with GARF but has fixed peak heights. The comparison of accuracy among RF models based on the scrambled, uniform, and original GARF potentials clearly showed that the peak positions in the GARF potential are important while the well depths are not.

Machine Learning-Based Scoring Functions. Development and Applications with SAnDReS.

Current Medicinal Chemistry ◽

10.2174/0929867327666200515101820 ◽

2020 ◽

Vol 27 ◽

Author(s):

Gabriela Bitencourt-Ferreira ◽

Camila Rizzotto ◽

Walter Filgueira de Azevedo Junior

Keyword(s):

Machine Learning ◽

Binding Affinity ◽

Drug Targets ◽

Computational Models ◽

Factor Xa ◽

Coagulation Factor ◽

Predictive Performance ◽

Machine Learning Techniques ◽

Scoring Functions ◽

Molegro Virtual Docker

Background: Analysis of atomic coordinates of protein-ligand complexes can provide three-dimensional data to generate computational models to evaluate binding affinity and thermodynamic state functions. Application of machine learning techniques can create models to assess protein-ligand potential energy and binding affinity. These methods show superior predictive performance when compared with classical scoring functions available in docking programs. Objective: Our purpose here is to review the development and application of the program SAnDReS. We describe the creation of machine learning models to assess the binding affinity of protein-ligand complexes. Method: SAnDReS implements machine learning methods available in the scikit-learn library. This program is available for download at https://github.com/azevedolab/sandres. SAnDReS uses crystallographic structures together with binding and thermodynamic data to create targeted scoring functions. Results: Recent applications of the program SAnDReS to drug targets such as coagulation factor Xa, cyclin-dependent kinases, and HIV-1 protease created targeted scoring functions to predict inhibition of these proteins. These targeted models outperform classical scoring functions. Conclusion: Here, we reviewed the development of machine learning scoring functions to predict binding affinity through the application of the program SAnDReS. Our studies show the superior predictive performance of the SAnDReS-developed models when compared with classical scoring functions available in programs such as AutoDock4, Molegro Virtual Docker, and AutoDock Vina.

Applications of Quantitative Structure-Activity Relationships (QSAR) based Virtual Screening in Drug Design: A Review

Mini-Reviews in Medicinal Chemistry ◽

10.2174/1389557520666200429102334 ◽

2020 ◽

Vol 20 (14) ◽

pp. 1375-1388 ◽

Cited By ~ 2

Author(s):

Patnala Ganga Raju Achary

Keyword(s):

Machine Learning ◽

Drug Discovery ◽

Virtual Screening ◽

Model Building ◽

Chemical Space ◽

Qsar Model ◽

Quantitative Structure ◽

Efficient Manner ◽

Qsar Analysis ◽

Structure Activity

Scientists and researchers around the globe generate a tremendous amount of information every day; for instance, more than 74 million molecules are so far registered in the Chemical Abstracts Service. According to a recent study, there are at present around 10^60 molecules that are classified as new drug-like molecules. The library of such molecules is now considered ‘dark chemical space’ or ‘dark chemistry’. To explore such hidden molecules scientifically, a good number of live and updated databases (protein, cell, tissue, structure, drug, etc.) are available today. The synchronization of three different sciences, ‘genomics’, ‘proteomics’ and ‘in-silico simulation’, will revolutionize the process of drug discovery. Screening a sizable number of drug-like molecules is a challenge, and it must be handled efficiently. Virtual screening (VS) is an important computational tool in the drug discovery process; however, experimental verification of the drugs is equally important for the drug development process. Quantitative structure-activity relationship (QSAR) analysis is one of the machine learning techniques extensively used in VS. QSAR is well known for its fast, high-throughput screening with a satisfactory hit rate. QSAR model building involves (i) chemo-genomics data collection from a database or the literature, (ii) calculation of the right descriptors from the molecular representation, (iii) establishing a relationship (model) between biological activity and the selected descriptors, and (iv) application of the QSAR model to predict the biological property of the molecules. All the hits obtained by the VS technique need to be experimentally verified. The present mini-review highlights the web-based machine learning tools, the role of QSAR in VS techniques, successful applications of QSAR-based VS leading to drug discovery, and the advantages and challenges of QSAR.
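The four QSAR model-building steps listed in the abstract above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the workflow of any tool covered by the review: the SMILES strings and activity values are made up, the single descriptor is a deliberately naive character count, and the model is a one-variable least-squares line. A real study would use curated data, proper cheminformatics descriptors, and a validated model.

```python
from statistics import mean

# (i) Data collection: hypothetical (SMILES, activity) pairs, not real measurements.
TRAINING_SET = [
    ("CC", 5.0),
    ("CCC", 7.0),
    ("CCCC", 9.0),
    ("CCCCCC", 13.0),
]


def carbon_count(smiles):
    """(ii) Toy descriptor: count of the characters 'C' and 'c' in a SMILES string.

    Naive on purpose; it would, for example, also count the 'C' in 'Cl'.
    """
    return float(sum(1 for ch in smiles if ch in "Cc"))


def fit_line(xs, ys):
    """(iii) Ordinary least squares for activity = slope * descriptor + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict(smiles, slope, intercept):
    """(iv) Apply the fitted model to a molecule outside the training set."""
    return slope * carbon_count(smiles) + intercept


descriptors = [carbon_count(smiles) for smiles, _ in TRAINING_SET]
activities = [activity for _, activity in TRAINING_SET]
slope, intercept = fit_line(descriptors, activities)
predicted_activity = predict("CCCCC", slope, intercept)
```

In a screening setting, step (iv) would be run over a whole library and the top-ranked molecules passed on for the experimental verification that the abstract calls for.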
