Beware of the generic machine learning-based scoring functions in structure-based virtual screening

Author(s):  
Chao Shen ◽  
Ye Hu ◽  
Zhe Wang ◽  
Xujun Zhang ◽  
Jinping Pang ◽  
...  

Abstract Machine learning-based scoring functions (MLSFs) have attracted extensive attention recently and are expected to be potential rescoring tools for structure-based virtual screening (SBVS). However, a major concern is whether MLSFs trained for generic use, rather than for a given target, can be consistently applicable to VS. In this study, a systematic assessment was carried out to re-evaluate the effectiveness of 14 reported MLSFs in VS. Overall, most of these MLSFs could hardly achieve satisfactory results for any dataset, and some could not even outperform the baseline of classical SFs such as Glide SP. An exception was RFscore-VS trained on the Directory of Useful Decoys-Enhanced dataset, which showed superior performance for most targets. Even so, it exhibited rather limited performance on targets dissimilar to the proteins in its training set. We also used the top three docking poses rather than only the top one for rescoring, and retrained the models with updated versions of the training sets, but only minor improvements were observed. Taken together, generic MLSFs may generalize too poorly to be applicable in real VS campaigns, and this type of method should therefore be used with considerable caution.
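The top-pose versus top-three-pose rescoring idea above can be sketched in a few lines; the pose scores and the `rescore_topk` helper are hypothetical stand-ins for what an MLSF would actually compute from protein-ligand features:

```python
# Illustrative sketch (not the paper's code): rescoring a ligand using its
# top-k docking poses instead of only the top-ranked pose. Higher score is
# assumed to mean a more active-like prediction.

def rescore_topk(pose_scores, k=3):
    """Return the best (highest) rescored value among the top-k poses."""
    return max(pose_scores[:k])

# Four docked poses for one ligand, in the docking program's rank order.
scores = [0.41, 0.77, 0.38, 0.12]
top1 = rescore_topk(scores, k=1)   # uses only the top-ranked pose
top3 = rescore_topk(scores, k=3)   # considers the top three poses
```

Taking the best value over the top-k poses hedges against the docking program ranking a good binding mode below an inferior one, which is why the study tried k = 3.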

2017 ◽  
Vol 7 (1) ◽  
Author(s):  
Maciej Wójcikowski ◽  
Pedro J. Ballester ◽  
Pawel Siedlecki

2020 ◽  
Author(s):  
Pedro Ballester

Interest in docking technologies has grown in parallel with the ever-increasing number and diversity of 3D models for macromolecular therapeutic targets. Structure-Based Virtual Screening (SBVS) aims at leveraging these experimental structures to discover the necessary starting points for the drug discovery process. It is now established that Machine Learning (ML) can strongly enhance the predictive accuracy of scoring functions for SBVS by exploiting large datasets of targets, molecules and their associations. However, with greater choice, the question of which ML-based scoring function is most suitable for prospective use on a given target has gained importance. Here we analyse two approaches to selecting an existing scoring function for the target, along with a third approach consisting in generating a scoring function tailored to the target. These analyses required discussing the limitations of popular SBVS benchmarks, the alternatives for benchmarking scoring functions for SBVS, and how to generate or use these scoring functions with freely available software.




2020 ◽  
Vol 25 (6) ◽  
pp. 655-664
Author(s):  
Wienand A. Omta ◽  
Roy G. van Heesbeen ◽  
Ian Shen ◽  
Jacob de Nobel ◽  
Desmond Robers ◽  
...  

There has been an increase in the use of machine learning and artificial intelligence (AI) for the analysis of image-based cellular screens. The accuracy of these analyses, however, depends greatly on the quality of the training sets used to build the machine learning models. We propose that unsupervised exploratory methods should first be applied to the data set to gain better insight into the quality of the data. This improves the selection and labeling of data for creating training sets before the application of machine learning. We demonstrate this using a high-content genome-wide small interfering RNA screen. We perform an unsupervised exploratory data analysis to facilitate the identification of four robust phenotypes, which we subsequently use as a training set for building a high-quality random forest machine learning model that differentiates the four phenotypes with an accuracy of 91.1% and a kappa of 0.85. Our approach enhanced our ability to extract new knowledge from the screen compared with the use of unsupervised methods alone.
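The two-stage workflow described above can be sketched roughly as follows; the data are synthetic Gaussian blobs standing in for per-cell image features, not the screen's actual measurements, and k-means is just one possible choice of unsupervised exploration:

```python
# Sketch: unsupervised exploration proposes phenotype labels, then a
# supervised random forest is trained on those labels. Purely illustrative.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-cell feature vectors from an image screen.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.0, random_state=0)

# Stage 1: unsupervised exploration proposes four candidate phenotypes.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Stage 2: the proposed labels become the training set for a random forest.
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
kappa = cohen_kappa_score(y_te, clf.predict(X_te))
```

The kappa statistic, as in the abstract, measures agreement beyond chance between the classifier's predictions and the held-out labels.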


Author(s):  
Chao Shen ◽  
Ye Hu ◽  
Zhe Wang ◽  
Xujun Zhang ◽  
Haiyang Zhong ◽  
...  

Abstract How to accurately estimate protein–ligand binding affinity remains a key challenge in computer-aided drug design (CADD). In many cases, the binding affinities predicted by classical scoring functions (SFs) have been shown to correlate poorly with experimentally measured biological activities. In the past few years, machine learning (ML)-based SFs have gradually emerged as potential alternatives and have outperformed classical SFs in a series of studies. In this study, to better recognize the potential of classical SFs, we conducted a comparative assessment of 25 commonly used SFs. The scoring power was systematically evaluated by refitting the individual energy terms of each SF with state-of-the-art ML methods in place of the original multiple linear regression. The results show that the newly developed ML-based SFs consistently performed better than the classical ones; in particular, gradient boosting decision tree (GBDT) and random forest (RF) achieved the best predictions in most cases. The newly developed ML-based SFs were also tested on another benchmark modified from PDBbind v2007, and the impacts of structural and sequence similarities were evaluated. The results indicate that the superiority of the ML-based SFs is fully guaranteed only when the training set contains a sufficient number of similar targets. Moreover, the effect of combining features from multiple SFs was explored, and the results indicate that combining NNscore2.0 with one to four other classical SFs yields the best scoring power. However, deriving a generic target-specific SF or SF combination proved infeasible.
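The refitting idea can be illustrated with a toy example: the same per-complex "energy terms" are fitted once with multiple linear regression and once with a gradient boosting model. The terms and the nonlinear affinity function below are invented for illustration, not taken from any real SF:

```python
# Sketch of refitting SF energy terms: linear regression vs. a GBDT model
# on the same feature matrix. The "affinity" is a synthetic nonlinear
# function, standing in for experimentally measured binding data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
terms = rng.normal(size=(600, 4))  # hypothetical vdW, H-bond, etc. terms

# Ground truth with an interaction and a nonlinearity a linear fit misses.
affinity = terms[:, 0] * terms[:, 1] + np.sin(terms[:, 2])

X_tr, X_te, y_tr, y_te = train_test_split(terms, affinity, random_state=0)
r2_linear = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
r2_gbdt = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
```

Because the synthetic target contains a term interaction, the GBDT refit recovers structure the original linear weighting cannot, mirroring the pattern the study reports.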


Geophysics ◽  
2018 ◽  
Vol 83 (2) ◽  
pp. V83-V97 ◽  
Author(s):  
Yongna Jia ◽  
Siwei Yu ◽  
Jianwei Ma

Acquisition technology advances, as well as the exploration of geologically complex areas, are pushing the quantity of data to be analyzed into the "big-data" era. In our related work, we found that a machine-learning method based on support vector regression (SVR) for intelligent seismic data interpolation can make full use of large data volumes as training data and can eliminate certain prior assumptions of existing methods, such as linear events, sparsity, or low rank. However, immense training sets not only contain high redundancy but also incur considerable computational cost, especially for high-dimensional seismic data. We have developed a criterion based on the Monte Carlo method for the intelligent reduction of training sets. For seismic data, the pixel values in each local patch can be regarded as a set of statistical data, and a variance value can be calculated for each patch. A high variance means that there are events centered around the corresponding patch or that the pixel values in the patch vary markedly; patches with high variances are regarded as more representative. The Monte Carlo method uses the variance as a constraint and selects the representative patches with higher probability through a series of random numbers. After the training set is intelligently reduced in this way, only these representative patches, constituting the new training set, are input to the SVR-based machine learning framework to construct a continuous regression model. Meanwhile, the patches with lower variances can be readily interpolated using a simple method and exert only a minor influence on the construction of the regression model; thus, the representative patches are called effective patches. Finally, the missing traces can be generated from the learned regression model. Numerical illustrations on 2D seismic data and results on 3D and 5D data show that the Monte Carlo method can intelligently select the effective patches as the new training set, which greatly decreases redundancy while preserving reconstruction quality.
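A minimal sketch of the variance-guided Monte Carlo selection, assuming random synthetic patches rather than real seismic gathers; the acceptance rule (keep a patch when its normalized variance exceeds a random draw) is one simple reading of the criterion described above:

```python
# Sketch: each patch's variance, normalized by the maximum, acts as its
# acceptance probability, so high-variance ("effective") patches are kept
# with higher probability while near-flat patches are mostly discarded.
import numpy as np

rng = np.random.default_rng(0)

def select_patches(patches, rng):
    """Keep patch i with probability var_i / max(var); return kept patches and mask."""
    variances = patches.reshape(len(patches), -1).var(axis=1)
    probs = variances / variances.max()
    keep = rng.random(len(patches)) < probs
    return patches[keep], keep

# Synthetic stand-ins: 50 nearly flat patches, 50 containing "events".
flat = rng.normal(0.0, 0.01, size=(50, 8, 8))
eventful = rng.normal(0.0, 1.0, size=(50, 8, 8))
patches = np.concatenate([flat, eventful])
selected, keep = select_patches(patches, rng)
```

The reduced set `selected` would then be fed to the SVR training stage, while the discarded low-variance patches are handled by a simple interpolation.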


2020 ◽  
Vol 499 (4) ◽  
pp. 6009-6017
Author(s):  
Y-L Mong ◽  
K Ackley ◽  
D K Galloway ◽  
T Killestein ◽  
J Lyman ◽  
...  

ABSTRACT The amount of observational data produced by time-domain astronomy is increasing exponentially. Human inspection alone is not an effective way to identify genuine transients in the data. An automatic real-bogus classifier is needed, and machine learning techniques are commonly used to achieve this goal. Building a training set with a sufficiently large number of verified transients is challenging because of the requirement of human verification. We present an approach for creating a training set that uses all detections in the science images as samples of real detections and all detections in the difference images, which are generated by the difference-imaging process used to detect transients, as samples of bogus detections. This strategy effectively minimizes the labour involved in data labelling for supervised machine learning methods. We demonstrate the utility of the training set by using it to train several classifiers, with the feature representation being the normalized pixel values in 21 × 21 pixel stamps centred at the detection position, observed with the Gravitational-wave Optical Transient Observer (GOTO) prototype. The real-bogus classifier trained with this strategy can provide up to 95 per cent prediction accuracy on real detections at a false alarm rate of 1 per cent.
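The labelling strategy can be sketched as follows; the stamps here are random arrays standing in for real cut-outs from science and difference images, and the normalization mirrors the per-stamp scaling of pixel values described above:

```python
# Sketch: build a labelled training set without human vetting by treating
# science-image detections as "real" (1) and difference-image detections
# as "bogus" (0). Stamps are synthetic placeholders for 21x21 cut-outs.
import numpy as np

rng = np.random.default_rng(0)

def normalize_stamp(stamp):
    """Scale a pixel stamp to zero mean and unit variance."""
    return (stamp - stamp.mean()) / stamp.std()

science_stamps = [rng.normal(100.0, 5.0, size=(21, 21)) for _ in range(10)]
difference_stamps = [rng.normal(0.0, 1.0, size=(21, 21)) for _ in range(10)]

X = np.stack([normalize_stamp(s) for s in science_stamps + difference_stamps])
y = np.array([1] * len(science_stamps) + [0] * len(difference_stamps))
```

`X` and `y` could then be passed to any supervised classifier; no detection in either set needed human labelling, which is the point of the strategy.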


Author(s):  
Miguel Couceiro ◽  
Nicolas Hug ◽  
Henri Prade ◽  
Gilles Richard

Training set extension is an important issue in machine learning. Indeed, when the examples at hand are available only in limited quantity, the performance of standard classifiers may decrease significantly, and it can be helpful to build additional examples. In this paper, we consider the use of analogical reasoning, and more particularly of analogical proportions, for extending training sets. Here the ground-truth labels are considered to be given by a (partially known) function. We examine the conditions required for such functions to ensure an error-free extension in a Boolean setting. To this end, we introduce the notion of Analogy Preserving (AP) functions, and we prove that their class is exactly the class of affine Boolean functions. This noteworthy theoretical result is complemented with an empirical investigation of approximate AP functions, which suggests that they remain suitable for training set extension.
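The AP property can be checked exhaustively for small arities. The sketch below uses the integer-difference form of the Boolean analogical proportion (a : b :: c : d iff a − b = c − d) and the "whenever the equation f(a) : f(b) :: f(c) : x is solvable, its solution must be f(d)" reading of preservation; it illustrates the theorem by showing an affine function (XOR) preserving analogies while a non-affine one (AND) does not:

```python
# Exhaustive check of Analogy Preservation for small Boolean functions.
from itertools import product

def solve(a, b, c):
    """Solution x of a : b :: c : x, or None if unsolvable in {0, 1}."""
    x = c - (a - b)
    return x if x in (0, 1) else None

def preserves_analogy(f, n):
    """True iff every solvable image proportion is solved by f(d)."""
    vecs = list(product((0, 1), repeat=n))
    for a, b, c, d in product(vecs, repeat=4):
        # Require a componentwise analogical proportion a : b :: c : d.
        if all((ai - bi) == (ci - di) for ai, bi, ci, di in zip(a, b, c, d)):
            x = solve(f(a), f(b), f(c))
            if x is not None and x != f(d):
                return False
    return True

xor2 = lambda v: v[0] ^ v[1]   # affine, hence analogy preserving
and2 = lambda v: v[0] & v[1]   # not affine, breaks some proportions
```

For AND, the quadruple a = (0,0), b = (1,0), c = (0,1), d = (1,1) is a componentwise proportion whose image 0 : 0 :: 0 : 1 is solvable with solution 0 ≠ f(d) = 1.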


2019 ◽  
Vol 35 (20) ◽  
pp. 3989-3995 ◽  
Author(s):  
Hongjian Li ◽  
Jiangjun Peng ◽  
Pavel Sidorov ◽  
Yee Leung ◽  
Kwong-Sak Leung ◽  
...  

Abstract Motivation Studies have shown that the accuracy of random forest (RF)-based scoring functions (SFs), such as RF-Score-v3, increases with more training samples, whereas that of classical SFs, such as X-Score, does not. Nevertheless, the impact of the similarity between training and test samples on this matter has not been studied in a systematic manner. It is therefore unclear how these SFs would perform when trained only on protein–ligand complexes that are highly dissimilar or highly similar to the test set. It is also unclear whether SFs based on machine learning algorithms other than RF can also improve accuracy with increasing training set size, and to what extent they learn from dissimilar or similar training complexes. Results We present a systematic study investigating how the accuracy of classical and machine-learning SFs varies with the protein–ligand complex similarity between training and test sets. We considered three types of similarity metrics, based on the comparison of either protein structures, protein sequences or ligand structures. Regardless of the similarity metric, we found that incorporating a larger proportion of similar complexes into the training set did not make classical SFs more accurate. In contrast, RF-Score-v3 was able to outperform X-Score even when trained on just 32% of the most dissimilar complexes, showing that its superior performance owes much to learning from training complexes dissimilar to those in the test set. In addition, we generated the first SF employing Extreme Gradient Boosting (XGBoost), XGB-Score, and observed that it also improves with training set size while outperforming the other SFs. Given the continuous growth of training datasets, the development of machine-learning SFs has become very appealing. Availability and implementation https://github.com/HongjianLi/MLSF Supplementary information Supplementary data are available at Bioinformatics online.
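A dissimilarity-based training split like the 32% one mentioned above can be sketched with toy binary fingerprints and Tanimoto similarity; a real study would compute protein-structure, sequence, or ligand similarities with dedicated tools rather than random bit vectors:

```python
# Sketch: keep only the fraction of training complexes least similar to
# the test set, using Tanimoto similarity on toy binary fingerprints.
import numpy as np

rng = np.random.default_rng(0)

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def most_dissimilar(train_fps, test_fps, fraction):
    """Indices of the training fraction least similar to any test sample."""
    max_sim = np.array(
        [max(tanimoto(t, q) for q in test_fps) for t in train_fps]
    )
    n_keep = int(len(train_fps) * fraction)
    return np.argsort(max_sim)[:n_keep]

train_fps = rng.integers(0, 2, size=(100, 64)).astype(bool)
test_fps = rng.integers(0, 2, size=(10, 64)).astype(bool)
keep = most_dissimilar(train_fps, test_fps, fraction=0.32)
```

Ranking by each training sample's maximum similarity to the test set, then keeping the bottom fraction, is the usual way such "most dissimilar" subsets are constructed.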

