Rendering Conventional Molecular Fingerprints for Virtual Screening Independent of Molecular Complexity and Size Effects

ChemMedChem ◽  
2010 ◽  
Vol 5 (6) ◽  
pp. 859-868 ◽  
Author(s):  
Britta Nisius ◽  
Jürgen Bajorath


2021 ◽  
Author(s):  
Dong Chen ◽  
Guowei Wei ◽  
Feng Pan

Abstract Although deep learning can automatically extract features in relatively simple tasks such as image analysis, the construction of appropriate representations remains essential for molecular predictions due to intricate molecular complexity. Additionally, it is often expensive, time-consuming, and ethically constrained to generate labeled data for supervised learning in molecular sciences, leading to challenging small and diverse datasets. In this work, we develop a self-supervised learning approach via a masking strategy to pre-train transformer models from over 700 million unlabeled molecules in multiple databases. The intrinsic chemical logic learned from this approach enables the extraction of predictive representations from task-specific molecular sequences in a fine-tuned process. To understand the importance of self-supervised learning from unlabeled molecules, we assemble three models with different combinations of databases. Moreover, we propose a new protocol based on data traits to automatically select the optimal model for a specific predictive task. To validate the proposed representation and protocol, we consider 10 benchmark datasets in addition to 38 ligand-based virtual screening datasets. Extensive validation indicates that the proposed representation and protocol show superb performance.
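The masking strategy mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the tokenization or the masking rate, so the character-level tokens, the 15% rate, and the `[MASK]` symbol below are assumptions borrowed from common masked-language-model practice, not the authors' settings.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(tokens, mask_rate, rng):
    """Hide a random subset of tokens; return the masked sequence and the hidden targets."""
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]  # the token the model would be asked to recover
        masked[i] = MASK
    return masked, targets


smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used only as an illustrative input
tokens = list(smiles)             # character-level tokens (an assumption)
masked, targets = mask_tokens(tokens, mask_rate=0.15, rng=random.Random(0))
print("".join(masked))
print(targets)
```

During pre-training, a model would be asked to predict each entry of `targets` from the surrounding unmasked tokens; no activity labels are needed at that stage.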


2020 ◽  
Author(s):  
Janosch Menke ◽  
Oliver Koch

Molecular fingerprints are essential for different cheminformatics approaches like similarity-based virtual screening. In this work, the concept of neural (network) fingerprints in the context of similarity search is introduced in which the activation of the last hidden layer of a trained neural network represents the molecular fingerprint. The neural fingerprint performance of five different neural network architectures was analyzed and compared to the well-established Extended Connectivity Fingerprint (ECFP) and an autoencoder-based fingerprint. This is done using a published compound dataset with known bioactivity on 160 different kinase targets. We expect neural networks to combine information about the molecular space of already known bioactive compounds together with the information on the molecular structure of the query and by doing so enrich the fingerprint. The results show that indeed neural fingerprints can greatly improve the performance of similarity searches. Most importantly, it could be shown that the neural fingerprint performs well even for kinase targets that were not included in the training. Surprisingly, while Graph Neural Networks (GNNs) are thought to offer an advantageous alternative, the best performing neural fingerprints were based on traditional fully connected layers using the ECFP4 as input. The best performing kinase-specific neural fingerprint will be provided for public use.
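As a rough sketch of the idea described above, the snippet below passes a bit-vector input through the hidden layers of a small fully connected network and uses the last hidden activation as the fingerprint. The layer sizes, the ReLU activation, the random weights (stand-ins for trained weights), and the cosine similarity are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def dense(inputs, weights, biases):
    """One fully connected layer followed by ReLU."""
    return [relu(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def neural_fingerprint(bits, hidden_layers):
    """Run only the hidden layers; the last hidden activation is the fingerprint."""
    h = [float(b) for b in bits]
    for weights, biases in hidden_layers:
        h = dense(h, weights, biases)
    return h


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


rng = random.Random(0)


def random_layer(n_in, n_out):
    # Random weights stand in for a network trained on bioactivity data.
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


hidden_layers = [random_layer(8, 6), random_layer(6, 4)]
query = [1, 0, 1, 1, 0, 0, 1, 0]       # toy 8-bit input fingerprint
candidate = [1, 0, 1, 0, 0, 0, 1, 1]
fp_query = neural_fingerprint(query, hidden_layers)
fp_candidate = neural_fingerprint(candidate, hidden_layers)
print(cosine_similarity(fp_query, fp_candidate))
```

In a real setting the output layer used during training is discarded at search time, and database compounds are ranked by their similarity to the query in this learned space.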


2003 ◽  
Vol 31 (3) ◽  
pp. 603-606 ◽  
Author(s):  
P. Willett

Current similarity measures for virtual screening are based on the use of molecular fingerprints and the Tanimoto coefficient. This paper describes two ways in which one can increase the effectiveness of similarity-based virtual screening: using similarity coefficients other than the Tanimoto coefficient for the comparison of molecular fingerprints; and using a graph-theoretic similarity measure based on the largest substructure common to a pair of molecules.
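For reference, the Tanimoto coefficient and two commonly used alternatives can be written in a few lines when a fingerprint is represented as the set of its "on" bit positions. This is a minimal sketch; the choice of Dice and cosine as the alternative coefficients and the toy bit sets are assumptions for illustration, not taken from the paper.

```python
def tanimoto(a, b):
    """Shared on-bits divided by on-bits present in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(a, b):
    """Twice the shared on-bits divided by the total on-bits of both fingerprints."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """Shared on-bits divided by the geometric mean of the two on-bit counts."""
    denom = (len(a) * len(b)) ** 0.5
    return len(a & b) / denom if denom else 0.0


query = {1, 4, 7, 9}          # toy fingerprints as sets of on-bit indices
candidate = {1, 4, 8, 9, 12}
print(tanimoto(query, candidate))  # 3 shared bits / 6 bits in the union = 0.5
print(dice(query, candidate))
print(cosine(query, candidate))
```

The three coefficients use the same shared-bit count but normalize it differently, which is one reason they can rank the same database differently.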


Molecules ◽  
2021 ◽  
Vol 26 (24) ◽  
pp. 7492
Author(s):  
Jiajun Zhou ◽  
Shiying Wu ◽  
Boon Giin Lee ◽  
Tianwei Chen ◽  
Ziqi He ◽  
...  

A machine learning approach has been applied to virtual screening for lysine specific demethylase 1 (LSD1) inhibitors. LSD1 is an important anti-cancer target. Machine learning models to predict activity were constructed using Morgan molecular fingerprints. The dataset, consisting of 931 molecules with LSD1 inhibition activity, was obtained from the ChEMBL database. An evaluation of several candidate algorithms on the main dataset revealed that the support vector regressor gave the best model, with a coefficient of determination (R²) of 0.703. Virtual screening, using this model, identified five predicted potent inhibitors from the ZINC database comprising more than 300,000 molecules. The virtual screening recovered a known inhibitor, RN1, as well as four compounds for which activity against LSD1 had not previously been suggested. Thus, we performed a machine-learning-enabled virtual screening of LSD1 inhibitors using only the structural information of the molecules.
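The coefficient of determination reported above compares a model's squared error with the variance of the observed values. A minimal sketch of the standard formula follows; the activity values are made-up numbers for illustration and are not data from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted activities (illustrative only).
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]
print(r_squared(observed, predicted))  # SS_res = 1.0, SS_tot = 5.0, so R^2 = 0.8
```

A value of 1 means perfect prediction, and 0 means the model does no better than always predicting the mean of the observed values.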


2020 ◽  
Vol 21 (S8) ◽  
Author(s):  
Isabella Mendolia ◽  
Salvatore Contino ◽  
Ugo Perricone ◽  
Edoardo Ardizzone ◽  
Roberto Pirrone

Abstract. Background: A virtual screening algorithm has to adapt to the different stages of the screening process. Early screening needs to ensure that all bioactive compounds are ranked in the first positions regardless of the number of false positives, while a second screening round aims to increase prediction accuracy. Results: A novel CNN architecture is presented for this purpose. It predicts the bioactivity of candidate compounds on CDK1 using a combination of molecular fingerprints as their vector representation, and it has been trained to achieve good results in terms of both enrichment factor and accuracy in different screening modes (98.55% accuracy in active-only selection, and 98.88% in high-precision discrimination). Conclusion: The proposed architecture outperforms state-of-the-art ML approaches, and some interesting insights on molecular fingerprints are derived.
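The enrichment factor mentioned in this abstract is commonly defined as the hit rate in the top-ranked fraction of a screened library divided by the hit rate in the whole library. The sketch below follows that common definition; the abstract does not state which variant the authors used, and the scores and labels are invented for illustration.

```python
def enrichment_factor(scores, labels, fraction):
    """Hit rate among the top-scoring fraction divided by the overall hit rate."""
    total_actives = sum(labels)
    if total_actives == 0:
        raise ValueError("enrichment factor is undefined without any actives")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    actives_in_top = sum(label for _, label in ranked[:n_top])
    return (actives_in_top / n_top) / (total_actives / len(labels))


# Hypothetical screen: 10 compounds, 2 of them active (label 1).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(enrichment_factor(scores, labels, fraction=0.2))  # (1/2) / (2/10) = 2.5
```

An enrichment factor above 1 means the ranking places actives near the top more often than random selection would.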


2020 ◽  
Vol 8 ◽  
Author(s):  
Renata Abel ◽  
María Paredes Ramos ◽  
Qiaofeng Chen ◽  
Horacio Pérez-Sánchez ◽  
Flaminia Coluzzi ◽  
...  

The rapidly developing pandemic, known as coronavirus disease 2019 (COVID-19) and caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has recently spread across 213 countries and territories. This pandemic is a dire public health threat—particularly for those suffering from hypertension, cardiovascular diseases, pulmonary diseases, or diabetes; without approved treatments, it is likely to persist or recur. To facilitate the rapid discovery of inhibitors with clinical potential, we have applied ligand- and structure-based computational approaches to develop a virtual screening methodology that allows us to predict potential inhibitors. In this work, virtual screening was performed against two natural products databases, Super Natural II and Traditional Chinese Medicine. Additionally, we have used an integrated drug repurposing approach to computationally identify potential inhibitors of the main protease of SARS-CoV-2 in databases of drugs (both approved and withdrawn). Roughly 360,000 compounds were screened using various molecular fingerprints and molecular docking methods; of these, 80 docked compounds were evaluated in detail, and the 12 best hits from four datasets were further inspected via molecular dynamics simulations. Finally, toxicity and cytochrome inhibition profiles were computationally analyzed for the selected candidate compounds.

