Rendering Conventional Molecular Fingerprints for Virtual Screening Independent of Molecular Complexity and Size Effects

Abstract: Although deep learning can automatically extract features in relatively simple tasks such as image analysis, the construction of appropriate representations remains essential for molecular predictions due to intricate molecular complexity. Additionally, it is often expensive, time-consuming, and ethically constrained to generate labeled data for supervised learning in molecular sciences, leading to challenging small and diverse datasets. In this work, we develop a self-supervised learning approach via a masking strategy to pre-train transformer models on over 700 million unlabeled molecules from multiple databases. The intrinsic chemical logic learned from this approach enables the extraction of predictive representations from task-specific molecular sequences during fine-tuning. To understand the importance of self-supervised learning from unlabeled molecules, we assemble three models with different combinations of databases. Moreover, we propose a new protocol based on data traits to automatically select the optimal model for a specific predictive task. To validate the proposed representation and protocol, we consider 10 benchmark datasets in addition to 38 ligand-based virtual screening datasets. Extensive validation indicates that the proposed representation and protocol show superb performance.
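
The masking strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the tokenization, the masking rate, or the mask symbol, so the character-level tokens, the 15% rate, the `[MASK]` string, and the function name `mask_tokens` below are all assumptions chosen for illustration, not the authors' method.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(tokens, mask_rate, rng):
    """Hide a random subset of tokens and return (masked sequence, targets).

    `targets` maps each masked position to the original token, which is
    what a model would be trained to predict during pre-training.
    """
    if not tokens:
        raise ValueError("cannot mask an empty sequence")
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


# Character-level tokens of a SMILES string (aspirin); a simplification,
# since real SMILES tokenizers treat multi-character atoms as one token.
tokens = list("CC(=O)Oc1ccccc1C(=O)O")
masked, targets = mask_tokens(tokens, 0.15, random.Random(0))
print("".join(masked))
print(targets)
```

Because the labels are recovered from the input itself, no experimental annotation is needed, which is why this kind of objective suits large unlabeled molecule collections.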

Using Domain-specific Fingerprints Generated Through Neural Networks to Enhance Ligand-based Virtual Screening.

10.26434/chemrxiv.12894800.v1 ◽

2020 ◽

Author(s):

Janosch Menke ◽

Oliver Koch

Keyword(s):

Neural Network ◽

Neural Networks ◽

Virtual Screening ◽

Molecular Fingerprint ◽

Molecular Fingerprints ◽

Domain Specific ◽

Trained Neural Network ◽

Hidden Layer ◽

Graph Neural Networks ◽

Combine Information

Molecular fingerprints are essential for different cheminformatics approaches like similarity-based virtual screening. In this work, the concept of neural (network) fingerprints in the context of similarity search is introduced in which the activation of the last hidden layer of a trained neural network represents the molecular fingerprint. The neural fingerprint performance of five different neural network architectures was analyzed and compared to the well-established Extended Connectivity Fingerprint (ECFP) and an autoencoder-based fingerprint. This is done using a published compound dataset with known bioactivity on 160 different kinase targets. We expect neural networks to combine information about the molecular space of already known bioactive compounds together with the information on the molecular structure of the query and by doing so enrich the fingerprint. The results show that indeed neural fingerprints can greatly improve the performance of similarity searches. Most importantly, it could be shown that the neural fingerprint performs well even for kinase targets that were not included in the training. Surprisingly, while Graph Neural Networks (GNNs) are thought to offer an advantageous alternative, the best performing neural fingerprints were based on traditional fully connected layers using the ECFP4 as input. The best performing kinase-specific neural fingerprint will be provided for public use.
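
The idea of reading a fingerprint off the last hidden layer can be sketched in a few lines. The weights below are made-up toy values standing in for a trained network, the 4-bit inputs stand in for ECFP4 bit vectors, and cosine similarity is an assumed choice of metric; the abstract does not specify any of these.

```python
from math import sqrt


def relu(values):
    return [max(0.0, v) for v in values]


def dense(x, weights, bias):
    # One fully connected layer: each row of `weights` feeds one output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def neural_fingerprint(x, hidden_layers):
    """Return the activation of the last hidden layer.

    The output (classification) layer is deliberately not applied: the
    hidden activation itself serves as the fingerprint.
    """
    h = x
    for weights, bias in hidden_layers:
        h = relu(dense(h, weights, bias))
    return h


def cosine_similarity(u, v):
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


# Hypothetical "trained" weights: 4 inputs -> 3 hidden units -> 2 hidden units.
HIDDEN_LAYERS = [
    ([[1, 0, 1, 0], [0, 1, 0, 1], [1, -1, 0, 0]], [0.0, 0.0, 0.0]),
    ([[1, 1, 0], [0, 1, -1]], [0.0, 0.0]),
]

query_bits = [1, 0, 1, 1]      # toy stand-in for an ECFP4 bit vector
candidate_bits = [1, 1, 0, 0]

fp_query = neural_fingerprint(query_bits, HIDDEN_LAYERS)
fp_candidate = neural_fingerprint(candidate_bits, HIDDEN_LAYERS)
print(fp_query, fp_candidate, cosine_similarity(fp_query, fp_candidate))
```

In a similarity search, every library compound would be passed through the same network once, and the query would then be ranked against the stored hidden-layer vectors.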

Similarity-based approaches to virtual screening

Biochemical Society Transactions ◽

10.1042/bst0310603 ◽

2003 ◽

Vol 31 (3) ◽

pp. 603-606 ◽

Cited By ~ 73

Author(s):

P. Willett

Keyword(s):

Virtual Screening ◽

Similarity Measure ◽

Similarity Measures ◽

Similarity Coefficients ◽

Molecular Fingerprints ◽

Tanimoto Coefficient ◽

Graph Theoretic

Current similarity measures for virtual screening are based on the use of molecular fingerprints and the Tanimoto coefficient. This paper describes two ways in which one can increase the effectiveness of similarity-based virtual screening: using similarity coefficients other than the Tanimoto coefficient for the comparison of molecular fingerprints; and using a graph-theoretic similarity measure based on the largest substructure common to a pair of molecules.
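
As a rough illustration of comparing fingerprints with the Tanimoto coefficient and with alternative coefficients, the sketch below represents each fingerprint as the set of its "on" bit positions. The Dice and cosine coefficients are shown only as common examples of alternatives; the abstract does not say which coefficients the paper evaluates, and the bit sets are invented.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto coefficient: shared bits / bits set in either fingerprint."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def dice(a, b):
    """Dice coefficient: 2 * shared bits / total bits set in both."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """Cosine coefficient for binary fingerprints."""
    denom = sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


# Invented fingerprints, given as sets of "on" bit indices.
query = {1, 4, 7, 9}
candidate = {1, 4, 8, 9, 12}

for name, coefficient in (("Tanimoto", tanimoto), ("Dice", dice), ("cosine", cosine)):
    print(name, round(coefficient(query, candidate), 3))
```

All three coefficients use the same counts (bits in each fingerprint and bits in common) and differ only in how they normalize the overlap, which is why they can rank the same library differently.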

Stacking Multiple Molecular Fingerprints for Improving Ligand-Based Virtual Screening

Intelligent Computing Theories and Application - Lecture Notes in Computer Science ◽

10.1007/978-3-319-95933-7_35 ◽

2018 ◽

pp. 279-288 ◽

Cited By ~ 3

Author(s):

Yusuke Matsuyama ◽

Takashi Ishida

Keyword(s):

Virtual Screening ◽

Molecular Fingerprints

A Convolutional Neural Network for Virtual Screening of Molecular Fingerprints

Lecture Notes in Computer Science - Image Analysis and Processing – ICIAP 2019 ◽

10.1007/978-3-030-30642-7_36 ◽

2019 ◽

pp. 399-409

Author(s):

Isabella Mendolia ◽

Salvatore Contino ◽

Ugo Perricone ◽

Roberto Pirrone ◽

Edoardo Ardizzone

Keyword(s):

Neural Network ◽

Virtual Screening ◽

Convolutional Neural Network ◽

Molecular Fingerprints

Machine-Learning-Enabled Virtual Screening for Inhibitors of Lysine-Specific Histone Demethylase 1

Molecules ◽

10.3390/molecules26247492 ◽

2021 ◽

Vol 26 (24) ◽

pp. 7492

Author(s):

Jiajun Zhou ◽

Shiying Wu ◽

Boon Giin Lee ◽

Tianwei Chen ◽

Ziqi He ◽

...

Keyword(s):

Machine Learning ◽

Virtual Screening ◽

Structural Information ◽

Coefficient Of Determination ◽

Support Vector ◽

Molecular Fingerprints ◽

Lysine Specific Demethylase ◽

Zinc Database ◽

Machine Learning Approach ◽

Support Vector Regressor

A machine learning approach has been applied to virtual screening for lysine-specific demethylase 1 (LSD1) inhibitors. LSD1 is an important anti-cancer target. Machine learning models to predict activity were constructed using Morgan molecular fingerprints. The dataset, consisting of 931 molecules with LSD1 inhibition activity, was obtained from the ChEMBL database. An evaluation of several candidate algorithms on the main dataset revealed that the support vector regressor gave the best model, with a coefficient of determination (R²) of 0.703. Virtual screening, using this model, identified five predicted potent inhibitors from the ZINC database comprising more than 300,000 molecules. The virtual screening recovered a known inhibitor, RN1, as well as four compounds where activity against LSD1 had not previously been suggested. Thus, we performed a machine-learning-enabled virtual screening of LSD1 inhibitors using only the structural information of the molecules.
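
For readers unfamiliar with the metric quoted above, the coefficient of determination can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses invented activity values, not data from the paper, and the function name `r_squared` is a placeholder.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1 - ss_res / ss_tot


# Invented measured vs. predicted activities, for illustration only.
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]
print(r_squared(measured, predicted))
```

A value of 1 means the predictions reproduce the measurements exactly, while 0 means the model does no better than always predicting the mean activity.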

Convolutional architectures for virtual screening

BMC Bioinformatics ◽

10.1186/s12859-020-03645-9 ◽

2020 ◽

Vol 21 (S8) ◽

Author(s):

Isabella Mendolia ◽

Salvatore Contino ◽

Ugo Perricone ◽

Edoardo Ardizzone ◽

Roberto Pirrone

Keyword(s):

Virtual Screening ◽

Enrichment Factor ◽

High Precision ◽

Bioactive Compounds ◽

Prediction Accuracy ◽

State Of The Art ◽

Vector Representation ◽

Early Screening ◽

Molecular Fingerprints ◽

Screening Algorithm

Abstract. Background: A virtual screening algorithm has to adapt to the different stages of this process. Early screening needs to ensure that all bioactive compounds are ranked in the first positions regardless of the number of false positives, while a second screening round is aimed at increasing the prediction accuracy. Results: A novel CNN architecture is presented to this aim, which predicts bioactivity of candidate compounds on CDK1 using a combination of molecular fingerprints as their vector representation, and has been trained suitably to achieve good results as regards both enrichment factor and accuracy in different screening modes (98.55% accuracy in active-only selection, and 98.88% in high precision discrimination). Conclusion: The proposed architecture outperforms state-of-the-art ML approaches, and some interesting insights on molecular fingerprints are derived.
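
The enrichment factor mentioned in the Results can be sketched as follows, using one common definition: the hit rate among the top-ranked fraction of the library divided by the hit rate in the whole library. The abstract does not state which definition or cutoff the authors use, and the scores and labels below are invented.

```python
def enrichment_factor(scores, labels, fraction):
    """Enrichment factor at a given top fraction of the ranked library.

    `labels` holds 1 for active and 0 for inactive compounds; compounds
    are ranked by descending score.
    """
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one active")
    n_top = max(1, round(n * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (total_actives / n)


# Invented screening result: 10 compounds, 2 of them active.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(enrichment_factor(scores, labels, 0.2))
```

A value above 1 indicates that actives are concentrated near the top of the ranking, which is the property early screening rounds care about most.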

Computational Prediction of Potential Inhibitors of the Main Protease of SARS-CoV-2

Frontiers in Chemistry ◽

10.3389/fchem.2020.590263 ◽

2020 ◽

Vol 8 ◽

Author(s):

Renata Abel ◽

María Paredes Ramos ◽

Qiaofeng Chen ◽

Horacio Pérez-Sánchez ◽

Flaminia Coluzzi ◽

...

Keyword(s):

Virtual Screening ◽

Drug Repurposing ◽

Computational Prediction ◽

Pulmonary Diseases ◽

Molecular Fingerprints ◽

Computational Approaches ◽

Main Protease ◽

Dynamics Simulations ◽

Potential Inhibitors ◽

Clinical Potential

The rapidly developing pandemic, known as coronavirus disease 2019 (COVID-19) and caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has recently spread across 213 countries and territories. This pandemic is a dire public health threat—particularly for those suffering from hypertension, cardiovascular diseases, pulmonary diseases, or diabetes; without approved treatments, it is likely to persist or recur. To facilitate the rapid discovery of inhibitors with clinical potential, we have applied ligand- and structure-based computational approaches to develop a virtual screening methodology that allows us to predict potential inhibitors. In this work, virtual screening was performed against two natural products databases, Super Natural II and Traditional Chinese Medicine. Additionally, we have used an integrated drug repurposing approach to computationally identify potential inhibitors of the main protease of SARS-CoV-2 in databases of drugs (both approved and withdrawn). Roughly 360,000 compounds were screened using various molecular fingerprints and molecular docking methods; of these, 80 docked compounds were evaluated in detail, and the 12 best hits from four datasets were further inspected via molecular dynamics simulations. Finally, toxicity and cytochrome inhibition profiles were computationally analyzed for the selected candidate compounds.

Supplemental Material for Set Size Effects in Spatial Updating Are Independent of the Online/Offline Updating Strategy

Journal of Experimental Psychology Human Perception & Performance ◽

10.1037/xhp0000756.supp ◽

2020 ◽

Keyword(s):

Size Effects ◽

Spatial Updating ◽

Set Size ◽

Set Size Effects
