FragNet, a Contrastive Learning-Based Transformer Model for Clustering, Interpreting, Visualizing, and Navigating Chemical Space

Aditya Divyakant Shrivastava; Douglas B. Kell

doi:10.3390/molecules26072065

Molecules ◽

10.3390/molecules26072065 ◽

2021 ◽

Vol 26 (7) ◽

pp. 2065

Author(s):

Aditya Divyakant Shrivastava ◽

Douglas B. Kell

Keyword(s):

Euclidean Distance ◽

Molecular Similarity ◽

Chemical Space ◽

Pairwise Comparison ◽

Perceived Similarity ◽

Training Set ◽

Molecular Fingerprints ◽

Latent Space ◽

Transformer Model ◽

Effective Dimensionality

The question of molecular similarity is core in cheminformatics and is usually assessed via a pairwise comparison based on vectors of properties or molecular fingerprints. We recently exploited variational autoencoders to embed 6M molecules in a chemical space, such that their (Euclidean) distance within the latent space so formed could be assessed within the framework of the entire molecular set. However, the standard objective function used did not seek to manipulate the latent space so as to cluster the molecules based on any perceived similarity. Using a set of some 160,000 molecules of biological relevance, we here bring together three modern elements of deep learning to create a novel and disentangled latent space, viz transformers, contrastive learning, and an embedded autoencoder. The effective dimensionality of the latent space was varied such that clear separation of individual types of molecules could be observed within individual dimensions of the latent space. The capacity of the network was such that many dimensions were not populated at all. As before, we assessed the utility of the representation by comparing clozapine with its near neighbors, and we also did the same for various antibiotics related to flucloxacillin. Transformers, especially when as here coupled with contrastive learning, effectively provide one-shot learning and lead to a successful and disentangled representation of molecular latent spaces that at once uses the entire training set in their construction while allowing “similar” molecules to cluster together in an effective and interpretable way.
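The abstract describes assessing similarity as Euclidean distance between molecules embedded in a latent space, and checking the result by looking at a query molecule's near neighbors. The following is a minimal sketch of that lookup only, not the FragNet model itself: the latent vectors and the names `mol_A` to `mol_D` are made-up placeholders, and the helper functions are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length latent vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query_name, embeddings, k=2):
    """Return the k molecules closest to the query, excluding the query."""
    query = embeddings[query_name]
    others = [(name, vec) for name, vec in embeddings.items() if name != query_name]
    ranked = sorted(others, key=lambda item: euclidean(query, item[1]))
    return [name for name, _ in ranked[:k]]


# Toy 3-dimensional "latent space"; the values are invented for illustration.
latent = {
    "mol_A": [0.0, 0.0, 1.0],
    "mol_B": [0.1, 0.0, 0.9],
    "mol_C": [2.0, 2.0, 0.0],
    "mol_D": [0.0, 0.3, 1.0],
}

neighbors = nearest_neighbors("mol_A", latent, k=2)
```

In a real setting the vectors would come from a trained encoder, and a linear scan like this would likely be replaced by an approximate nearest-neighbor index for large molecular sets.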

Use of Molecular Similarity Indices for QSAR Training Set Selection

SAR and QSAR in Environmental Research ◽

10.1080/10629369508050154 ◽

1995 ◽

Vol 3 (4) ◽

pp. 279-292 ◽

Cited By ~ 1

Author(s):

I. T. Cousins ◽

M. T. D. Cronin ◽

J. C. Dearden ◽

C. D. Watts

Keyword(s):

Molecular Similarity ◽

Training Set ◽

Similarity Indices ◽

Training Set Selection

A Novel Query Strategy-Based Rank Batch-Mode Active Learning Method for High-Resolution Remote Sensing Image Classification

Remote Sensing ◽

10.3390/rs13112234 ◽

2021 ◽

Vol 13 (11) ◽

pp. 2234

Author(s):

Xin Luo ◽

Huaqiang Du ◽

Guomo Zhou ◽

Xuejian Li ◽

Fangjie Mao ◽

...

Keyword(s):

Land Use ◽

Active Learning ◽

Euclidean Distance ◽

Urban Land Use ◽

Misclassification Rate ◽

Training Set ◽

Batch Mode ◽

Land Use Types ◽

Information Divergence ◽

Uncertainty Score

An informative training set is necessary for ensuring the robust performance of the classification of very-high-resolution remote sensing (VHRRS) images, but labeling work is often difficult, expensive, and time-consuming. This makes active learning (AL) an important part of an image analysis framework. AL aims to efficiently build a representative and efficient library of training samples that are most informative for the underlying classification task, thereby minimizing the cost of obtaining labeled data. Based on ranked batch-mode active learning (RBMAL), this paper proposes a novel combined query strategy of spectral information divergence lowest confidence uncertainty sampling (SIDLC), called RBSIDLC. The base classifier of random forest (RF) is initialized by using a small initial training set, and each unlabeled sample is analyzed to obtain the classification uncertainty score. A spectral information divergence (SID) function is then used to calculate the similarity score, and according to the final score, the unlabeled samples are ranked in descending lists. The most “valuable” samples are selected according to ranked lists and then labeled by the analyst/expert (also called the oracle). Finally, these samples are added to the training set, and the RF is retrained for the next iteration. The whole procedure is iteratively implemented until a stopping criterion is met. The results indicate that RBSIDLC achieves high-precision extraction of urban land use information based on VHRRS; the accuracy of extraction for each land-use type is greater than 90%, and the overall accuracy (OA) is greater than 96%. After the SID replaces the Euclidean distance in the RBMAL algorithm, the RBSIDLC method greatly reduces the misclassification rate among different land types. Therefore, the similarity function based on SID performs better than that based on the Euclidean distance. 
In addition, the OA of RF classification is greater than 90%, suggesting that it is feasible to use RF to estimate the uncertainty score. Compared with the three single query strategies of other AL methods, sample labeling with the SIDLC combined query strategy yields a lower cost and higher quality, thus effectively reducing the misclassification rate of different land use types. For example, compared with the Batch_Based_Entropy (BBE) algorithm, RBSIDLC improves the precision of barren land extraction by 37% and that of vegetation by 14%. The 25 characteristics of different land use types screened by RF cross-validation (RFCV) combined with the permutation method exhibit an excellent separation degree, and the results provide the basis for VHRRS information extraction in urban land use settings based on RBSIDLC.
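The two ingredients named in the abstract, spectral information divergence (SID) and a least-confidence uncertainty score, can be sketched as follows. This assumes the commonly used definition of SID as the symmetric Kullback-Leibler divergence between two spectra normalized to sum to one, and least confidence as one minus the highest class probability. How RBSIDLC combines and ranks the two scores is not stated in the abstract, so no combination is attempted here; the function names and toy values are hypothetical.

```python
import math


def _normalize(spectrum):
    """Scale a strictly positive spectrum so its values sum to 1."""
    if any(v <= 0 for v in spectrum):
        raise ValueError("SID as sketched here needs strictly positive values")
    total = sum(spectrum)
    return [v / total for v in spectrum]


def sid(x, y):
    """Spectral information divergence: KL(p||q) + KL(q||p)."""
    p, q = _normalize(x), _normalize(y)
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return kl_pq + kl_qp


def least_confidence(class_probs):
    """Uncertainty as 1 minus the probability of the most likely class."""
    return 1.0 - max(class_probs)


# Toy spectra and classifier outputs, invented for illustration.
same = sid([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = sid([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
uncertain = least_confidence([0.5, 0.3, 0.2])
```

Because SID works on normalized spectra, it is insensitive to overall brightness, which is one plausible reason it can behave differently from Euclidean distance on raw band values.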

A New Measure of Pulse Rate Variability and Detection of Atrial Fibrillation Based on Improved Time Synchronous Averaging

Computational and Mathematical Methods in Medicine ◽

10.1155/2021/5597559 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Xiaodong Ding ◽

Yiqin Wang ◽

Yiming Hao ◽

Yi Lv ◽

Rui Chen ◽

...

Keyword(s):

Atrial Fibrillation ◽

Pulse Rate ◽

Euclidean Distance ◽

Operating Characteristic ◽

Pulse Wave ◽

Characteristic Curve ◽

Signal Quality ◽

Training Set ◽

Pulse Rate Variability ◽

Operating Characteristic Curve

Background. Pulse rate variability monitoring and atrial fibrillation detection algorithms have been widely used in wearable devices, but the accuracies of these algorithms are restricted by the signal quality of the pulse wave. Time synchronous averaging is a powerful noise reduction method for periodic and approximately periodic signals. It is usually used to extract single-period pulse waveforms, but it has not traditionally been applied to pulse rate variability monitoring or atrial fibrillation detection. If this method is improved properly, it may provide a new way to measure pulse rate variability and to detect atrial fibrillation, which may have some potential advantages under the condition of poor signal quality. Objective. The objective of this paper was to develop a new measure of pulse rate variability by improving existing time synchronous averaging and to detect atrial fibrillation by the new measure of pulse rate variability. Methods. During time synchronous averaging, two adjacent periods were regarded as the basic unit to calculate the average signal, and the difference between waveforms of the two adjacent periods was the new measure of pulse rate variability. Three types of distance measures (Euclidean distance, Manhattan distance, and cosine distance) were tested to measure this difference on a simulated training set with a capacity of 1000. The distance measure that could accurately distinguish regular from irregular pulse rates was used to detect atrial fibrillation on the testing set with a capacity of 62 (11 with atrial fibrillation, 8 with premature contraction, and 43 with sinus rhythm). The receiver operating characteristic curve was used to evaluate the performance of the indexes. Results. The Euclidean distance between waveforms of the two adjacent periods performs best on the training set. On the testing set, the Euclidean distance in the atrial fibrillation group is significantly higher than that of the other two groups. 
The area under the receiver operating characteristic curve to identify atrial fibrillation was 0.998. With the threshold of 2.1, the accuracy, sensitivity, and specificity were 98.39%, 100%, and 98.04%, respectively. This new index can detect atrial fibrillation from the pulse wave signal. Conclusion. This algorithm not only provides a new perspective to detect atrial fibrillation (AF) but also accomplishes the monitoring of pulse rate variability (PRV) and the extraction of the single-period pulse wave through the same technical route, which may promote the popularization and application of the pulse wave.
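The three distance measures compared in the Methods can be sketched on adjacent single-period waveforms as below. This is an illustrative sketch, not the authors' pipeline: it assumes each period has already been segmented and resampled to the same length, and the toy waveforms are invented. The second part checks the reported percentages against one count-consistent reading of the 62-record testing set (11 AF records all detected, 1 false positive among the 51 non-AF records); those counts are inferred from the percentages and are not given in the abstract.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def adjacent_period_distances(periods, dist):
    """Distance between each period and the one that follows it."""
    return [dist(p, q) for p, q in zip(periods, periods[1:])]


def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Toy waveforms: a perfectly regular pulse and one with a distorted beat.
regular = [[0.0, 1.0, 0.0, -1.0]] * 3
irregular = [[0.0, 1.0, 0.0, -1.0], [0.0, 0.5, 1.0, -1.0], [0.0, 1.0, 0.0, -1.0]]

regular_d = adjacent_period_distances(regular, euclidean)
irregular_d = adjacent_period_distances(irregular, euclidean)

# Inferred counts: 11 AF detected, 0 missed, 50 of 51 non-AF correctly rejected.
sens, spec, acc = rates(tp=11, fn=0, tn=50, fp=1)
```

Under those inferred counts the arithmetic reproduces the reported figures: 11/11 = 100%, 50/51 rounds to 98.04%, and 61/62 rounds to 98.39%.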

Generating stable molecules using imitation and reinforcement learning

Machine Learning: Science and Technology ◽

10.1088/2632-2153/ac3eb4 ◽

2021 ◽

Author(s):

Søren Ager Meldgaard ◽

Jonas Köhler ◽

Henrik Lund Mortensen ◽

Mads-Peter Verner Christiansen ◽

Frank Noé ◽

...

Keyword(s):

Reinforcement Learning ◽

Chemical Space ◽

Training Data ◽

Graph Representation ◽

Imitation Learning ◽

Training Set ◽

Machine Learning Methods ◽

Multiple Copies ◽

The Stability ◽

3D Information

Chemical space is routinely explored by machine learning methods to discover interesting molecules, before time-consuming experimental synthesis is attempted. However, these methods often rely on a graph representation, ignoring 3D information necessary for determining the stability of the molecules. We propose a reinforcement learning approach for generating molecules in Cartesian coordinates allowing for quantum chemical prediction of the stability. To improve sample efficiency, we learn basic chemical rules from imitation learning on the GDB-11 database to create an initial model applicable for all stoichiometries. We then deploy multiple copies of the model conditioned on a specific stoichiometry in a reinforcement learning setting. The models correctly identify low energy molecules in the database and produce novel isomers not found in the training set. Finally, we apply the model to larger molecules to show how reinforcement learning further refines the imitation learning model in domains far from the training data.

Introduction to Molecular Similarity and Chemical Space

Foodinformatics ◽

10.1007/978-3-319-10226-9_1 ◽

2014 ◽

pp. 1-81 ◽

Cited By ~ 2

Author(s):

Gerald M. Maggiora

Keyword(s):

Molecular Similarity ◽

Chemical Space

BIOFACQUIM: A Mexican Compound Database of Natural Products

Biomolecules ◽

10.3390/biom9010031 ◽

2019 ◽

Vol 9 (1) ◽

pp. 31 ◽

Cited By ~ 20

Author(s):

B. Pilón-Jiménez ◽

Fernanda Saldívar-González ◽

Bárbara Díaz-Eufracio ◽

José Medina-Franco

Keyword(s):

Natural Products ◽

Drug Discovery ◽

Physicochemical Properties ◽

Chemical Space ◽

Structural Diversity ◽

Proof Of Concept ◽

Molecular Fingerprints ◽

The Public ◽

Compound Database

Compound databases of natural products have a major impact on drug discovery projects and other areas of research. The number of databases in the public domain with compounds with natural origins is increasing. Several countries, Brazil, France, Panama and, recently, Vietnam, have initiatives in place to construct and maintain compound databases that are representative of their diversity. In this proof-of-concept study, we discuss the first version of BIOFACQUIM, a novel compound database with natural products isolated and characterized in Mexico. We discuss its construction, curation, and a complete chemoinformatic characterization of the content and coverage in chemical space. The profile of physicochemical properties, scaffold content, and diversity, as well as structural diversity based on molecular fingerprints is reported. BIOFACQUIM is available for free.

Efficient Multi-Objective Molecular Optimization in a Continuous Latent Space

10.26434/chemrxiv.7971101.v1 ◽

2019 ◽

Author(s):

Robin Winter ◽

Floriane Montanari ◽

Andreas Steffen ◽

Hans Briem ◽

Frank Noé ◽

...

Keyword(s):

Objective Function ◽

In Silico ◽

Prediction Models ◽

Chemical Space ◽

In Silico Prediction ◽

Swarm Optimization ◽

Starting Compound ◽

Latent Space ◽

Novel Method ◽

Short Time

In this work, we propose a novel method that combines in silico prediction of molecular properties such as biological activity or pharmacokinetics with an in silico optimization algorithm, namely Particle Swarm Optimization. Our method takes a starting compound as input and proposes new molecules with more desirable (predicted) properties. It navigates a machine-learned continuous representation of a drug-like chemical space guided by a defined objective function. The objective function combines multiple in silico prediction models, defined desirability ranges and substructure constraints. We demonstrate that our proposed method is able to consistently find more desirable molecules for the studied tasks in relatively short time.
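Particle Swarm Optimization over a continuous space, the optimizer named in the abstract, can be sketched in a few lines. This is a generic textbook-style PSO under stated assumptions, not the authors' implementation: the objective is a toy squared distance to a made-up target point standing in for a learned property model, the search runs directly on coordinates rather than on a learned molecular embedding, and the hyperparameters (`w`, `c1`, `c2`, swarm size) are illustrative choices.

```python
import random


def pso(objective, dim, n_particles=30, n_iters=200, bounds=(-5.0, 5.0),
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over a box-bounded continuous space."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position seen per particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


# Toy objective: squared distance to an invented "most desirable" point.
target = [1.0, -2.0]


def toy_objective(z):
    return sum((x - t) ** 2 for x, t in zip(z, target))


best, best_val = pso(toy_objective, dim=2)
```

In the setting the abstract describes, the objective would instead decode a latent point to a molecule and score it with property predictors and constraints; the swarm update itself would be unchanged.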

A De Novo Molecular Generation Method Using Latent Vector Based Generative Adversarial Network

10.26434/chemrxiv.8299544.v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Oleksii Prykhodko ◽

Simon Viet Johansson ◽

Panagiotis-Christos Kotsias ◽

Esben Jannik Bjerrum ◽

Ola Engkvist ◽

...

Keyword(s):

Deep Learning ◽

De Novo ◽

Chemical Space ◽

Learning Method ◽

Training Set ◽

Generative Adversarial Network ◽

Structure Generation ◽

Adversarial Network ◽

Molecule Design ◽

Novel Structures

Recently, deep learning methods have been used for generating novel structures. In the current study, we propose a new deep learning method, LatentGAN, which combines an autoencoder and a generative adversarial neural network for de novo molecule design. We applied the method for structure generation in two scenarios: one is to generate random drug-like compounds and the other is to generate target-biased compounds. Our results show that the method works well in both cases: compounds sampled from the trained model largely occupy the same chemical space as the training set, while a substantial fraction of the generated compounds are novel. The distribution of drug-likeness scores for compounds sampled from LatentGAN is also similar to that of the training set.

A De Novo Molecular Generation Method Using Latent Vector Based Generative Adversarial Network

10.26434/chemrxiv.8299544.v3 ◽

2019 ◽

Cited By ~ 1

Author(s):

Oleksii Prykhodko ◽

Simon Viet Johansson ◽

Panagiotis-Christos Kotsias ◽

Josep Arús-Pous ◽

Esben Jannik Bjerrum ◽

...

Keyword(s):

Neural Network ◽

Deep Learning ◽

De Novo ◽

Molecular Design ◽

Chemical Space ◽

Training Set ◽

Generative Adversarial Network ◽

Adversarial Network ◽

De Novo Molecular Design ◽

Novel Structures

Deep learning methods applied to drug discovery have been used to generate novel structures. In this study, we propose a new deep learning architecture, LatentGAN, which combines an autoencoder and a generative adversarial neural network for de novo molecular design. We applied the method in two scenarios: one to generate random drug-like compounds and another to generate target-biased compounds. Our results show that the method works well in both cases: sampled compounds from the trained model can largely occupy the same chemical space as the training set and also generate a substantial fraction of novel compounds. Moreover, the drug-likeness score of compounds sampled from LatentGAN is also similar to that of the training set. Lastly, generated compounds differ from those obtained with a Recurrent Neural Network-based generative model approach, indicating that both methods can be used complementarily.

Randomized SMILES strings improve the quality of molecular generative models

Journal of Cheminformatics ◽

10.1186/s13321-019-0393-0 ◽

2019 ◽

Vol 11 (1) ◽

Cited By ~ 22

Author(s):

Josep Arús-Pous ◽

Simon Viet Johansson ◽

Oleksii Prykhodko ◽

Esben Jannik Bjerrum ◽

Christian Tyrchan ◽

...

Keyword(s):

Recurrent Neural Networks ◽

Chemical Space ◽

Cell Types ◽

Generative Models ◽

The Other ◽

Probability Models ◽

Training Set ◽

String Representation ◽

Almost All

Recurrent Neural Networks (RNNs) trained with a set of molecules represented as unique (canonical) SMILES strings have shown the capacity to create large chemical spaces of valid and meaningful structures. Herein we perform an extensive benchmark on models trained with subsets of GDB-13 of different sizes (1 million, 10,000 and 1000), with different SMILES variants (canonical, randomized and DeepSMILES), with two different recurrent cell types (LSTM and GRU) and with different hyperparameter combinations. To guide the benchmarks, new metrics were developed that define how well a model has generalized the training set. The generated chemical space is evaluated with respect to its uniformity, closedness and completeness. Results show that models that use LSTM cells trained with 1 million randomized SMILES, a non-unique molecular string representation, are able to generalize to larger chemical spaces than the other approaches and they represent more accurately the target chemical space. Specifically, a model was trained with randomized SMILES that was able to generate almost all molecules from GDB-13 with a quasi-uniform probability. Models trained with smaller samples show an even bigger improvement when trained with randomized SMILES models. Additionally, models were trained on molecules obtained from ChEMBL and illustrate again that training with randomized SMILES leads to models having a better representation of the drug-like chemical space. Namely, the model trained with randomized SMILES was able to generate at least double the number of unique molecules with the same distribution of properties compared with one trained with canonical SMILES.
