Performance of chemical structure string representations for chemical image recognition using transformers

DECIMER 1.0: deep learning for chemical image recognition using transformers

Journal of Cheminformatics ◽

10.1186/s13321-021-00538-8 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Kohulan Rajan ◽

Achim Zielesny ◽

Christoph Steinbeck

Keyword(s):

Deep Learning ◽

Image Recognition ◽

Chemical Structure ◽

Learning Approaches ◽

Chemical Structures ◽

Structure Recognition ◽

Best Fitting ◽

Preliminary Communication ◽

Computational Intelligence Methods ◽

Chemical Image

Abstract The amount of data available on chemical structures and their properties has increased steadily over the past decades. In particular, articles published before the mid-1990s are available only in printed or scanned form. The extraction and storage of data from those articles in a publicly accessible database are desirable, but doing this manually is a slow and error-prone process. In order to extract chemical structure depictions and convert them into a computer-readable format, Optical Chemical Structure Recognition (OCSR) tools were developed where the best performing OCSR tools are mostly rule-based. The DECIMER (Deep lEarning for Chemical ImagE Recognition) project was launched to address the OCSR problem with the latest computational intelligence methods to provide an automated open-source software solution. Various current deep learning approaches were explored to seek a best-fitting solution to the problem. In a preliminary communication, we outlined the prospect of being able to predict SMILES encodings of chemical structure depictions with about 90% accuracy using a dataset of 50–100 million molecules. In this article, the new DECIMER model is presented, a transformer-based network, which can predict SMILES with above 96% accuracy from depictions of chemical structures without stereochemical information and above 89% accuracy for depictions with stereochemical information.

Performance of chemical structure string representations for chemical image recognition using transformers

10.33774/chemrxiv-2021-7c9wf ◽

2021 ◽

Author(s):

Kohulan Rajan ◽

Christoph Steinbeck ◽

Achim Zielesny

Keyword(s):

Deep Learning ◽

Image Recognition ◽

Chemical Structure ◽

Learning Task ◽

Image Features ◽

The Public ◽

Chemical Structures ◽

Overall Performance ◽

Chemical Image

The use of molecular string representations for deep learning in chemistry has been steadily increasing in recent years. The complexity of existing string representations, and the difficulty in creating meaningful tokens from them, led to the development of new string representations for chemical structures. In this study, the translation of chemical structure depictions in the form of bitmap images to corresponding molecular string representations was examined. An analysis of the recently developed DeepSMILES and SELFIES representations in comparison with the most commonly used SMILES representation is presented, in which the ability to translate image features into string representations with transformer models was specifically tested. The SMILES representation exhibits the best overall performance, whereas SELFIES guarantee valid chemical structures. DeepSMILES performs in between SMILES and SELFIES; InChIs are not appropriate for the learning task. All investigations were carried out with publicly available datasets, and the code used to train and evaluate the models has been made available to the public.
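To illustrate what "creating meaningful tokens" from a string representation can involve, the following is a minimal sketch of a regex-based SMILES tokenizer. It is not the tokenization used in the study; the function name `tokenize_smiles` and the token pattern are assumptions made for illustration, and the pattern covers only the organic-subset atoms, bracket atoms, bonds, branches and ring closures.

```python
import re

# Hypothetical token pattern (an assumption, not taken from the paper).
# Order matters: bracket atoms and two-letter halogens are tried first
# so that "Cl" is not split into "C" and "l".
_TOKEN = re.compile(
    r"\[[^\]]+\]"        # bracket atoms such as [C@H] or [NH4+]
    r"|Br|Cl"            # two-letter organic-subset atoms
    r"|%\d{2}"           # two-digit ring closures such as %12
    r"|[BCNOSPFI]"       # one-letter aliphatic atoms
    r"|[bcnosp]"         # aromatic atoms
    r"|\d"               # single-digit ring closures
    r"|[()=#\-+\\/.:~*$]"  # branches, bonds and other symbols
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail if any character is not covered."""
    tokens = _TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


if __name__ == "__main__":
    print(tokenize_smiles("C[C@H](N)C(=O)O"))
    print(tokenize_smiles("ClCc1ccccc1Br"))
```

Treating a bracket atom as a single token keeps stereo and charge information together, which is one reason character-level splitting of SMILES tends to produce less meaningful tokens.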

DECIMER 1.0: Deep Learning for Chemical Image Recognition using Transformers

10.26434/chemrxiv.14479287 ◽

2021 ◽

Author(s):

Kohulan Rajan ◽

Achim Zielesny ◽

Christoph Steinbeck

Keyword(s):

Deep Learning ◽

Image Recognition ◽

Chemical Structure ◽

Learning Approaches ◽

Chemical Structures ◽

Structure Recognition ◽

Best Fitting ◽

Preliminary Communication ◽

Computational Intelligence Methods ◽

Chemical Image

The amount of data available on chemical structures and their properties has increased exponentially over the past decades. In particular, articles published before the mid-1990s are available only in printed or scanned form. The extraction and storage of data from those articles in a publicly accessible database are desirable, but doing this manually is a slow and error-prone process. In order to extract chemical structure depictions and convert them into a computer-readable format, optical chemical structure recognition (OCSR) tools were developed where the best performing OCSR tools are mostly rule-based. The DECIMER (Deep lEarning for Chemical ImagE Recognition) project was launched to address the OCSR problem with the latest computational intelligence methods to provide an automated open-source software solution. Various current deep learning approaches were explored to seek a best-fitting solution to the problem. In a preliminary communication, we outlined the prospect of being able to predict SMILES encodings of chemical structure depictions with about 90% accuracy using a dataset of 50–100 million molecules. In this article, the new DECIMER model is presented, a transformer-based network, which can predict SMILES with above 96% accuracy from depictions of chemical structures without stereochemical information and above 89% accuracy for depictions with stereochemical information.

DECIMER 1.0: Deep Learning for Chemical Image Recognition using Transformers

10.26434/chemrxiv.14479287.v1 ◽

2021 ◽

Author(s):

Kohulan Rajan ◽

Achim Zielesny ◽

Christoph Steinbeck

Keyword(s):

Deep Learning ◽

Image Recognition ◽

Chemical Structure ◽

Learning Approaches ◽

Chemical Structures ◽

Structure Recognition ◽

Best Fitting ◽

Preliminary Communication ◽

Computational Intelligence Methods ◽

Chemical Image

The amount of data available on chemical structures and their properties has increased exponentially over the past decades. In particular, articles published before the mid-1990s are available only in printed or scanned form. The extraction and storage of data from those articles in a publicly accessible database are desirable, but doing this manually is a slow and error-prone process. In order to extract chemical structure depictions and convert them into a computer-readable format, optical chemical structure recognition (OCSR) tools were developed where the best performing OCSR tools are mostly rule-based. The DECIMER (Deep lEarning for Chemical ImagE Recognition) project was launched to address the OCSR problem with the latest computational intelligence methods to provide an automated open-source software solution. Various current deep learning approaches were explored to seek a best-fitting solution to the problem. In a preliminary communication, we outlined the prospect of being able to predict SMILES encodings of chemical structure depictions with about 90% accuracy using a dataset of 50–100 million molecules. In this article, the new DECIMER model is presented, a transformer-based network, which can predict SMILES with above 96% accuracy from depictions of chemical structures without stereochemical information and above 89% accuracy for depictions with stereochemical information.

DECIMER 1.0: Deep Learning for Chemical Image Recognition using Transformers

10.33774/chemrxiv-2021-9j7wg-v2 ◽

2021 ◽

Author(s):

Kohulan Rajan ◽

Achim Zielesny ◽

Christoph Steinbeck

Keyword(s):

Deep Learning ◽

Image Recognition ◽

Chemical Image

DECIMER Segmentation - Automated Extraction of Chemical Structure Depictions from Scientific Literature

10.26434/chemrxiv.13536950.v1 ◽

2021 ◽

Author(s):

Kohulan Rajan ◽

Henning Otto Brinkhaus ◽

Maria Sorokina ◽

Achim Zielesny ◽

Christoph Steinbeck

Keyword(s):

Deep Learning ◽

Web Application ◽

Chemical Structure ◽

Scientific Literature ◽

Data Extraction ◽

Scientific Publications ◽

Automated Recognition ◽

Chemical Structures ◽

Machine Readable ◽

Chemical Image

Chemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. As the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions from scientific publications. This is especially important for older documents which are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature. The workflow is divided into two main stages. During the detection step, a deep learning model recognizes chemical structure depictions and creates masks which define their positions on the input page. Subsequently, potentially incomplete masks are expanded in a post-processing workflow. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages to be applicable also to older articles before the introduction of vector images in PDFs. By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. In order to facilitate access to DECIMER Segmentation, we also developed a web application. 
The web application, available at https://decimer.ai, lets the user upload a PDF file and retrieve the segmented structure depictions.
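The second stage described above, expanding potentially incomplete masks, can be pictured with a small sketch. This is not the DECIMER Segmentation implementation: it assumes a binarized page (`ink`, truthy where a pixel is dark) and a binary mask from a detector, and grows the mask by a breadth-first flood fill over 8-connected ink pixels. The function name `expand_mask` and the list-of-lists image format are hypothetical choices made for illustration.

```python
from collections import deque


def expand_mask(ink, mask):
    """Grow `mask` to cover every ink pixel connected to ink already inside it.

    `ink` and `mask` are equally sized 2D lists of truthy/falsy values.
    Returns a new 2D list of booleans; the inputs are not modified.
    """
    rows, cols = len(ink), len(ink[0])
    expanded = [[bool(v) for v in row] for row in mask]
    # Seed the search with ink pixels the detector already covered.
    queue = deque(
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if mask[r][c] and ink[r][c]
    )
    while queue:
        r, c = queue.popleft()
        # Visit the 8-neighbourhood of the current pixel.
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (
                    0 <= nr < rows
                    and 0 <= nc < cols
                    and ink[nr][nc]
                    and not expanded[nr][nc]
                ):
                    expanded[nr][nc] = True
                    queue.append((nr, nc))
    return expanded


if __name__ == "__main__":
    ink = [[1, 1, 1, 0, 1],
           [0, 0, 0, 0, 0]]
    mask = [[1, 0, 0, 0, 0],
            [0, 0, 0, 0, 0]]
    # The mask grows along the connected stroke but not across the gap.
    print(expand_mask(ink, mask))
```

A fill of this kind recovers bonds or atom labels that a detector clipped, while ink separated from the structure by whitespace stays outside the mask.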

DECIMER - Towards Deep Learning for Chemical Image Recognition

10.26434/chemrxiv.12464420.v1 ◽

2020 ◽

Author(s):

Kohulan Rajan ◽

Achim Zielesny ◽

Christoph Steinbeck

Keyword(s):

Deep Learning ◽

Image Recognition ◽

Open Data ◽

Data Representation ◽

Training Data ◽

Training Time ◽

Training Structures ◽

Training Success ◽

Traditional Approaches ◽

Chemical Image

The automatic recognition of chemical structure diagrams from the literature is an indispensable component of workflows to re-discover information about chemicals and to make it available in open-access databases. Here we report preliminary findings in our development of DECIMER (Deep lEarning for Chemical ImagE Recognition), a deep learning method based on existing show-and-tell deep neural networks which makes very few assumptions about the structure of the underlying problem. The training state reported here does not yet rival the performance of existing traditional approaches, but we present evidence that our method will reach a comparable detection power with sufficient training time. Training success of DECIMER depends on the input data representation: DeepSMILES are clearly superior over SMILES and we have a preliminary indication that the recently reported SELFIES outperform DeepSMILES. An extrapolation of our results towards larger training data sizes suggests that we might be able to achieve >90% accuracy with about 60 to 100 million training structures, so that training can be completed within several months on a single GPU. This work is completely based on open-source software and open data and is available to the general public for any purpose.

DECIMER: towards deep learning for chemical image recognition

Journal of Cheminformatics ◽

10.1186/s13321-020-00469-w ◽

2020 ◽

Vol 12 (1) ◽

Author(s):

Kohulan Rajan ◽

Achim Zielesny ◽

Christoph Steinbeck

Keyword(s):

Deep Learning ◽

Image Recognition ◽

Open Data ◽

Data Representation ◽

Training Data ◽

Training Time ◽

Training Structures ◽

Training Success ◽

Traditional Approaches ◽

Chemical Image

Abstract The automatic recognition of chemical structure diagrams from the literature is an indispensable component of workflows to re-discover information about chemicals and to make it available in open-access databases. Here we report preliminary findings in our development of Deep lEarning for Chemical ImagE Recognition (DECIMER), a deep learning method based on existing show-and-tell deep neural networks, which makes very few assumptions about the structure of the underlying problem. It translates a bitmap image of a molecule, as found in publications, into a SMILES. The training state reported here does not yet rival the performance of existing traditional approaches, but we present evidence that our method will reach a comparable detection power with sufficient training time. Training success of DECIMER depends on the input data representation: DeepSMILES are superior over SMILES and we have a preliminary indication that the recently reported SELFIES outperform DeepSMILES. An extrapolation of our results towards larger training data sizes suggests that we might be able to achieve near-accurate prediction with 50 to 100 million training structures. This work is entirely based on open-source software and open data and is available to the general public for any purpose.

DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature

Journal of Cheminformatics ◽

10.1186/s13321-021-00496-1 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Kohulan Rajan ◽

Henning Otto Brinkhaus ◽

Maria Sorokina ◽

Achim Zielesny ◽

Christoph Steinbeck

Keyword(s):

Deep Learning ◽

Web Application ◽

Chemical Structure ◽

Scientific Literature ◽

Data Extraction ◽

Scientific Publications ◽

Automated Recognition ◽

Chemical Structures ◽

Machine Readable ◽

Chemical Image

Abstract Chemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. As the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions from scientific publications. This is especially important for older documents which are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature. The workflow is divided into two main stages. During the detection step, a deep learning model recognizes chemical structure depictions and creates masks which define their positions on the input page. Subsequently, potentially incomplete masks are expanded in a post-processing workflow. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages to be applicable also to older articles before the introduction of vector images in PDFs. By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. In order to facilitate access to DECIMER Segmentation, we also developed a web application. The web application, available at https://decimer.ai, lets the user upload a PDF file and retrieve the segmented structure depictions.

DECIMER Segmentation - Automated Extraction of Chemical Structure Depictions from Scientific Literature

10.26434/chemrxiv.13536950.v2 ◽

2021 ◽

Author(s):

Kohulan Rajan ◽

Henning Otto Brinkhaus ◽

Maria Sorokina ◽

Achim Zielesny ◽

Christoph Steinbeck

Keyword(s):

Deep Learning ◽

Web Application ◽

Chemical Structure ◽

Scientific Literature ◽

Data Extraction ◽

Scientific Publications ◽

Automated Recognition ◽

Chemical Structures ◽

Machine Readable ◽

Chemical Image

Chemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. As the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions from scientific publications. This is especially important for older documents which are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature. The workflow is divided into two main stages. During the detection step, a deep learning model recognizes chemical structure depictions and creates masks which define their positions on the input page. Subsequently, potentially incomplete masks are expanded in a post-processing workflow. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages to be applicable also to older articles before the introduction of vector images in PDFs. By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. In order to facilitate access to DECIMER Segmentation, we also developed a web application. 
The web application, available at https://decimer.ai, lets the user upload a PDF file and retrieve the segmented structure depictions.
