Using Machine Learning to Collect and Facilitate Remote Access to Biomedical Databases: Development of the Biomedical Database Inventory

10.2196/22976 ◽  
2021 ◽  
Vol 9 (2) ◽  
pp. e22976
Author(s):  
Eduardo Rosado ◽  
Miguel Garcia-Remesal ◽  
Sergio Paraiso-Medina ◽  
Alejandro Pazos ◽  
Victor Maojo

Background Currently, existing biomedical literature repositories do not commonly provide users with specific means to locate and remotely access biomedical databases. Objective To address this issue, we developed the Biomedical Database Inventory (BiDI), a repository linking to biomedical databases automatically extracted from the scientific literature. BiDI provides an index of data resources and a path to access them seamlessly. Methods We designed an ensemble of deep learning methods to extract database mentions. To train the system, we annotated a set of 1242 articles that included mentions of database publications. Such a data set was used along with transfer learning techniques to train an ensemble of deep learning natural language processing models targeted at database publication detection. Results The system obtained an F1 score of 0.929 on database detection, showing high precision and recall values. When applying this model to the PubMed and PubMed Central databases, we identified over 10,000 unique databases. The ensemble model also extracted the weblinks to the reported databases and discarded irrelevant links. For the extraction of weblinks, the model achieved a cross-validated F1 score of 0.908. We show two use cases: one related to “omics” and the other related to the COVID-19 pandemic. Conclusions BiDI enables access to biomedical resources over the internet and facilitates data-driven research and other scientific initiatives. The repository is openly available online and will be regularly updated with an automatic text processing pipeline. The approach can be reused to create repositories of different types (ie, biomedical and others).
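The abstract describes combining an ensemble of deep learning models for database publication detection. The papers' actual models are not reproduced here; the following is a minimal sketch of how per-model binary predictions could be combined by majority vote, with all model outputs hypothetical.

```python
from collections import Counter

def ensemble_vote(predictions):
    """Majority vote over per-model binary labels for one article.

    predictions: iterable of 0/1 labels, one per base model.
    Ties are broken toward 1 (flag the article for review) -- a
    hypothetical policy, not necessarily the one used by BiDI.
    """
    counts = Counter(predictions)
    return 1 if counts[1] >= counts[0] else 0

# Hypothetical outputs of three fine-tuned NLP models for four articles
model_a = [1, 0, 1, 0]
model_b = [1, 0, 0, 0]
model_c = [1, 1, 1, 0]

labels = [ensemble_vote(p) for p in zip(model_a, model_b, model_c)]
print(labels)  # [1, 0, 1, 0]
```

An article is kept as a database publication only when a majority of the base models agree, which is one common way such ensembles trade individual-model noise for higher precision.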


2021 ◽  
Author(s):  
Chun-chao Lo ◽  
Shubo Tian ◽  
Yuchuan Tao ◽  
Jie Hao ◽  
Jinfeng Zhang

Most queries submitted to a literature search engine can be more precisely written as sentences to give the search engine more specific information. Sentence queries should, in principle, be more effective than short queries with a small number of keywords. Querying with full sentences is also a key step in question-answering and citation recommendation systems. Despite the considerable progress in natural language processing (NLP) in recent years, using sentence queries on current search engines does not yield satisfactory results. In this study, we developed a deep learning-based method for sentence queries, called DeepSenSe, using citation data available in full-text articles obtained from PubMed Central (PMC). A large amount of labeled data was generated from millions of matched citing sentences and cited articles, making it possible to train quality predictive models using modern deep learning techniques. A two-stage approach was designed: in the first stage, we used a modified BM25 algorithm to retrieve the top 1000 relevant articles; in the second stage, we re-ranked those articles using DeepSenSe. We tested our method using a large number of sentences extracted from real scientific articles in PMC. Our method performed substantially better than PubMed and Google Scholar for sentence queries.
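The first stage of the pipeline above is a modified BM25 ranker. As a rough illustration (the paper's modifications are not reproduced), a plain Okapi BM25 scorer over a toy tokenized corpus can be sketched as follows; the corpus and query are invented.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each document against the query with standard Okapi BM25.

    docs: list of token lists. Returns one score per document.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf, dl, s = Counter(d), len(d), 0.0
        for q in query_tokens:
            if q not in tf:
                continue
            idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores

docs = [["deep", "learning", "for", "search"],
        ["citation", "networks"],
        ["bm25", "ranking", "search"]]
query = ["deep", "learning", "search"]
scores = bm25_scores(query, docs)
```

In the full system, the top-ranked candidates from this stage (the top 1000 articles) would then be re-ranked by the learned DeepSenSe model, which is not reproduced here.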


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Molham Al-Maleh ◽  
Said Desouki

Natural language processing has witnessed remarkable progress with the advent of deep learning techniques. Text summarization, along with other tasks such as text translation and sentiment analysis, has used deep neural network models to enhance results. The newer methods of text summarization follow a sequence-to-sequence encoder–decoder framework, composed of neural networks trained jointly on both input and output. Deep neural networks take advantage of big datasets to improve their results. These networks are supported by the attention mechanism, which can handle long texts more efficiently by identifying focus points in the text, and by the copy mechanism, which allows the model to copy words from the source to the summary directly. In this research, we re-implement the basic summarization model that applies the sequence-to-sequence framework to the Arabic language, to which this model had not previously been applied for text summarization. First, we build an Arabic data set of summarized article headlines. This data set consists of approximately 300,000 entries, each pairing an article introduction with its corresponding headline. We then apply baseline summarization models to this data set and compare the results using the ROUGE metric.
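The abstract evaluates summaries with ROUGE. As a minimal sketch of the unigram variant, ROUGE-1 F1 between a candidate and a reference summary can be computed from their token overlap; the token lists below are invented for illustration, and a real Arabic pipeline would tokenize Arabic text the same way.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1 between a candidate summary and a reference (token lists)."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy example: half the candidate unigrams match the reference
score = rouge1_f1(["the", "model", "works"], ["the", "model", "fails"])
```

Counting is clipped per token (via the `Counter` intersection), so repeating a matching word in the candidate cannot inflate the score beyond its frequency in the reference.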


2022 ◽  
pp. 1-12
Author(s):  
Amin Ul Haq ◽  
Jian Ping Li ◽  
Samad Wali ◽  
Sultan Ahmad ◽  
Zafar Ali ◽  
...  

Artificial intelligence (AI)-based computer-aided diagnostic (CAD) systems can effectively diagnose critical diseases. AI-based detection of breast cancer (BC) from image data is more efficient and accurate than diagnosis by professional radiologists alone. However, existing AI-based BC diagnosis methods suffer from low prediction accuracy and high computation time; for these reasons, medical professionals have not adopted the currently proposed techniques for diagnosing BC in e-healthcare. Effective diagnosis of breast cancer therefore requires incorporating advanced AI techniques into the diagnostic process. In this work, we propose a deep learning-based diagnosis method (StackBC) to detect breast cancer at an early stage for effective treatment and recovery. In particular, we incorporate deep learning models including a convolutional neural network (CNN), long short-term memory (LSTM), and a gated recurrent unit (GRU) for the classification of invasive ductal carcinoma (IDC). Additionally, data augmentation and transfer learning techniques are used to balance the data set and train the model effectively. To further improve the predictive performance of the model, we use a stacking technique. Among the three base classifiers (CNN, LSTM, GRU), the GRU performs best as an individual model and is therefore selected as the meta classifier to distinguish between non-IDC and IDC breast images. A hold-out scheme is used, splitting the data set into 90% for training and 10% for testing. Standard evaluation metrics are computed to assess model performance, and a breast histology image data set is used to analyze the efficacy of the model. Our experimental results demonstrate that the proposed StackBC method achieves improved performance, with 99.02% accuracy and 100% area under the receiver operating characteristic curve (AUC-ROC), compared with state-of-the-art methods. Given its high performance, we recommend the proposed method for early recognition of breast cancer in e-healthcare.
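The stacking scheme and hold-out split described above can be sketched structurally. The trained networks themselves are not reproduced; the meta classifier below is a stand-in rule (averaged probability thresholded at 0.5), whereas the paper uses a GRU as the meta classifier, and all probabilities are hypothetical.

```python
def holdout_split(items, train_frac=0.9):
    """Deterministic hold-out split (default 90% train / 10% test)."""
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

def stack_predict(base_probs, meta):
    """Feed base-model IDC probabilities for one image to a meta classifier.

    base_probs: per-model probabilities (e.g., from CNN, LSTM, GRU).
    meta: callable standing in for the trained meta classifier.
    """
    return meta(base_probs)

# Hypothetical stand-in meta rule: average probability thresholded at 0.5
meta = lambda probs: int(sum(probs) / len(probs) >= 0.5)

print(stack_predict([0.8, 0.7, 0.9], meta))  # 1 (IDC)
print(stack_predict([0.2, 0.1, 0.4], meta))  # 0 (non-IDC)
```

The essential point of stacking is that the meta classifier's inputs are the base models' outputs rather than the raw images, so it learns (or here, hard-codes) how to arbitrate among the base predictions.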


2021 ◽  
Vol 2021 ◽  
pp. 1-7
Author(s):  
Juncai Li ◽  
Xiaofei Jiang

Molecular property prediction is an essential task in drug discovery. Most computational approaches with deep learning techniques either focus on designing novel molecular representations or on combining them with other advanced models. However, researchers pay less attention to the potential benefits of massive unlabeled molecular data (e.g., ZINC). This task becomes increasingly challenging owing to the limited scale of labeled data. Motivated by recent advances in pretrained models in natural language processing, drug molecules can, to some extent, naturally be viewed as a language. In this paper, we investigate how to adapt the pretrained model BERT to extract useful molecular substructure information for molecular property prediction. We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with a pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale BERT model is pretrained to generate embeddings of molecular substructures, using four million unlabeled drug SMILES strings (from ZINC 15 and ChEMBL 27). The pretrained BERT model can then be fine-tuned on various molecular property prediction tasks. To examine the performance of our proposed Mol-BERT, we conduct experiments on four widely used molecular datasets. In comparison with traditional and state-of-the-art baselines, the results illustrate that our proposed Mol-BERT outperforms current sequence-based methods, achieving at least a 2% improvement in ROC-AUC score on the Tox21, SIDER, and ClinTox datasets.
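Treating SMILES strings as a language requires splitting them into tokens before feeding them to a BERT-style model. The sketch below is a common regex-based atom/bond-level tokenizer, given only as an illustration of this preprocessing step; Mol-BERT's actual substructure representation may differ, and the regex here is simplified (it covers common organic-subset atoms, bracket atoms, bonds, and ring-closure digits, not the full SMILES grammar).

```python
import re

# Simplified SMILES token pattern: bracket atoms, two-letter elements,
# chirality marks, single-letter atoms, bonds/branches, ring closures.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|Si|Se|se|@@|[BCNOPSFIbcnops]|[=#$/\\().+\-:~@?>*%]|\d"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into atom/bond-level tokens (simplified)."""
    return SMILES_TOKEN.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

The resulting token sequence plays the role that word-piece tokens play in text BERT: it is what gets masked and predicted during pretraining on the unlabeled SMILES corpus.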


2021 ◽  
Vol 9 (2) ◽  
pp. 1051-1052
Author(s):  
K. Kavitha et al.

Sentiment refers to the opinions or views about a topic that people express through some medium of communication. Social media is now an effective platform for people to communicate, and it generates a huge amount of unstructured data every day. It is essential for business organizations in the current era to process and analyze these sentiments using machine learning and natural language processing (NLP) strategies; deep learning strategies in particular have recently become more popular because of their superior performance. This paper presents an empirical study of the application of deep learning techniques to sentiment analysis (SA) of sarcastic messages and their growing scope in real time. A taxonomy of recent sentiment analysis work and its key terms is also highlighted in the manuscript. The survey concludes with the recent datasets considered, their key contributions, and the performance of the deep learning models applied, with a primary focus on sarcasm detection, in order to describe the efficacy of deep learning frameworks in the domain of sentiment analysis.


2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Kazi Nabiul Alam ◽  
Md Shakib Khan ◽  
Abdur Rab Dhruba ◽  
Mohammad Monirujjaman Khan ◽  
Jehad F. Al-Amri ◽  
...  

The COVID-19 pandemic has had a devastating effect on many people, creating severe anxiety, fear, and complicated feelings or emotions. After the initiation of vaccinations against coronavirus, people’s feelings have become more diverse and complex. Our aim is to understand and unravel their sentiments in this research using deep learning techniques. Social media is currently the best way to express feelings and emotions, and with the help of Twitter, one can have a better idea of what is trending and going on in people’s minds. Our motivation for this research was to understand the diverse sentiments of people regarding the vaccination process. In this research, the timeline of the collected tweets was from December 21 to July 21. The tweets contained information about the most common vaccines recently available across the world. The sentiments of people regarding vaccines of all sorts were assessed using the natural language processing (NLP) tool Valence Aware Dictionary for sEntiment Reasoner (VADER). Grouping the polarities of the obtained sentiments into three classes (positive, negative, and neutral) helped us visualize the overall scenario; our findings included 33.96% positive, 17.55% negative, and 48.49% neutral responses. In addition, we analyzed the timeline of the tweets, as sentiments fluctuated over time. Recurrent neural network (RNN) architectures, namely long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM), were used to build the predictive models, with LSTM achieving an accuracy of 90.59% and Bi-LSTM achieving 90.83%. Other performance metrics, such as precision, F1-score, and a confusion matrix, were also used to validate our models and findings more effectively. This study improves understanding of the public’s opinion on COVID-19 vaccines and supports the aim of eradicating coronavirus from the world.
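The three-way grouping of VADER output described above follows the conventional cutoffs on VADER's compound score (≥ +0.05 positive, ≤ −0.05 negative, otherwise neutral). A minimal sketch of that bucketing step, with hypothetical compound scores standing in for real tweet output:

```python
def polarity_bucket(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a polarity label.

    Uses the conventional +/-0.05 cutoffs from the VADER documentation.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores for a small batch of vaccine tweets
scores = [0.62, -0.41, 0.01, 0.05, -0.05]
labels = [polarity_bucket(s) for s in scores]
print(labels)  # ['positive', 'negative', 'neutral', 'positive', 'negative']
```

In the full study the compound scores would come from `SentimentIntensityAnalyzer.polarity_scores(text)["compound"]`, and the resulting label proportions give the positive/negative/neutral percentages reported in the abstract.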


2021 ◽  
Author(s):  
Saniya Karnik ◽  
Navya Yenuganti ◽  
Bonang Firmansyah Jusri ◽  
Supriya Gupta ◽  
Prasanna Nirgudkar ◽  
...  

Today, Electrical Submersible Pump (ESP) failure analysis is a tedious, human-intensive, and time-consuming activity involving dismantle, inspection, and failure analysis (DIFA) for each failure. This paper presents a novel artificial intelligence workflow using an ensemble of machine learning (ML) algorithms coupled with natural language processing (NLP) and deep learning (DL). The algorithms outlined in this paper bring together structured and unstructured data across equipment, production, operations, and failure reports to automate root cause identification and analysis after a breakdown. This process reduces turnaround time (TAT) and human effort, thus drastically improving process efficiency.


2012 ◽  
pp. 13-22 ◽  
Author(s):  
João Gama ◽  
André C.P.L.F. de Carvalho

Machine learning techniques have been successfully applied to several real-world problems in areas as diverse as image analysis, the Semantic Web, bioinformatics, text processing, natural language processing, telecommunications, finance, and medical diagnosis. A particular application where machine learning plays a key role is data mining, where machine learning techniques have been extensively used for the extraction of association, clustering, prediction, diagnosis, and regression models. This text presents our personal view of the main aspects, major tasks, frequently used algorithms, current research, and future directions of machine learning research. To that end, it is organized as follows: background information concerning machine learning is presented in the second section. The third section discusses different definitions of machine learning. Common tasks faced by machine learning systems are described in the fourth section. Popular machine learning algorithms and the importance of the loss function are commented on in the fifth section. The sixth and seventh sections present current trends and future research directions, respectively.

