Subsentence Extraction from Text Using Coverage-Based Deep Learning Language Models

JongYoon Lim; Inkyu Sa; Ho Seok Ahn; Norina Gasteiger; Sanghyub John Lee; Bruce MacDonald

doi:10.3390/s21082712

Sensors ◽

10.3390/s21082712 ◽

2021 ◽

Vol 21 (8) ◽

pp. 2712

Author(s):

JongYoon Lim ◽

Inkyu Sa ◽

Ho Seok Ahn ◽

Norina Gasteiger ◽

Sanghyub John Lee ◽

...

Keyword(s):

Deep Learning ◽

Language Processing ◽

Auxiliary Information ◽

Language Models ◽

Research Fields ◽

Software Packages ◽

Public Dataset ◽

High Degree ◽

Learning Language ◽

Important Building Block

Sentiment prediction remains a challenging and unresolved task in various research fields, including psychology, neuroscience, and computer science. This stems from its high degree of subjectivity and limited input sources that can effectively capture the actual sentiment. This can be even more challenging with only text-based input. Meanwhile, the rise of deep learning and an unprecedentedly large volume of data have paved the way for artificial intelligence to perform impressively accurate predictions or even human-level reasoning. Drawing inspiration from this, we propose a coverage-based sentiment and subsentence extraction system that estimates a span of input text and recursively feeds this information back to the networks. The predicted subsentence consists of auxiliary information expressing a sentiment. This is an important building block for enabling vivid and epic sentiment delivery (within the scope of this paper) and for other natural language processing tasks such as text summarisation and Q&A. Our approach outperforms the state-of-the-art approaches by a large margin in subsentence prediction (i.e., the average Jaccard score improves from 0.72 to 0.89). For the evaluation, we designed rigorous experiments consisting of 24 ablation studies. Finally, we return the lessons learned to the community by sharing software packages and a public dataset that can reproduce the results presented in this paper.
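The abstract above reports its subsentence results as an average Jaccard score. As a rough illustration of that metric, here is a minimal sketch that assumes a word-level Jaccard similarity over lowercased, whitespace-split tokens. The abstract does not specify the tokenisation, so that choice and the function names `jaccard` and `average_jaccard` are assumptions, not the authors' evaluation code.

```python
def jaccard(predicted: str, reference: str) -> float:
    """Word-level Jaccard similarity between two text spans.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    a = set(predicted.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        # Two empty spans are treated as a perfect match (a convention, not a given).
        return 1.0
    return len(a & b) / len(a | b)


def average_jaccard(pairs) -> float:
    """Mean Jaccard score over (predicted, reference) span pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (predicted, reference) pair is required")
    return sum(jaccard(p, r) for p, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical predicted and reference subsentences, for illustration only.
    examples = [
        ("so happy today", "happy today"),
        ("really annoying", "really annoying"),
    ]
    print(average_jaccard(examples))
```

Under this definition, a prediction that contains the reference plus one extra word scores below 1, so the metric penalises spans that are too long as well as spans that are too short.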

Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches

AI ◽

10.3390/ai2010001 ◽

2021 ◽

Vol 2 (1) ◽

pp. 1-16

Author(s):

Juan Cruz-Benito ◽

Sanjay Vishwakarma ◽

Francisco Martin-Fernandez ◽

Ismael Faro

Keyword(s):

Deep Learning ◽

Learning Community ◽

Programming Languages ◽

Language Processing ◽

Code Generation ◽

Language Model ◽

Language Models ◽

Stochastic Gradient Descent ◽

Network Architectures ◽

Learning Architectures

In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engineering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the deep learning-enabled language models approach, we found a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD-Quasi-Recurrent Neural Networks (QRNNs), and Transformer while using transfer learning and different forms of tokenization to see how they behave in building language models using a Python dataset for code generation and filling mask tasks. Considering the results, we discuss each approach’s different strengths and weaknesses and what gaps we found to evaluate the language models or to apply them in a real programming context.

Astrid

Proceedings of the VLDB Endowment ◽

10.14778/3436905.3436907 ◽

2020 ◽

Vol 14 (4) ◽

pp. 471-484

Author(s):

Suraj Shetiya ◽

Saravanan Thirumuruganathan ◽

Nick Koudas ◽

Gautam Das

Keyword(s):

Deep Learning ◽

Objective Function ◽

Pattern Matching ◽

Language Processing ◽

Language Model ◽

Language Models ◽

Selectivity Estimation ◽

Statistical Correlations ◽

Benchmark Datasets ◽

Traditional Approaches

Accurate selectivity estimation for string predicates is a long-standing research challenge in databases. Supporting pattern matching on strings (such as prefix, substring, and suffix) makes this problem much more challenging, thereby necessitating a dedicated study. Traditional approaches often build pruned summary data structures such as tries followed by selectivity estimation using statistical correlations. However, this produces insufficiently accurate cardinality estimates resulting in the selection of sub-optimal plans by the query optimizer. Recently proposed deep learning based approaches leverage techniques from natural language processing such as embeddings to encode the strings and use it to train a model. While this is an improvement over traditional approaches, there is a large scope for improvement. We propose Astrid, a framework for string selectivity estimation that synthesizes ideas from traditional and deep learning based approaches. We make two complementary contributions. First, we propose an embedding algorithm that is query-type (prefix, substring, and suffix) and selectivity aware. Consider three strings 'ab', 'abc' and 'abd' whose prefix frequencies are 1000, 800 and 100 respectively. Our approach would ensure that the embedding for 'ab' is closer to 'abc' than 'abd'. Second, we describe how neural language models could be used for selectivity estimation. While they work well for prefix queries, their performance for substring queries is sub-optimal. We modify the objective function of the neural language model so that it could be used for estimating selectivities of pattern matching queries. We also propose a novel and efficient algorithm for optimizing the new objective function. We conduct extensive experiments over benchmark datasets and show that our proposed approaches achieve state-of-the-art results.
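To make the quantity being estimated concrete, the sketch below computes exact prefix, substring, and suffix selectivities by brute-force counting over a small in-memory list of strings. This is the ground truth that an estimator approximates, not the Astrid framework itself. The function names and the toy data are hypothetical, and selectivity is assumed here to mean the fraction of strings matching the pattern.

```python
from collections import Counter


def pattern_counts(strings, kind):
    """Count, for every pattern of the given kind, how many strings match it.

    kind is one of "prefix", "substring", or "suffix". Each string contributes
    at most once per pattern, so counts are numbers of matching strings.
    """
    counts = Counter()
    for s in strings:
        n = len(s)
        if kind == "prefix":
            patterns = {s[:i] for i in range(1, n + 1)}
        elif kind == "suffix":
            patterns = {s[i:] for i in range(n)}
        elif kind == "substring":
            patterns = {s[i:j] for i in range(n) for j in range(i + 1, n + 1)}
        else:
            raise ValueError(f"unknown query type: {kind}")
        counts.update(patterns)
    return counts


def selectivity(strings, pattern, kind):
    """Exact fraction of strings that match the pattern for the given query type."""
    return pattern_counts(strings, kind)[pattern] / len(strings)


if __name__ == "__main__":
    data = ["abc", "abd", "abc", "xab"]  # toy data, for illustration only
    for kind in ("prefix", "substring", "suffix"):
        print(kind, selectivity(data, "ab", kind))
```

Exact counting like this enumerates a quadratic number of substrings per string, which is one reason summary structures and learned estimators are used instead on large string columns.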

Senti-BAS: A BERT-based model with sentiment computing for happiness research (Preprint)

10.2196/preprints.27914 ◽

2021 ◽

Author(s):

Zeyuan Zeng ◽

Yijia Zhang ◽

Liang Yang ◽

Hongfei Lin

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Language Processing ◽

High Accuracy ◽

Language Models ◽

Fine Grained ◽

Label Information ◽

Common Criterion ◽

Text Content ◽

Sentiment Computing

BACKGROUND Happiness has recently become a topic of growing interest, and it can be described in various forms. For text content, natural language processing (NLP) methods offer an interesting way to study happiness. OBJECTIVE Happiness is an abstract and complicated emotion, and there is no common criterion to measure and describe it. Therefore, researchers are creating different models to study and measure happiness. METHODS In this paper, we present a deep-learning-based model called Senti-BAS (BERT embedded Bi-LSTM with self-Attention mechanism along with the Sentiment computing). RESULTS Given a sentence that describes how a person recently felt happiness, the model classifies the happiness scenario in the sentence along two dimensions: whether it was controlled by the author (label ‘agency’) and whether it involved other people (label ‘social’). Besides language models, we employ the label information through lexicon-based sentiment computing. CONCLUSIONS The model achieves high accuracy on both the ‘agency’ and ‘social’ labels, and we also make comparisons with several popular embedding models such as ELMo and GPT. Building on our work, happiness can be studied at a more fine-grained level.

Text: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning

10.31234/osf.io/293kt ◽

2021 ◽

Author(s):

Oscar Nils Erik Kjell ◽

H. Andrew Schwartz ◽

Salvatore Giorgi

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Rating Scale ◽

State Of The Art ◽

R Package ◽

Language Models ◽

Categorical Variables ◽

Human Language

The language that individuals use for expressing themselves contains rich psychological information. Recent significant advances in Natural Language Processing (NLP) and Deep Learning (DL), namely transformers, have resulted in large performance gains in tasks related to understanding natural language such as machine translation. However, these state-of-the-art methods have not yet been made easily accessible for psychology researchers, nor designed to be optimal for human-level analyses. This tutorial introduces text (www.r-text.org), a new R-package for analyzing and visualizing human language using transformers, the latest techniques from NLP and DL. Text is both a modular solution for accessing state-of-the-art language models and an end-to-end solution catered for human-level analyses. Hence, text provides user-friendly functions tailored to test hypotheses in social sciences for both relatively small and large datasets. This tutorial describes useful methods for analyzing text, providing functions with reliable defaults that can be used off-the-shelf as well as providing a framework for the advanced users to build on for novel techniques and analysis pipelines. The reader learns about six methods: 1) textEmbed: to transform text to traditional or modern transformer-based word embeddings (i.e., numeric representations of words); 2) textTrain: to examine the relationships between text and numeric/categorical variables; 3) textSimilarity and 4) textSimilarityTest: to compute semantic similarity scores between texts and to test the significance of the difference in meaning between two sets of texts; and 5) textProjection and 6) textProjectionPlot: to examine and visualize text within the embedding space according to latent or specified construct dimensions (e.g., low to high rating scale scores).
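The idea of a semantic similarity score between two embedded texts can be illustrated with a short, language-agnostic sketch. The abstract does not state which similarity measure the package uses, so cosine similarity is an assumption here, chosen because it is a common measure for embeddings. The code is plain Python, does not call the R package, and the vectors are made-up toy values rather than real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical 3-dimensional "embeddings" of two texts, for illustration only.
    text_a = [0.2, 0.7, 0.1]
    text_b = [0.3, 0.6, 0.2]
    print(cosine_similarity(text_a, text_b))
```

Because cosine similarity depends only on direction, two embeddings that differ in magnitude but point the same way are scored as identical in meaning under this measure.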

Comparative Analysis of Fine-tuned Deep Learning Language Models for ICD-10 classification task for Bulgarian Language

10.26615/978-954-452-072-4_162 ◽

2021 ◽

Author(s):

Boris Velichkov ◽

Sylvia Vassileva ◽

Simeon Gerginov ◽

Boris Kraychev ◽

...

Keyword(s):

Deep Learning ◽

Comparative Analysis ◽

Language Models ◽

Classification Task ◽

ICD-10 ◽

Learning Language

HIDING CRITICAL INFORMATION WHEN TRAINING LANGUAGE MODELS

EurasianUnionScientists ◽

10.31618/esu.2413-9335.2021.1.86.1349 ◽

2021 ◽

pp. 15-18

Author(s):

A. Evtushenko

Keyword(s):

Natural Language ◽

Language Processing ◽

Text Processing ◽

Language Model ◽

Personal Data ◽

Language Models ◽

Training Dataset ◽

Critical Information ◽

Research Company ◽

Learning Language

Machine learning language models are combinations of algorithms and neural networks designed for processing text written in natural language (Natural Language Processing, NLP). In 2020, GPT-3, the largest language model from the artificial intelligence research company OpenAI, was released, with up to 175 billion parameters. Increasing the number of parameters by more than 100 times made it possible to improve the quality of generated texts to a level that is hard to distinguish from human-written texts. It is noteworthy that this model was trained on a dataset collected mainly from open sources on the Internet, the volume of which is estimated at 570 GB. This article discusses the problem of memorizing critical information, in particular the personal data of individuals, at the stage of training large language models (GPT-2/3 and derivatives). It also describes an algorithmic approach to solving this problem, which consists of additional preprocessing of the training dataset and refinement of the model inference in the context of generating pseudo-personal data and embedding it into the results of summarization, text generation, question answering, and other seq2seq tasks.

Evaluation of Methods for Protein Representation Learning: A Quantitative Analysis

10.1101/2020.10.28.359828 ◽

2020 ◽

Author(s):

Serbulent Unsal ◽

Heval Ataş ◽

Muammer Albayrak ◽

Kemal Turhan ◽

Aybar C. Acar ◽

...

Keyword(s):

Deep Learning ◽

Language Processing ◽

Complex Traits ◽

Representation Learning ◽

Data Representation ◽

Language Models ◽

Learning Methods ◽

Advantages And Disadvantages ◽

Data Representations ◽

Novel Methods

Data-centric approaches have been utilized to develop predictive methods for elucidating uncharacterized aspects of proteins such as their functions, biophysical properties, subcellular locations and interactions. However, studies indicate that the performance of these methods should be further improved to effectively solve complex problems in biomedicine and biotechnology. A data representation method can be defined as an algorithm that calculates numerical feature vectors for samples in a dataset, to be later used in quantitative modelling tasks. Data representation learning methods do this by training and using a model that employs statistical and machine/deep learning algorithms. These novel methods mostly take inspiration from the data-driven language models that have yielded ground-breaking improvements in the field of natural language processing. Lately, these learned data representations have been applied to the field of protein informatics and have displayed highly promising results in terms of extracting complex traits of proteins regarding sequence-structure-function relations. In this study, we conducted a detailed investigation of protein representation learning methods, by first categorizing and explaining each approach, and then conducting benchmark analyses on: (i) inferring semantic similarities between proteins, (ii) predicting ontology-based protein functions, and (iii) classifying drug target protein families. We examine the advantages and disadvantages of each representation approach over the benchmark results. Finally, we discuss current challenges and suggest future directions. We believe the conclusions of this study will help researchers in applying machine/deep learning-based representation techniques on protein data for various types of predictive tasks. Furthermore, we hope it will demonstrate the potential of machine learning-based data representations for protein science and inspire the development of novel methods/tools to be utilized in the fields of biomedicine and biotechnology.

Language Models for the Prediction of SARS-CoV-2 Inhibitors

10.1101/2021.12.10.471928 ◽

2021 ◽

Author(s):

Andrew E Blanchard ◽

John Gounley ◽

Debsindhu Bhowmik ◽

Mayanka Chandra Shekar ◽

Isaac Lyngaas ◽

...

Keyword(s):

Deep Learning ◽

Binding Affinity ◽

Language Model ◽

Specific Protein ◽

Language Models ◽

Peak Performance ◽

Protein Targets ◽

Training Time ◽

Mixed Precision ◽

Learning Language

The COVID-19 pandemic highlights the need for computational tools to automate and accelerate drug design for novel protein targets. We leverage deep learning language models to generate and score drug candidates based on predicted protein binding affinity. We pre-trained a deep learning language model (BERT) on ~9.6 billion molecules and achieved peak performance of 603 petaflops in mixed precision. Our work reduces pre-training time from days to hours, compared to previous efforts with this architecture, while also increasing the dataset size by nearly an order of magnitude. For scoring, we fine-tuned the language model using an assembled set of thousands of protein targets with binding affinity data and searched for inhibitors of specific protein targets, SARS-CoV-2 Mpro and PLpro. We utilized a genetic algorithm approach for finding optimal candidates using the generation and scoring capabilities of the language model. Our generalizable models accelerate the identification of inhibitors for emerging therapeutic targets.

Towards End-to-end Deep Learning Analysis of Electrical Machines

10.36227/techrxiv.13134932.v1 ◽

2020 ◽

Author(s):

Nikita Gabdullin ◽

Sadjad Madanzadeh ◽

Aleksey Vilkin

Keyword(s):

Deep Learning ◽

Language Processing ◽

Prediction Accuracy ◽

Electrical Machines ◽

Shape Prediction ◽

Research Fields ◽

Comparable Accuracy ◽

End To End ◽

Self Driving Cars ◽

Assisted Analysis

Convolutional Neural Networks (CNNs) and Deep Learning (DL) revolutionized numerous research fields including robotics, natural language processing, self-driving cars, healthcare, and others. However, DL is still relatively under-researched in fields such as physics and engineering. Recent works on DL-assisted analysis showed emerging interest and enormous potential of CNN applications. This paper explores the possibility of developing an end-to-end DL pipeline for the analysis of electrical machines. The CNNs are trained on conventional finite element analysis (FEA) data to predict the output torque curves of electric machines. FEA is only used for dataset collection and CNN training, whereas the analysis is done solely using CNNs. The required depth in CNN architecture is studied by comparing a simplistic CNN with three ResNet architectures. The effects of dataset balancing and data normalization are studied, and torque clipping inspired by offset normalization is proposed to ease CNN training and improve the prediction accuracy. The relation between architecture depth and accuracy is identified, showing that deeper CNNs improve the curve shape prediction accuracy even after torque magnitude prediction accuracy saturates. Over 90% accuracy for analysis conducted in under a minute is reported for CNNs, whereas FEA of comparable accuracy required 200 hours. Predicting multidimensional outputs can improve CNN performance, which is essential for multiparameter optimization of electrical machines.

KM-BERT: A Pre-trained BERT for Korean Medical Natural Language Processing (Preprint)

10.2196/preprints.31223 ◽

2021 ◽

Author(s):

Yoojoong Kim ◽

Jeong Moon Lee ◽

Moon Joung Jang ◽

Yun Jin Yum ◽

Jong-Ho Kim ◽

...

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Pearson Correlation ◽

Language Model ◽

Language Models ◽

Korean Language ◽

Medical Texts ◽

Proposed Model

BACKGROUND With advances in deep learning and natural language processing, analyzing medical texts is becoming increasingly important. Nonetheless, despite the importance of medical texts, a study on medical-specific language models has not yet been conducted. OBJECTIVE Korean medical text is highly difficult to analyze because of the agglutinative characteristics of the language as well as the complex terminologies in the medical domain. To solve this problem, we collected a Korean medical corpus and used it to train language models. METHODS In this paper, we present a Korean medical language model based on deep learning natural language processing. The proposed model was trained using the pre-training framework of BERT for the medical context based on a state-of-the-art Korean language model. RESULTS After pre-training, the proposed method showed increased accuracies of 0.147 and 0.148 for the masked language model with next sentence prediction. In the intrinsic evaluation, the next sentence prediction accuracy improved by 0.258, which is a remarkable enhancement. In addition, the extrinsic evaluation of Korean medical semantic textual similarity data showed a 0.046 increase in the Pearson correlation. CONCLUSIONS The results demonstrated the superiority of the proposed model for Korean medical natural language processing. We expect that our proposed model can be extended for application to various languages and domains.
