Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches

Juan Cruz-Benito; Sanjay Vishwakarma; Francisco Martin-Fernandez; Ismael Faro

doi:10.3390/ai2010001

Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches

AI ◽

10.3390/ai2010001 ◽

2021 ◽

Vol 2 (1) ◽

pp. 1-16

Author(s):

Juan Cruz-Benito ◽

Sanjay Vishwakarma ◽

Francisco Martin-Fernandez ◽

Ismael Faro

Keyword(s):

Deep Learning ◽

Learning Community ◽

Programming Languages ◽

Language Processing ◽

Code Generation ◽

Language Model ◽

Language Models ◽

Stochastic Gradient Descent ◽

Network Architectures ◽

Learning Architectures

In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engineering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the deep learning-enabled language models approach, we found a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD-Quasi-Recurrent Neural Networks (QRNNs), and Transformer while using transfer learning and different forms of tokenization to see how they behave in building language models using a Python dataset for code generation and filling mask tasks. Considering the results, we discuss each approach’s different strengths and weaknesses and what gaps we found to evaluate the language models or to apply them in a real programming context.

Download Full-text

Astrid

Proceedings of the VLDB Endowment ◽

10.14778/3436905.3436907 ◽

2020 ◽

Vol 14 (4) ◽

pp. 471-484

Author(s):

Suraj Shetiya ◽

Saravanan Thirumuruganathan ◽

Nick Koudas ◽

Gautam Das

Keyword(s):

Deep Learning ◽

Objective Function ◽

Pattern Matching ◽

Language Processing ◽

Language Model ◽

Language Models ◽

Selectivity Estimation ◽

Statistical Correlations ◽

Benchmark Datasets ◽

Traditional Approaches

Accurate selectivity estimation for string predicates is a long-standing research challenge in databases. Supporting pattern matching on strings (such as prefix, substring, and suffix) makes this problem much more challenging, thereby necessitating a dedicated study. Traditional approaches often build pruned summary data structures such as tries followed by selectivity estimation using statistical correlations. However, this produces insufficiently accurate cardinality estimates resulting in the selection of sub-optimal plans by the query optimizer. Recently proposed deep learning based approaches leverage techniques from natural language processing such as embeddings to encode the strings and use it to train a model. While this is an improvement over traditional approaches, there is a large scope for improvement. We propose Astrid, a framework for string selectivity estimation that synthesizes ideas from traditional and deep learning based approaches. We make two complementary contributions. First, we propose an embedding algorithm that is query-type (prefix, substring, and suffix) and selectivity aware. Consider three strings 'ab', 'abc' and 'abd' whose prefix frequencies are 1000, 800 and 100 respectively. Our approach would ensure that the embedding for 'ab' is closer to 'abc' than 'abd'. Second, we describe how neural language models could be used for selectivity estimation. While they work well for prefix queries, their performance for substring queries is sub-optimal. We modify the objective function of the neural language model so that it could be used for estimating selectivities of pattern matching queries. We also propose a novel and efficient algorithm for optimizing the new objective function. We conduct extensive experiments over benchmark datasets and show that our proposed approaches achieve state-of-the-art results.

Download Full-text

KM-BERT: A Pre-trained BERT for Korean Medical Natural Language Processing (Preprint)

10.2196/preprints.31223 ◽

2021 ◽

Author(s):

Yoojoong Kim ◽

Jeong Moon Lee ◽

Moon Joung Jang ◽

Yun Jin Yum ◽

Jong-Ho Kim ◽

...

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Pearson Correlation ◽

Language Model ◽

Language Models ◽

Korean Language ◽

Medical Texts ◽

Proposed Model

BACKGROUND With advances in deep learning and natural language processing, analyzing medical texts is becoming increasingly important. Nonetheless, a study on medical-specific language models has not yet been conducted given the importance of medical texts. OBJECTIVE Korean medical text is highly difficult to analyze because of the agglutinative characteristics of the language as well as the complex terminologies in the medical domain. To solve this problem, we collected a Korean medical corpus and used it to train language models. METHODS In this paper, we present a Korean medical language model based on deep learning natural language processing. The proposed model was trained using the pre-training framework of BERT for the medical context based on a state-of-the-art Korean language model. RESULTS After pre-training, the proposed method showed increased accuracies of 0.147 and 0.148 for the masked language model with next sentence prediction. In the intrinsic evaluation, the next sentence prediction accuracy improved by 0.258, which is a remarkable enhancement. In addition, the extrinsic evaluation of Korean medical semantic textual similarity data showed a 0.046 increase in the Pearson correlation. CONCLUSIONS The results demonstrated the superiority of the proposed model for Korean medical natural language processing. We expect that our proposed model can be extended for application to various languages and domains.

Download Full-text

Rare Words: A Major Problem for Contextualized Embeddings and How to Fix it by Attentive Mimicking

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6403 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8766-8774 ◽

Cited By ~ 1

Author(s):

Timo Schick ◽

Hinrich Schütze

Keyword(s):

Neural Network ◽

Natural Language Processing ◽

Language Processing ◽

Deep Neural Network ◽

Language Model ◽

Language Modeling ◽

Fine Tuning ◽

Language Models ◽

Network Architectures ◽

Semantic Properties

Pretraining deep neural network architectures with a language modeling objective has brought large improvements for many natural language processing tasks. Exemplified by BERT, a recently proposed such architecture, we demonstrate that despite being trained on huge amounts of data, deep language models still struggle to understand rare words. To fix this problem, we adapt Attentive Mimicking, a method that was designed to explicitly learn embeddings for rare words, to deep language models. In order to make this possible, we introduce one-token approximation, a procedure that enables us to use Attentive Mimicking even when the underlying language model uses subword-based tokenization, i.e., it does not assign embeddings to all words. To evaluate our method, we create a novel dataset that tests the ability of language models to capture semantic properties of words without any task-specific fine-tuning. Using this dataset, we show that adding our adapted version of Attentive Mimicking to BERT does substantially improve its understanding of rare words.

Download Full-text

Survey on Neural Network Architectures with Deep Learning

Journal of Soft Computing Paradigm - September 2019 ◽

10.36548/jscp.2020.3.007 ◽

2020 ◽

Vol 2 (3) ◽

pp. 186-194

Author(s):

Smys S. ◽

Joy Iong Zong Chen ◽

Subarna Shakya

Keyword(s):

Deep Learning ◽

Language Processing ◽

Visual Recognition ◽

Research Work ◽

Network Architectures ◽

Significant Performance ◽

Recent Trends ◽

Cost Efficient ◽

The Cost ◽

Learning Architectures

In the present research era, machine learning is an important and unavoidable zone where it provides better solutions to various domains. In particular deep learning is one of the cost efficient, effective supervised learning model, which can be applied to various complicated issues. Since deep learning has various illustrative features and it doesn’t depend on any limited learning methods which helps to obtain better solutions. As deep learning has significant performance and advancements it is widely used in various applications like image classification, face recognition, visual recognition, language processing, speech recognition, object detection and various science, business analysis, etc., This survey work mainly provides an insight about deep learning through an intensive analysis of deep learning architectures and its characteristics along with its limitations. Also, this research work analyses recent trends in deep learning through various literatures to explore the present evolution in deep learning models.

Download Full-text

Two New Large Corpora for Vietnamese Aspect-based Sentiment Analysis at Sentence Level

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3446678 ◽

2021 ◽

Vol 20 (4) ◽

pp. 1-22

Author(s):

Dang Van Thin ◽

Ngan Luu-Thuy Nguyen ◽

Tri Minh Truong ◽

Lac Si Le ◽

Duy Tin Vo

Keyword(s):

Sentiment Analysis ◽

Language Processing ◽

Network Architectures ◽

Low Resource ◽

The Neural Network ◽

Sentence Level ◽

Push Forward ◽

Polarity Classification ◽

Learning Architectures ◽

Single Approach

Aspect-based sentiment analysis has been studied in both research and industrial communities over recent years. For the low-resource languages, the standard benchmark corpora play an important role in the development of methods. In this article, we introduce two benchmark corpora with the largest sizes at sentence-level for two tasks: Aspect Category Detection and Aspect Polarity Classification in Vietnamese. Our corpora are annotated with high inter-annotator agreements for the restaurant and hotel domains. The release of our corpora would push forward the low-resource language processing community. In addition, we deploy and compare the effectiveness of supervised learning methods with a single and multi-task approach based on deep learning architectures. Experimental results on our corpora show that the multi-task approach based on BERT architecture outperforms the neural network architectures and the single approach. Our corpora and source code are published on this footnoted site. 1

Download Full-text

A Review of Deep Learning with Special Emphasis on Architectures, Applications and Recent Trends

10.20944/preprints201902.0233.v1 ◽

2019 ◽

Cited By ~ 5

Author(s):

Saptarshi Sengupta ◽

Sanchita Basak ◽

Pallabi Saikia ◽

Sayak Paul ◽

Vasilios Tsalavoutis ◽

...

Keyword(s):

Deep Learning ◽

Power Systems ◽

Learning Community ◽

Language Processing ◽

Public Perception ◽

Structure Learning ◽

Financial Time Series ◽

Anomalous Behavior ◽

Connectionist Networks ◽

Systems Research

Deep learning has taken over - both in problems beyond the realm of traditional, hand-crafted machine learning paradigms as well as in capturing the imagination of the practitioner sitting on top of petabytes of data. While the public perception about the efficacy of deep neural architectures in complex pattern recognition tasks grows, sequentially up-to-date primers on the current state of affairs must follow. In this review, we seek to present a refresher of the many different stacked, connectionist networks that make up the deep learning architectures followed by automatic architecture optimization protocols using multi-agent approaches. Further, since guaranteeing system uptime is fast becoming an indispensable asset across multiple industrial modalities, we include an investigative section on testing neural networks for fault detection and subsequent mitigation. This is followed by an exploratory survey of several application areas where deep learning has emerged as a game-changing technology - be it anomalous behavior detection in financial applications or financial time-series forecasting, predictive and prescriptive analytics, medical imaging, natural language processing or power systems research. The thrust of this review is on outlining emerging areas of application-oriented research within the deep learning community as well as to provide a handy reference to researchers seeking to embrace deep learning in their work for what it is: statistical pattern recognizers with unparalleled hierarchical structure learning capacity with the ability to scale with information.

Download Full-text

Senti-BAS: A BERT-based model with sentiment computing for happiness research (Preprint)

10.2196/preprints.27914 ◽

2021 ◽

Author(s):

Zeyuan Zeng ◽

Yijia Zhang ◽

Liang Yang ◽

Hongfei Lin

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Language Processing ◽

High Accuracy ◽

Language Models ◽

Fine Grained ◽

Label Information ◽

Common Criterion ◽

Text Content ◽

Sentiment Computing

BACKGROUND Happiness becomes a rising topic that we all care about recently. It can be described in various forms. For the text content, it is an interesting subject that we can do research on happiness by utilizing natural language processing (NLP) methods. OBJECTIVE As an abstract and complicated emotion, there is no common criterion to measure and describe happiness. Therefore, researchers are creating different models to study and measure happiness. METHODS In this paper, we present a deep-learning based model called Senti-BAS (BERT embedded Bi-LSTM with self-Attention mechanism along with the Sentiment computing). RESULTS Given a sentence that describes how a person felt happiness recently, the model can classify the happiness scenario in the sentence with two topics: was it controlled by the author (label ‘agency’), and was it involving other people (label ‘social’). Besides language models, we employ the label information through sentiment computing based on lexicon. CONCLUSIONS The model performs with a high accuracy on both ‘agency’ and ‘social’ labels, and we also make comparisons with several popular embedding models like Elmo, GPT. Depending on our work, we can study the happiness at a more fine-grained level.

Download Full-text

Better Word Representation Vectors Using Syllabic Alphabet: A Case Study of Swahili

Applied Sciences ◽

10.3390/app9183648 ◽

2019 ◽

Vol 9 (18) ◽

pp. 3648

Author(s):

Casper S. Shikali ◽

Zhou Sijie ◽

Liu Qihe ◽

Refuoe Mokhosi

Keyword(s):

Language Processing ◽

Critical Role ◽

Language Model ◽

Central Africa ◽

Spoken Language ◽

Language Models ◽

Word Embeddings ◽

Word Representation

Deep learning has extensively been used in natural language processing with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, which is a low resource and widely spoken language in East and Central Africa. This study proposed novel word embeddings from syllable embeddings (WEFSE) for Swahili to address the concern of word representation for agglutinative and syllabic-based languages. Inspired by the learning methodology of Swahili in beginner classes, we encoded respective syllables instead of characters, character n-grams or morphemes of words and generated quality word embeddings using a convolutional neural network. The quality of WEFSE was demonstrated by the state-of-art results in the syllable-aware language model on both the small dataset (31.229 perplexity value) and the medium dataset (45.859 perplexity value), outperforming character-aware language models. We further evaluated the word embeddings using word analogy task. To the best of our knowledge, syllabic alphabets have not been used to compose the word representation vectors. Therefore, the main contributions of the study are a syllabic alphabet, WEFSE, a syllabic-aware language model and a word analogy dataset for Swahili.

Download Full-text

COVID-19 sentiment analysis via deep learning during the rise of novel cases

PLoS ONE ◽

10.1371/journal.pone.0255615 ◽

2021 ◽

Vol 16 (8) ◽

pp. e0255615 ◽

Cited By ~ 1

Author(s):

Rohitash Chandra ◽

Aswin Krishna

Keyword(s):

Deep Learning ◽

Sentiment Analysis ◽

Short Term Memory ◽

Language Model ◽

Deep Understanding ◽

Language Models ◽

Catastrophic Events ◽

Social Scientists ◽

Significant Group ◽

Global Vector

Social scientists and psychologists take interest in understanding how people express emotions and sentiments when dealing with catastrophic events such as natural disasters, political unrest, and terrorism. The COVID-19 pandemic is a catastrophic event that has raised a number of psychological issues such as depression given abrupt social changes and lack of employment. Advancements of deep learning-based language models have been promising for sentiment analysis with data from social networks such as Twitter. Given the situation with COVID-19 pandemic, different countries had different peaks where rise and fall of new cases affected lock-downs which directly affected the economy and employment. During the rise of COVID-19 cases with stricter lock-downs, people have been expressing their sentiments in social media. This can provide a deep understanding of human psychology during catastrophic events. In this paper, we present a framework that employs deep learning-based language models via long short-term memory (LSTM) recurrent neural networks for sentiment analysis during the rise of novel COVID-19 cases in India. The framework features LSTM language model with a global vector embedding and state-of-art BERT language model. We review the sentiments expressed for selective months in 2020 which covers the major peak of novel cases in India. Our framework utilises multi-label sentiment classification where more than one sentiment can be expressed at once. Our results indicate that the majority of the tweets have been positive with high levels of optimism during the rise of the novel COVID-19 cases and the number of tweets significantly lowered towards the peak. We find that the optimistic, annoyed and joking tweets mostly dominate the monthly tweets with much lower portion of negative sentiments. The predictions generally indicate that although the majority have been optimistic, a significant group of population has been annoyed towards the way the pandemic was handled by the authorities.

Download Full-text

Deep learning model for metagenome fragment classification using spaced k-mers feature extraction

Jurnal Teknologi dan Sistem Komputer ◽

10.14710/jtsiskom.2020.13407 ◽

2020 ◽

Vol 8 (3) ◽

pp. 234-238

Author(s):

Nur Choiriyati ◽

Yandra Arkeman ◽

Wisnu Ananta Kusuma

Keyword(s):

Neural Network ◽

Feature Extraction ◽

Deep Learning ◽

Language Processing ◽

Computational Time ◽

Genus Level ◽

Computational Resources ◽

Learning Architectures ◽

Deep Learning Model

An open challenge in bioinformatics is the analysis of the sequenced metagenomes from the various environments. Several studies demonstrated bacteria classification at the genus level using k-mers as feature extraction where the highest value of k gives better accuracy but it is costly in terms of computational resources and computational time. Spaced k-mers method was used to extract the feature of the sequence using 111 1111 10001 where 1 was a match and 0 was the condition that could be a match or did not match. Currently, deep learning provides the best solutions to many problems in image recognition, speech recognition, and natural language processing. In this research, two different deep learning architectures, namely Deep Neural Network (DNN) and Convolutional Neural Network (CNN), trained to approach the taxonomic classification of metagenome data and spaced k-mers method for feature extraction. The result showed the DNN classifier reached 90.89 % and the CNN classifier reached 88.89 % accuracy at the genus level taxonomy.

Download Full-text