Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing

Edoardo Maria Ponti; Helen O’Horan; Yevgeni Berzak; Ivan Vulić; Roi Reichart; Thierry Poibeau; Ekaterina Shutova; Anna Korhonen

doi:10.1162/coli_a_00357

Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing

Computational Linguistics ◽

10.1162/coli_a_00357 ◽

2019 ◽

Vol 45 (3) ◽

pp. 559-601 ◽

Cited By ~ 3

Author(s):

Edoardo Maria Ponti ◽

Helen O’Horan ◽

Yevgeni Berzak ◽

Ivan Vulić ◽

Roi Reichart ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Large Scale ◽

Machine Learning Algorithms ◽

Data Driven ◽

Extensive Literature ◽

New Approach ◽

Recent Developments ◽

Discrete Nature

Linguistic typology aims to capture structural and semantic variation across the world’s languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from the lack of human labeled resources. We present an extensive literature survey on the use of typological information in the development of NLP techniques. Our survey demonstrates that to date, the use of information in existing typological databases has resulted in consistent but modest improvements in system performance. We show that this is due to both intrinsic limitations of databases (in terms of coverage and feature granularity) and under-utilization of the typological features included in them. We advocate for a new approach that adapts the broad and discrete nature of typological categories to the contextual and continuous nature of machine learning algorithms used in contemporary NLP. In particular, we suggest that such an approach could be facilitated by recent developments in data-driven induction of typological knowledge.

Download Full-text

Applications of Natural Language Processing in Biodiversity Science

Advances in Bioinformatics ◽

10.1155/2012/391574 ◽

2012 ◽

Vol 2012 ◽

pp. 1-17 ◽

Cited By ~ 30

Author(s):

Anne E. Thessen ◽

Hong Cui ◽

Dmitry Mozzherin

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Information Extraction ◽

Language Processing ◽

Large Scale ◽

Morphological Characters ◽

Machine Learning Algorithms ◽

Biological Information ◽

Biological Knowledge ◽

Biodiversity Science

Centuries of biological knowledge are contained in the massive body of scientific literature, written for human-readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript will briefly discuss the key steps in applying information extraction tools to enhance biodiversity science.

Download Full-text

A Review and evaluation of Machine Translation methods for Lumasaaba

Journal of Digital Science ◽

10.33847/2686-8296.2.1_1 ◽

2020 ◽

pp. 3-17

Author(s):

Peter Nabende

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Machine Translation ◽

Language Processing ◽

Research Area ◽

Data Driven ◽

East African ◽

Data Set ◽

African Languages ◽

Translation Methods

Natural Language Processing for under-resourced languages is now a mainstream research area. However, there are limited studies on Natural Language Processing applications for many indigenous East African languages. As a contribution to covering the current gap of knowledge, this paper focuses on evaluating the application of well-established machine translation methods for one heavily under-resourced indigenous East African language called Lumasaaba. Specifically, we review the most common machine translation methods in the context of Lumasaaba including both rule-based and data-driven methods. Then we apply a state of the art data-driven machine translation method to learn models for automating translation between Lumasaaba and English using a very limited data set of parallel sentences. Automatic evaluation results show that a transformer-based Neural Machine Translation model architecture leads to consistently better BLEU scores than the recurrent neural network-based models. Moreover, the automatically generated translations can be comprehended to a reasonable extent and are usually associated with the source language input.

Download Full-text

Machine-learning as a validated tool to characterize individual differences in free recall of naturalistic events.

10.31234/osf.io/uygzv ◽

2021 ◽

Author(s):

Xinxu Shen ◽

Troy Houser ◽

David Victor Smith ◽

Vishnu P. Murty

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Individual Difference ◽

Language Processing ◽

Large Scale ◽

High Reliability ◽

Difference Analysis ◽

Universal Sentence ◽

Natural Language Processing Tool

The use of naturalistic stimuli, such as narrative movies, is gaining popularity in many fields, characterizing memory, affect, and decision-making. Narrative recall paradigms are often used to capture the complexity and richness of memory for naturalistic events. However, scoring narrative recalls is time-consuming and prone to human biases. Here, we show the validity and reliability of using a natural language processing tool, the Universal Sentence Encoder (USE), to automatically score narrative recall. We compared the reliability in scoring made between two independent raters (i.e., hand-scored) and between our automated algorithm and individual raters (i.e., automated) on trial-unique, video clips of magic tricks. Study 1 showed that our automated segmentation approaches yielded high reliability and reflected measures yielded by hand-scoring, and further that the results using USE outperformed another popular natural language processing tool, GloVe. In study two, we tested whether our automated approach remained valid when testing individual’s varying on clinically-relevant dimensions that influence episodic memory, age and anxiety. We found that our automated approach was equally reliable across both age groups and anxiety groups, which shows the efficacy of our approach to assess narrative recall in large-scale individual difference analysis. In sum, these findings suggested that machine learning approaches implementing USE are a promising tool for scoring large-scale narrative recalls and perform individual difference analysis for research using naturalistic stimuli.

Download Full-text

The Experience of Developing a Large-Scale Natural Language Processing System: Critique

The Kluwer International Series in Engineering and Computer Science - Natural Language Processing: The PLNLP Approach ◽

10.1007/978-1-4615-3170-8_7 ◽

1993 ◽

pp. 77-89 ◽

Cited By ~ 2

Author(s):

Stephen Richardson ◽

Lisa Braden-Harder

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Large Scale ◽

Processing System ◽

Natural Language Processing System

Download Full-text

Designing and Validating an Annotation Model of Dynamic Modality for English and Spanish: Issues and Problems

10.29007/pc58 ◽

2018 ◽

Author(s):

Julia Lavid ◽

Marta Carretero ◽

Juan Rafael Zamorano

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Large Scale ◽

Reliability Study ◽

Annotation Scheme ◽

High Degree ◽

Difficult Cases

In this paper we set forth an annotation model for dynamic modality in English and Spanish, given its relevance not only for contrastive linguistic purposes, but also for its impact on practical annotation tasks in the Natural Language Processing (NLP) community. An annotation scheme is proposed, which captures both the functional-semantic meanings and the language-specific realisations of dynamic meanings in both languages. The scheme is validated through a reliability study performed on a randomly selected set of one hundred and twenty sentences from the MULTINOT corpus, resulting in a high degree of inter-annotator agreement. We discuss our main findings and give attention to the difficult cases as they are currently being used to develop detailed guidelines for the large-scale annotation of dynamic modality in English and Spanish.

Download Full-text

Comparison of Templates with Word2vec in Finding Semantic Relations Between Words

Journal of Intelligent Systems with Applications ◽

10.54856/jiswa.201805007 ◽

2018 ◽

pp. 13-17

Author(s):

Kaan Ant ◽

Ugur Sogukpinar ◽

Mehmet Fatif Amasyali

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Large Scale ◽

Semantic Relations ◽

Template Method ◽

Semantic Relationships ◽

Semantic Spaces

The use of databases those containing semantic relationships between words is becoming increasingly widespread in order to make natural language processing work more effective. Instead of the word-bag approach, the suggested semantic spaces give the distances between words, but they do not express the relation types. In this study, it is shown how semantic spaces can be used to find the type of relationship and it is compared with the template method. According to the results obtained on a very large scale, while is_a and opposite are more successful for semantic spaces for relations, the approach of templates is more successful in the relation types at_location, made_of and non relational.

Download Full-text

A Comparative Analysis of Machine Learning Techniques for Spam Detection

International Journal of Advanced Research in Science, Communication and Technology ◽

10.48175/ijarsct-1308 ◽

2021 ◽

pp. 657-661

Author(s):

Rashida Ali ◽

Ibrahim Rampurawala ◽

Mayuri Wandhe ◽

Ruchika Shrikhande ◽

Arpita Bhatkar

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Comparative Analysis ◽

Natural Language ◽

Language Processing ◽

High Volume ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Spam Detection ◽

Learning Techniques

Internet provides a medium to connect with individuals of similar or different interests creating a hub. Since a huge hub participates on these platforms, the user can receive a high volume of messages from different individuals creating a chaos and unwanted messages. These messages sometimes contain a true information and sometimes false, which leads to a state of confusion in the minds of the users and leads to first step towards spam messaging. Spam messages means an irrelevant and unsolicited message sent by a known/unknown user which may lead to a sense of insecurity among users. In this paper, the different machine learning algorithms were trained and tested with natural language processing (NLP) to classify whether the messages are spam or ham.

Download Full-text

Natural Language Processing in Large-Scale Neural Models for Medical Screenings

Frontiers in Robotics and AI ◽

10.3389/frobt.2019.00062 ◽

2019 ◽

Vol 6 ◽

Cited By ~ 1

Author(s):

Catharina Marie Stille ◽

Trevor Bekolay ◽

Peter Blouw ◽

Bernd J. Kröger

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Large Scale ◽

Neural Models

Download Full-text

YouTube as a Source of Information in Understanding Autonomous Vehicle Consumers: Natural Language Processing Study

Transportation Research Record Journal of the Transportation Research Board ◽

10.1177/0361198119842110 ◽

2019 ◽

Vol 2673 (8) ◽

pp. 242-253 ◽

Cited By ~ 5

Author(s):

Subasish Das ◽

Anandi Dutta ◽

Tomas Lindheimer ◽

Mohammad Jalayer ◽

Zachary Elgart

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Automotive Industry ◽

Autonomous Vehicles ◽

Large Scale ◽

Keyword Search ◽

Autonomous Vehicle ◽

Perception Of Safety ◽

Automation Level

The automotive industry is currently experiencing a revolution with the advent and deployment of autonomous vehicles. Several countries are conducting large-scale testing of autonomous vehicles on private and even public roads. It is important to examine the attitudes and potential concerns of end users towards autonomous cars before mass deployment. To facilitate the transition to autonomous vehicles, the automotive industry produces many videos on its products and technologies. The largest video sharing website, YouTube.com, hosts many videos on autonomous vehicle technology. Content analysis and text mining of the comments related to the videos with large numbers of views can provide insight about potential end-user feedback. This study examines two questions: first, how do people view autonomous vehicles? Second, what polarities exist regarding (a) content and (b) automation level? The researchers found 107 videos on YouTube using a related keyword search and examined comments on the 15 most-viewed videos, which had a total of 60.9 million views and around 25,000 comments. The videos were manually clustered based on their content and automation level. This study used two natural language processing (NLP) tools to perform knowledge discovery from a bag of approximately seven million words. The key issues in the comment threads were mostly associated with efficiency, performance, trust, comfort, and safety. The perception of safety and risk increased in the textual contents when videos presented full automation level. Sentiment analysis shows mixed sentiments towards autonomous vehicle technologies, however, the positive sentiments were higher than the negative.

Download Full-text

Natural Language Processing Based Method for Clustering and Analysis of Aviation Safety Narratives

Aerospace ◽

10.3390/aerospace7100143 ◽

2020 ◽

Vol 7 (10) ◽

pp. 143

Author(s):

Rodrigo L. Rose ◽

Tejas G. Puranik ◽

Dimitri N. Mavris

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Safety Data ◽

Aviation Safety ◽

Data Driven ◽

Flight Safety ◽

Commercial Aviation ◽

Unique Challenge ◽

Aviation Safety Reporting System

The complexity of commercial aviation operations has grown substantially in recent years, together with a diversification of techniques for collecting and analyzing flight data. As a result, data-driven frameworks for enhancing flight safety have grown in popularity. Data-driven techniques offer efficient and repeatable exploration of patterns and anomalies in large datasets. Text-based flight safety data presents a unique challenge in its subjectivity, and relies on natural language processing tools to extract underlying trends from narratives. In this paper, a methodology is presented for the analysis of aviation safety narratives based on text-based accounts of in-flight events and categorical metadata parameters which accompany them. An extensive pre-processing routine is presented, including a comparison between numeric models of textual representation for the purposes of document classification. A framework for categorizing and visualizing narratives is presented through a combination of k-means clustering and 2-D mapping with t-Distributed Stochastic Neighbor Embedding (t-SNE). A cluster post-processing routine is developed for identifying driving factors in each cluster and building a hierarchical structure of cluster and sub-cluster labels. The Aviation Safety Reporting System (ASRS), which includes over a million de-identified voluntarily submitted reports describing aviation safety incidents for commercial flights, is analyzed as a case study for the methodology. The method results in the identification of 10 major clusters and a total of 31 sub-clusters. The identified groupings are post-processed through metadata-based statistical analysis of the learned clusters. The developed method shows promise in uncovering trends from clusters that are not evident in existing anomaly labels in the data and offers a new tool for obtaining insights from text-based safety data that complement existing approaches.

Download Full-text