Recommendation to the SSH community: Take a linguist on board

2021 ◽  
Vol 45 (1) ◽  
Author(s):  
Jeannine Beeken

In this paper we address how Natural Language Processing (NLP) approaches and language technology can contribute to data services in different ways: from providing social science users with new approaches and tools to explore oral and textual data, to enhancing the search, findability, and retrieval of data sources. Using linguistic approaches, we are able to process data, for example with Automated Speech Recognition (ASR) and named entity recognizers (NER), extract key concepts and terms, and improve search strategies. We provide examples of how computational linguistics contributes to and facilitates the mining and analysis of oral or textual material, for example (transcribed) interviews or oral histories, and show how free open-source (OS) tools can be used very easily to gain a quick overview of the key features of a text, which can be further exploited as useful metadata.
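As a toy illustration of the last point, even a frequency-based term extractor in plain Python can give a quick overview of the key features of an interview transcript. This is a minimal sketch using only the standard library; the stop-word list and sample text are invented here, and the paper's actual open-source tools are not named in the abstract:

```python
from collections import Counter
import re

# Tiny invented stop-word list; real tools ship much larger ones.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "was", "were", "i", "we"}

def key_terms(text, n=5):
    """Return the n most frequent content words as a quick overview of a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    content = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
    return Counter(content).most_common(n)

# Invented snippet standing in for a transcribed oral-history interview.
interview = (
    "We moved to the village in 1952. The village school was small, "
    "and the school taught children from three villages."
)
print(key_terms(interview, 3))
```

The extracted terms ("village", "school", ...) are exactly the kind of lightweight output that can be stored as descriptive metadata alongside the transcript.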

Author(s):  
Ayush Srivastav ◽  
Hera Khan ◽  
Amit Kumar Mishra

The chapter provides an eloquent account of the major methodologies and advances in the field of Natural Language Processing. The most popular models that have been used over time for Natural Language Processing tasks are discussed along with their applications to specific tasks. The chapter begins with the fundamental concepts of regex and tokenization. It provides an insight into text preprocessing and its methodologies, such as stemming and lemmatization and stop-word removal, followed by part-of-speech tagging and named entity recognition. Further, this chapter elaborates on the concept of word embedding, its various types, and some common frameworks such as word2vec, GloVe, and fastText. A brief description of classification algorithms used in Natural Language Processing is provided next, followed by neural networks and their advanced forms, such as recursive neural networks and seq2seq models, that are used in computational linguistics. A brief description of chatbots and memory networks concludes the chapter.
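The opening steps of the pipeline the chapter describes (tokenization, stop-word removal, stemming) can be sketched in a few lines of Python. Note that the suffix-stripping stemmer below is a deliberately naive toy, not the Porter-style algorithms such a chapter would actually cover, and the stop-word list is invented:

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "and", "of", "are"}

def tokenize(text):
    # Regex-based tokenization: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(word):
    # Toy suffix stripping, NOT the Porter algorithm.
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            if suffix == "ies":
                return word[:-3] + "y"
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # tokenization -> stop-word removal -> stemming
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("The cats are chasing studies of embeddings"))
```

Over-stemming ("chasing" losing its final vowel) is visible even in this toy, which is precisely why lemmatization is usually discussed alongside stemming.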


2020 ◽  
Vol 46 (1) ◽  
pp. 1-10
Author(s):  
Dhafar Hamed Abd ◽  
Ahmed T. Sadiq ◽  
Ayad R. Abbas

Text classification and sentiment analysis are nowadays considered among the most popular Natural Language Processing (NLP) tasks. Such techniques play a significant role in human activities and have an impact on daily behaviour. Articles in different fields, such as politics and business, represent different opinions according to the writer's tendency, and a huge amount of data can be acquired through that differentiation, enabling the political orientation of an online article to be managed automatically. However, no corpus for political categorization has previously been directed towards this task in Arabic, due to the lack of rich representative resources for training an Arabic text classifier. We therefore introduce the Political Arabic Articles Dataset (PAAD) of textual data collected from newspapers, social networks, general forums, and ideology websites. The dataset consists of 206 articles distributed into three categories (Reform, Conservative, and Revolutionary) that we offer to the research community on Arabic computational linguistics. We anticipate that this dataset will be a great aid for a variety of NLP tasks on Modern Standard Arabic, in particular political text classification. We present the data in raw form and as an Excel file in four versions: V1 raw data, V2 preprocessing, V3 root stemming, and V4 light stemming.
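A minimal bag-of-words Naive Bayes classifier illustrates the kind of three-way political classification such a dataset is intended to support. This is a sketch only: the training snippets below are invented English placeholders rather than PAAD articles, and the paper does not prescribe this particular classifier:

```python
import math
from collections import Counter, defaultdict

# Toy stand-ins for the three PAAD categories; texts are invented.
train = [
    ("reform", "gradual change through new policy and law"),
    ("conservative", "preserve tradition and existing law"),
    ("revolutionary", "radical change overthrow the old order"),
]

class NaiveBayes:
    def __init__(self, samples):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()
        for label, text in samples:
            words = text.split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        def log_prob(label):
            total = sum(self.word_counts[label].values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
            for w in text.split():
                # Laplace smoothing over the shared vocabulary
                score += math.log((self.word_counts[label][w] + 1) / (total + len(self.vocab)))
            return score
        return max(self.class_counts, key=log_prob)

clf = NaiveBayes(train)
print(clf.predict("radical overthrow of the order"))
```

The dataset's V2-V4 variants (preprocessed, root-stemmed, light-stemmed) would feed into exactly this step: the choice of stemming changes what `text.split()` sees and hence the class statistics.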


Author(s):  
Lars Borin ◽  
Dimitrios Kokkinakis

In this chapter, the authors describe the development and application of language technology for intelligent information access to the content of digitized cultural heritage collections in the form of Swedish classical literary works. This technology offers sophisticated and flexible support functions to literary scholars and researchers. The authors focus on one kind of text processing technology (named entity recognition) and one research field (literary onomastics), but try to argue that the techniques involved are quite general and can be further developed in a number of directions. This way, the authors aim at supporting the users of digitized literature collections with tools that enable semantic search, browsing and indexing of texts. In this sense, the authors offer new ways for exploring the large volumes of literary texts being made available through national cultural heritage digitization projects.
Keywords: Language technology; Computational linguistics; Natural language processing; Literary onomastics; Named entity recognition; Corpus linguistics; Corpus annotation; Digital resources; Text technology; Cultural heritage
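The combination the chapter describes (entity recognition feeding semantic search) can be sketched as an inverted index from recognized names to the works that mention them. This is a toy: the capitalized-token heuristic below stands in for a trained NER model, and the corpus snippets are invented, not the authors' Swedish collection:

```python
import re
from collections import defaultdict

# Invented snippets standing in for digitized literary works.
corpus = {
    "work1": "Anna travelled from Stockholm to Uppsala with Karl.",
    "work2": "Karl later wrote letters to Anna from Gothenburg.",
}

def naive_entities(text):
    """Very naive name spotting: capitalized alphabetic tokens.
    A real system would use a trained NER model with type labels."""
    return set(re.findall(r"[A-Z][a-z]+", text))

# Inverted index: entity name -> set of work identifiers.
index = defaultdict(set)
for doc_id, text in corpus.items():
    for name in naive_entities(text):
        index[name].add(doc_id)

print(sorted(index["Anna"]))  # which works mention Anna
```

For literary onomastics, the index keys are themselves the object of study: the distribution of personal names and place names across an author's works.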


2019 ◽  
Vol 27 (3) ◽  
pp. 457-470 ◽  
Author(s):  
Stephen Wu ◽  
Kirk Roberts ◽  
Surabhi Datta ◽  
Jingcheng Du ◽  
Zongcheng Ji ◽  
...  

Abstract
Objective: This article methodically reviews the literature on deep learning (DL) for natural language processing (NLP) in the clinical domain, providing quantitative analysis to answer 3 research questions concerning methods, scope, and context of current research.
Materials and Methods: We searched MEDLINE, EMBASE, Scopus, the Association for Computing Machinery Digital Library, and the Association for Computational Linguistics Anthology for articles using DL-based approaches to NLP problems in electronic health records. After screening 1,737 articles, we collected data on 25 variables across 212 papers.
Results: DL in clinical NLP publications more than doubled each year, through 2018. Recurrent neural networks (60.8%) and word2vec embeddings (74.1%) were the most popular methods; the information extraction tasks of text classification, named entity recognition, and relation extraction were dominant (89.2%). However, there was a “long tail” of other methods and specific tasks. Most contributions were methodological variants or applications, but 20.8% were new methods of some kind. The earliest adopters were in the NLP community, but the medical informatics community was the most prolific.
Discussion: Our analysis shows growing acceptance of deep learning as a baseline for NLP research, and of DL-based NLP in the medical community. A number of common associations were substantiated (eg, the preference of recurrent neural networks for sequence-labeling named entity recognition), while others were surprisingly nuanced (eg, the scarcity of French language clinical NLP with deep learning).
Conclusion: Deep learning has not yet fully penetrated clinical NLP and is growing rapidly. This review highlighted both the popular and unique trends in this active field.


Linguistics ◽  
2019 ◽  
Author(s):  
Jane Chandlee

Much like the term “computational linguistics”, the term “computational phonology” has come to mean different things to different people. Research grounded in a variety of methodologies and formalisms can be included in its scope. The common thread of the research that falls under this umbrella term is the use of computational methods to investigate questions of interest in phonology, primarily how to delimit the set of possible phonological patterns from the larger set of “logically possible” patterns and how those patterns are learned. Computational phonology arguably began with the foundational result that Sound Pattern of English (SPE) rules are regular relations (provided they can’t recursively apply to their own structural change), which means they can be modeled with finite-state transducers (FSTs) and that a system of ordered rules can be composed into a single FST. The significance of this result can be seen in the prominence of finite-state models both in theoretical phonology research and in more applied areas like natural language processing and human language technology. The shift in the field of phonology from rule-based grammars to constraint-based frameworks like Optimality Theory (OT) initially sparked interest in the question of how to model OT with FSTs and thereby preserve the noted restriction of phonology to the complexity level of regular. But an additional point of interest for computational work on OT stemmed from the ways in which its architecture readily lends itself to the development of learning algorithms and models, including statistical approaches that address recognized challenges such as gradient acceptability, process optionality, and the learning of underlying forms and hidden structure. Another line of research has taken on the question of to what extent phonology is not just regular, but subregular, meaning describable with proper subclasses of the regular languages and relations. 
The advantages of subregular modeling of phonological phenomena are argued to be stronger typological explanations, in that the computational properties that establish the subclasses as properly subregular restrict the kinds of phenomena that can be described in desirable ways. Also, these same restrictions lead directly to provably correct learning algorithms. Once again this work has made extensive use of the finite-state formalism, but it has also employed logical characterizations that more readily extend from strings to non-linear phenomena such as autosegmental representations and syllable structure.
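The foundational result mentioned above (SPE-style rules as regular relations, with a system of ordered rules composing into a single FST) can be illustrated with regex rewrites in Python. Real computational phonology work uses finite-state toolkits such as foma or HFST; the two rules below are invented for illustration, and function composition here stands in for genuine transducer composition:

```python
import re

# Two toy ordered SPE-style rules (invented for illustration):
#   Rule 1: t -> d / V _ V   (intervocalic voicing)
#   Rule 2: d -> r / V _ V   (fed by Rule 1)
V = "aeiou"

def rule1(form):
    return re.sub(rf"([{V}])t([{V}])", r"\1d\2", form)

def rule2(form):
    return re.sub(rf"([{V}])d([{V}])", r"\1r\2", form)

def compose(*rules):
    """Apply rules in order; with true FSTs this composition
    yields a single transducer computing the same mapping."""
    def grammar(form):
        for rule in rules:
            form = rule(form)
        return form
    return grammar

grammar = compose(rule1, rule2)
print(grammar("ata"))  # Rule 1 feeds Rule 2: t -> d -> r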


Author(s):  
Piotr Malak

Digital humanities and information visualization rely on huge sets of digital data, mostly delivered in text form. Although computational linguistics provides many valuable tools for text processing, the initial phase (text preprocessing) is very involved and time-consuming. The problems arise due to a human factor: they are not always outright errors; there is also inconsistency in forms, which affects data quality. In this chapter, the author describes and discusses the main issues that arise during the preprocessing phase of gathering textual data for InfoVis. Selected examples of InfoVis applications are presented. Alongside the problems found in raw, original data, possible solutions are also discussed. Canonical approaches used in text preprocessing, common issues affecting the process, and ways to prevent them are also presented, as is the quality of data from different sources. The content of this chapter is the result of several years of practical experience in natural language processing gained during the realization of different projects and evaluation campaigns.
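A typical repair of the kind discussed (human-introduced inconsistency in forms rather than outright error) can be sketched with standard-library normalization. The specific cleanup steps are illustrative assumptions, not the author's actual pipeline:

```python
import re
import unicodedata

def normalize(record):
    """Toy cleanup of inconsistencies often seen in raw textual data:
    compatibility characters (e.g. non-breaking spaces), curly quotes,
    and irregular whitespace."""
    text = unicodedata.normalize("NFKC", record)       # NBSP -> plain space, etc.
    text = text.replace("\u201c", '"').replace("\u201d", '"')  # curly -> straight quotes
    text = re.sub(r"\s+", " ", text).strip()           # collapse runs of whitespace
    return text

raw = "  \u201cDigital  Humanities\u201d data\u00a0quality  "
print(normalize(raw))
```

Two records that differ only in such invisible variants would otherwise count as distinct values and silently distort any visualization built on top of them.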


Author(s):  
Kesavan Vadakalur Elumalai ◽  
Niladri Sekhar Das ◽  
Mufleh Salem M. Alqahtani ◽  
Anas Maktabi

Part-of-speech (POS) tagging is an indispensable method of text processing. The main aim is to assign a part of speech to each word after considering its actual contextual syntactic-cum-semantic role in the piece of text where it occurs (Siemund & Claridge 1997). This is a useful strategy in language processing, language technology, machine learning, machine translation, and computational linguistics, as it generates a kind of output that enables a system to work with natural language texts with greater accuracy and success. Part-of-speech tagging is also known as ‘grammatical annotation’ and ‘word category disambiguation’ in some areas of linguistics where the analysis of the form and function of words is an important avenue for better comprehension and application of texts. Since the primary task of POS tagging involves assigning a tag to each word, manually or automatically, in a piece of natural language text, it has to pay adequate attention to the contexts where words are used. This is a tough challenge for a system, as a system normally fails to know how a word carries specific linguistic information in a text and what kind of larger syntactic frame it requires for its operation. The present paper takes this issue into consideration and tries to critically explore whether some of the well-known POS tagging systems are capable of handling this kind of challenge, and whether these systems are at all successful in assigning appropriate POS tags to words without accessing information from extratextual domains. The novelty of the paper lies in its attempt to look into some of the POS tagging schemes proposed so far to see if the systems are actually successful in dealing with the complexities involved in tagging words in texts.
It also checks whether the performance of these systems is better than manual POS tagging, and verifies whether the information and insights gathered from such enterprises are at all useful for enhancing our understanding of the identity and function of words used in texts. All of this is addressed in the paper with reference to some of the POS taggers available to us. Moreover, the paper tries to show how a POS-tagged text is useful in various applications, thereby creating a sense of awareness of the multifunctionality of tagged texts among language users.
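The core difficulty discussed here, word category disambiguation from context, can be shown with a toy lexicon-plus-context tagger. The lexicon, tagset, and disambiguation heuristic below are invented for illustration and are far simpler than the trained taggers the paper surveys:

```python
# Tiny lexicon; "book" is deliberately ambiguous between NOUN and VERB.
LEXICON = {
    "the": "DET", "a": "DET", "i": "PRON", "will": "AUX",
    "flight": "NOUN", "book": ("NOUN", "VERB"),
}

def tag(tokens):
    tags = []
    for tok in tokens:
        entry = LEXICON.get(tok.lower(), "NOUN")  # toy default for unknowns
        if isinstance(entry, tuple):
            # Disambiguate by left context: after a determiner prefer NOUN,
            # otherwise (e.g. after an auxiliary) prefer VERB.
            prev = tags[-1] if tags else None
            entry = "NOUN" if prev == "DET" else "VERB"
        tags.append(entry)
    return list(zip(tokens, tags))

print(tag("I will book the flight".split()))
print(tag("The book is here".split()))
```

Even this crude left-context rule resolves "book" differently in the two sentences, which is exactly the intratextual information the paper asks whether taggers can exploit without recourse to extratextual domains.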


In the present world, information technology is predominant and has the ability to change the future of the globe. As a futuristic technology, computational linguistics can change communication models among human beings. Owing to the changing context and the development of Natural Language Processing, various new doors have opened in the field of computational linguistics. Computational Linguistics (CL) increases the applicability of language technology to man-machine interaction. Globalization can convert the world into a small village, and for the interchange of human knowledge among various communities, automatic language processing plays a vital role. In this paper, we attempt to discuss the various dimensions of language technology and computational linguistics.

