Automatic lexical classification: bridging research and practice

Author(s):  
Anna Korhonen

Natural language processing (NLP)—the automatic analysis, understanding and generation of human language by computers—is vitally dependent on accurate knowledge about words. Because words change their behaviour between text types, domains and sub-languages, a fully accurate static lexical resource (e.g. a dictionary or word classification) is unattainable. Researchers are now developing techniques that could be used to automatically acquire or update lexical resources from textual data. If successful, the automatic approach could considerably enhance the accuracy and portability of language technologies, such as machine translation, text mining and summarization. This paper reviews recent and ongoing research in automatic lexical acquisition. Focusing on lexical classification, it discusses the many challenges that still need to be met before the approach can benefit NLP on a large scale.
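To make the idea of lexical classification concrete, the sketch below (with invented frame counts, not data from the reviewed work) clusters a handful of verbs into classes by the distribution of subcategorization frames they occur with, which is the kind of distributional evidence such acquisition systems exploit.

```python
# Illustrative sketch only: clustering verbs into lexical classes by the
# relative frequency of the subcategorization frames they occur with.
# The counts below are invented toy data.
import numpy as np
from sklearn.cluster import KMeans

# Rows: verbs; columns: counts for frames [NP, NP+PP, S-complement, infinitive]
verbs = ["give", "send", "believe", "think", "want", "offer"]
frame_counts = np.array([
    [120, 80,  2,  1],   # give
    [100, 90,  1,  2],   # send
    [ 30,  5, 90,  4],   # believe
    [ 20,  2, 95,  6],   # think
    [ 40,  3,  8, 85],   # want
    [110, 70,  3,  2],   # offer
], dtype=float)

# Normalize counts into per-verb frame distributions and cluster them
profiles = frame_counts / frame_counts.sum(axis=1, keepdims=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(profiles)

for verb, label in zip(verbs, labels):
    print(f"{verb}: class {label}")
```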

2008
Vol 14 (2)
pp. 253-281
Author(s):
XABIER ARTOLA
AITOR SOROA

Abstract The design and construction of lexical resources is a critical issue in Natural Language Processing (NLP). Real-world NLP systems need large-scale lexica that provide rich information about words and word senses at all levels: morphological, syntactic, lexical-semantic, etc. However, the construction of lexical resources is a difficult and costly task. The last decade has been strongly influenced by the notion of reusability, that is, the reuse of information from existing lexical resources when constructing new ones. It is unrealistic, however, to expect that the great variety of available lexical information resources could be converted into a single, standard representation schema in the near future. The purpose of this article is to present the ELHISA system, a software architecture for the integration of heterogeneous lexical information. We address, from the point of view of the information integration area, the problem of querying very different existing lexical information sources using a single, common query language. Integration in ELHISA is performed at the logical level, so the lexical resources themselves are not modified when they are integrated into the system. ELHISA is primarily defined as a consultation system for accessing structured lexical information, and therefore it does not have the capability to modify or update the underlying information. To support this, a General Conceptual Model (GCM) for describing diverse lexical data has been conceived. The GCM establishes a fixed vocabulary describing objects in the lexical information domain, their attributes, and the relationships among them. To integrate the lexical resources into the federation, a Source Conceptual Model (SCM) is built on top of each one, representing the lexical objects occurring in that particular source. To answer user queries, ELHISA must access the integrated resources and hence translate a query expressed in GCM terms into queries formulated in terms of the SCM of each source. The relation between the GCM and the SCMs is explicitly described by means of mapping rules called Content Description Rules. Data integration at the extensional level is achieved through a data cleansing process, which is needed to compare data arriving from different sources; the object identification step is carried out as part of this process. Based on this architecture, a prototype named ELHISA has been built, and five resources covering a broad scope have so far been integrated into it for testing purposes. The fact that such heterogeneous resources have been integrated into the system with ease shows, in the opinion of the authors, the suitability of the approach taken.
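As a rough illustration of the mediator idea, the following sketch (with invented rule and attribute names, not ELHISA's actual Content Description Rule syntax) rewrites a query posed in GCM terms into each source's own vocabulary, queries the unmodified sources, and reports the answers back in GCM terms.

```python
# Hypothetical mediator-style query rewriting in the spirit of ELHISA.
# All names, rules and records below are invented for illustration.

# Content-description-like rules: GCM attribute -> source-specific attribute
MAPPING_RULES = {
    "wordnet_like":    {"lemma": "word_form", "sense": "synset_gloss"},
    "dictionary_like": {"lemma": "headword",  "sense": "definition"},
}

# Toy stand-ins for the integrated lexical resources (left unmodified)
SOURCES = {
    "wordnet_like":    [{"word_form": "bank", "synset_gloss": "financial institution"}],
    "dictionary_like": [{"headword": "bank",  "definition": "edge of a river"}],
}

def translate(gcm_query: dict, source: str) -> dict:
    """Rewrite a GCM query into the Source Conceptual Model of one source."""
    rules = MAPPING_RULES[source]
    return {rules[attr]: value for attr, value in gcm_query.items()}

def answer(gcm_query: dict) -> list:
    """Query every federated source and report results back in GCM terms."""
    results = []
    for source, records in SOURCES.items():
        scm_query = translate(gcm_query, source)
        inverse = {v: k for k, v in MAPPING_RULES[source].items()}
        for record in records:
            if all(record.get(k) == v for k, v in scm_query.items()):
                results.append({inverse[k]: v for k, v in record.items()})
    return results

print(answer({"lemma": "bank"}))
```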


2021
Author(s):
Tanya Nijhawan
Girija Attigeri
Ananthakrishna T

Abstract Cyberspace is a vast soapbox for people to post anything that they witness in their day-to-day lives. Subsequently, it can be used as a very effective tool for detecting the stress levels of an individual based on the posts and comments he/she shares on social networking platforms. We leverage large-scale datasets of tweets to accomplish sentiment analysis with the aid of machine learning algorithms. We use a capable pre-trained deep learning model, BERT, to address the problems that come with sentiment classification. The BERT model outperforms many other well-known models for this task without any sophisticated architecture. We also adopt Latent Dirichlet Allocation, an unsupervised machine learning method that scans a group of documents, recognizes word and phrase patterns within them, and groups together words and similar expressions that best characterize the set of documents. This helps us predict which topic is linked to the textual data. With the aid of the suggested models, we will be able to detect the emotions of users online. We work primarily with Twitter data because Twitter is a platform where people frequently express their thoughts. In conclusion, this proposal aims to support mental well-being. The results are evaluated using various metrics at the macro and micro levels and indicate that the trained model detects the status of emotions based on social interactions.
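A compressed sketch of the two components is given below, using off-the-shelf defaults (the generic sentiment checkpoint loaded by the transformers pipeline and scikit-learn's LDA) rather than the authors' fine-tuned BERT model; the tweets are invented.

```python
# Sketch under assumptions: default pretrained sentiment pipeline (not the
# paper's fine-tuned BERT) and sklearn LDA; the tweets are made up.
from transformers import pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "Deadlines everywhere, I can't sleep anymore",
    "Had a lovely calm walk in the park today",
    "Exams next week and I feel completely overwhelmed",
]

# 1) Sentiment with a pretrained transformer classifier
classifier = pipeline("sentiment-analysis")
for tweet, result in zip(tweets, classifier(tweets)):
    print(tweet, "->", result["label"], round(result["score"], 3))

# 2) LDA topics over bag-of-words counts of the same texts
counts = CountVectorizer(stop_words="english").fit(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts.transform(tweets))
terms = counts.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-3:][::-1]]
    print(f"topic {k}:", ", ".join(top))
```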


ICAME Journal
2015
Vol 39 (1)
pp. 5-24
Author(s):
Dawn Archer
Merja Kytö
Alistair Baron
Paul Rayson

Abstract Corpora of Early Modern English have been collected and released for research for a number of years. With large-scale digitisation activities gathering pace in the last decade, much more historical textual data is now available for research on numerous topics, including historical linguistics and conceptual history. We summarise previous research which has shown that it is necessary to map historical spelling variants to modern equivalents in order to successfully apply natural language processing and corpus linguistics methods. Manual and semi-automatic methods have been devised to support this normalisation and standardisation process. We argue that it is important to develop a linguistically meaningful rationale to achieve good results from this process. In order to do so, we propose a number of guidelines for normalising corpora and show how these guidelines have been applied in the Corpus of English Dialogues.
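The toy normaliser below illustrates the general variant-to-modern mapping being discussed; the rewrite rules and modern lexicon are invented examples of common Early Modern English patterns, not the guidelines actually applied to the Corpus of English Dialogues.

```python
# Illustrative only: rule-plus-lexicon spelling normalisation. The rules and
# the tiny modern lexicon are invented examples, not the CED guidelines.
import re

MODERN_LEXICON = {"have", "us", "love", "joy", "very", "even"}

# Candidate rewriting rules reflecting common Early Modern English patterns
RULES = [
    (r"^v", "u"),            # "vs"    -> "us"
    (r"u(?=[aeiou])", "v"),  # "loue"  -> "love", "haue" -> "have"
    (r"ie$", "y"),           # "verie" -> "very"
    (r"^i(?=[aeiou])", "j"), # "ioy"   -> "joy"
]

def normalise(token: str) -> str:
    """Return a modern form if some rule maps the token into the lexicon."""
    if token in MODERN_LEXICON:
        return token
    for pattern, repl in RULES:
        candidate = re.sub(pattern, repl, token)
        if candidate in MODERN_LEXICON:
            return candidate
    return token  # leave unresolved variants untouched

for variant in ["vs", "loue", "verie", "ioy", "euen", "haue"]:
    print(variant, "->", normalise(variant))
```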


2017
Author(s):
Julia Marie Rohrer
Martin Brümmer
Stefan C. Schmukle
Jan Goebel
Gert Wagner

Open-ended questions have routinely been included in large-scale survey and panel studies, yet there is some perplexity about how to actually incorporate the answers to such questions into quantitative social science research. Tools developed recently in the domain of natural language processing offer a wide range of options for the automated analysis of such textual data, but their implementation has lagged behind. In this study, we demonstrate straightforward procedures that can be applied to process and analyze textual data for the purposes of quantitative social science research. Using more than 35,000 textual answers to the question “What else are you worried about?” from participants of the German Socio-economic Panel Study (SOEP), we (1) analyzed characteristics of respondents that determined whether they answered the open-ended question, (2) used the textual data to detect relevant topics that were reported by the respondents, and (3) linked the features of the respondents to the worries they reported in their textual data. The potential uses as well as the limitations of the automated analysis of textual data are discussed.
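A schematic version of steps (1)-(3) might look like the sketch below, run here on invented stand-ins for the SOEP variables and answers rather than the real survey data.

```python
# Sketch on invented stand-ins for SOEP respondent features and answers.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "age": [34, 61, 45, 29, 58, 40],
    "education_years": [16, 10, 12, 18, 9, 13],
    "answer": ["worried about my pension", "", "rising rents in the city",
               "climate change and drought", "", "my children's schooling"],
})
df["answered"] = (df["answer"] != "").astype(int)

# (1) Which respondents answer the open-ended question at all?
selection = LogisticRegression().fit(df[["age", "education_years"]], df["answered"])
print("coefficients (age, education):", selection.coef_.round(2))

# (2) What topics do the answers contain?
texts = df.loc[df["answered"] == 1, "answer"]
counts = CountVectorizer().fit(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(counts.transform(texts))

# (3) Link respondent features to the dominant topic of their answer
linked = df.loc[df["answered"] == 1, ["age", "education_years"]].copy()
linked["topic"] = topics.argmax(axis=1)
print(linked.groupby("topic").mean().round(1))
```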


2021
Vol 9 (1)
pp. 134-143
Author(s):
Miriam J. Metzger
Andrew J. Flanagin
Paul Mena
Shan Jiang
Christo Wilson

Research typically presumes that people believe misinformation and propagate it through their social networks. Yet a wide range of motivations for sharing misinformation might affect both its spread and people's belief in it. By examining research on motivations for sharing news information generally, and misinformation specifically, we derive a range of motivations that broaden current understandings of the sharing of misinformation to include factors that may to some extent mitigate the presumed dangers of misinformation for society. To illustrate the utility of our viewpoint, we report data from a preliminary study of people's dis/belief reactions to misinformation shared on social media, analysed using natural language processing. Analyses of over 2.5 million comments demonstrate that misinformation on social media is often disbelieved. These insights are leveraged to propose directions for future research that incorporate a more inclusive understanding of the various motivations and strategies for sharing misinformation socially in large-scale online networks.
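A deliberately simple stand-in for the kind of dis/belief coding described above is sketched below; the cue words and comments are illustrative only, and the authors' actual NLP pipeline is more sophisticated.

```python
# Illustrative only: lexicon-based dis/belief tagging of comments.
# The cue lists and example comments are invented.
DISBELIEF_CUES = {"fake", "false", "debunked", "hoax", "lie", "misleading"}
BELIEF_CUES = {"true", "exactly", "knew it", "finally", "proof"}

def label_comment(comment: str) -> str:
    """Tag a comment as expressing belief, disbelief, or neither."""
    text = comment.lower()
    disbelief = sum(cue in text for cue in DISBELIEF_CUES)
    belief = sum(cue in text for cue in BELIEF_CUES)
    if disbelief > belief:
        return "disbelief"
    if belief > disbelief:
        return "belief"
    return "unclear"

comments = [
    "This is a hoax, it was debunked weeks ago",
    "I knew it, finally some proof",
    "Interesting, sharing with my family",
]
for comment in comments:
    print(label_comment(comment), "<-", comment)
```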


1984
Vol 16 (1-2)
pp. 281-295
Author(s):
Donald C Gordon

Large-scale tidal power development in the Bay of Fundy has been given serious consideration for over 60 years. There has been a long history of productive interaction between environmental scientists and engineers during the many feasibility studies undertaken. Until recently, tidal power proposals were dropped on economic grounds. However, large-scale development in the upper reaches of the Bay of Fundy now appears to be economically viable, and a pre-commitment design program is highly likely in the near future. A large number of basic scientific research studies have been and are being conducted by government and university scientists. Likely environmental impacts have been examined by scientists and engineers together in a preliminary fashion on several occasions. A full environmental assessment will be conducted before a final decision is made, and the results will definitely influence the outcome.


2021
Vol 13 (3)
pp. 355
Author(s):
Weixian Tan
Borong Sun
Chenyu Xiao
Pingping Huang
Wei Xu
...

Classification based on polarimetric synthetic aperture radar (PolSAR) images is an emerging technology, and recent years have seen the introduction of various classification methods that have proven effective at identifying typical features of many terrain types. Among the many study regions, the Hunshandake Sandy Land in Inner Mongolia, China stands out for its vast area of sandy land, variety of ground objects and intricate structure, with more irregular characteristics than conventional land cover. To account for the particular surface features of the Hunshandake Sandy Land, an unsupervised classification method based on a new decomposition and large-scale spectral clustering with superpixels (ND-LSC) is proposed in this study. Firstly, the polarization scattering parameters are extracted through a new decomposition, rather than other decomposition approaches, which yields more accurate feature vector estimates. Secondly, large-scale spectral clustering is applied to cope with the vast area and complex terrain. More specifically, this involves an initial sub-step of superpixel generation via the Adaptive Simple Linear Iterative Clustering (ASLIC) algorithm, with the feature vectors combined with spatial coordinate information as input, followed by a sub-step of representative point selection and bipartite graph construction, after which the spectral clustering algorithm completes the classification task. Finally, testing and analysis are conducted on the RADARSAT-2 fully PolSAR dataset acquired over the Hunshandake Sandy Land in 2016. Both qualitative and quantitative experiments against several classification methods show that the proposed method significantly improves classification performance.
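A heavily reduced sketch of the clustering stage is shown below, with standard SLIC standing in for ASLIC and a random feature image standing in for the polarimetric decomposition parameters; it is not the authors' implementation, only an outline of superpixel generation followed by spectral clustering of superpixel-level features.

```python
# Sketch under assumptions: SLIC stands in for ASLIC, and a random 3-channel
# array stands in for the decomposition parameters of a PolSAR scene.
import numpy as np
from skimage.segmentation import slic
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
# Fake 3-channel "decomposition parameter" image, 64x64 pixels
features = rng.random((64, 64, 3))

# Superpixel generation on the feature image
segments = slic(features, n_segments=100, compactness=10, channel_axis=-1)

# One mean feature vector per superpixel
labels = np.unique(segments)
sp_features = np.array([features[segments == s].mean(axis=0) for s in labels])

# Spectral clustering of superpixels into terrain classes
classes = SpectralClustering(n_clusters=4, affinity="nearest_neighbors",
                             n_neighbors=10, random_state=0).fit_predict(sp_features)
print(dict(zip(labels[:10].tolist(), classes[:10].tolist())))
```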


2021
Vol 55 (1)
pp. 1-2
Author(s):
Bhaskar Mitra

Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking documents---or short passages---in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches when the query contains rare terms---such as a person's name or a product model number---not seen during training, and avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, retrieval involves extremely large collections---such as the document index of a commercial Web search engine---containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to retrieve efficiently from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks. We ground our contributions with a detailed survey of the growing body of neural IR literature [Mitra and Craswell, 2018]. Our key contribution towards improving the effectiveness of deep ranking models is the Duet principle [Mitra et al., 2017], which emphasizes the importance of incorporating evidence based on both patterns of exact term matches and similarities between learned latent representations of query and document. To retrieve efficiently from large collections, we develop a framework that incorporates query term independence [Mitra et al., 2019] into any arbitrary deep model, enabling large-scale precomputation and the use of an inverted index for fast retrieval. In the context of stochastic ranking, we further develop optimization strategies for exposure-based objectives [Diaz et al., 2020]. Finally, this dissertation also summarizes our contributions towards benchmarking neural IR models in the presence of large training datasets [Craswell et al., 2019] and explores the application of neural methods to other IR tasks, such as query auto-completion.
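The query term independence idea can be illustrated in a few lines: if the model's score decomposes additively over query terms, per term-document scores can be precomputed offline, stored in an inverted index, and summed at query time. The scores below are invented; in the thesis they would come from a deep ranking model.

```python
# Toy illustration of additive scoring over query terms. The per-term impact
# scores are invented; a deep model would normally produce them offline.
from collections import defaultdict

# Offline: precomputed term-document impact scores (term -> {doc_id: score})
inverted_index = {
    "neural":    {"d1": 1.2, "d3": 0.4},
    "retrieval": {"d1": 0.9, "d2": 1.5},
    "bm25":      {"d2": 0.7},
}

def score(query_terms, index):
    """Sum per-term impacts per document, as query term independence assumes."""
    totals = defaultdict(float)
    for term in query_terms:
        for doc_id, impact in index.get(term, {}).items():
            totals[doc_id] += impact
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

print(score(["neural", "retrieval"], inverted_index))
# -> [('d1', 2.1), ('d2', 1.5), ('d3', 0.4)]
```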


Technologies
2020
Vol 9 (1)
pp. 2
Author(s):
Ashish Jaiswal
Ashwin Ramesh Babu
Mohammad Zaki Zadeh
Debapriya Banerjee
Fillia Makedon

Self-supervised learning has gained popularity because of its ability to avoid the cost of annotating large-scale datasets. It is capable of adopting self-defined pseudo-labels as supervision and using the learned representations for several downstream tasks. Specifically, contrastive learning has recently become a dominant component in self-supervised learning for computer vision, natural language processing (NLP), and other domains. It aims at embedding augmented versions of the same sample close to each other while trying to push away embeddings of different samples. This paper provides an extensive review of self-supervised methods that follow the contrastive approach. The work explains commonly used pretext tasks in a contrastive learning setup, followed by the different architectures that have been proposed so far. Next, we present a performance comparison of different methods on multiple downstream tasks such as image classification, object detection, and action recognition. Finally, we conclude with the limitations of the current methods and the need for further techniques and future directions to make meaningful progress.
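A minimal NumPy sketch of an InfoNCE-style contrastive objective is given below; randomly generated embeddings stand in for an encoder's outputs, and the loss is computed in one direction only, which is a simplification of the batch-wise NT-Xent losses used in practice.

```python
# Sketch under assumptions: random vectors stand in for encoder outputs, and a
# one-directional InfoNCE loss is used instead of the full symmetric NT-Xent.
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two views per sample."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature           # similarity of every view-1 to every view-2
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))      # positives sit on the diagonal

rng = np.random.default_rng(0)
batch, dim = 8, 32
z = rng.normal(size=(batch, dim))
noisy_view = z + 0.05 * rng.normal(size=(batch, dim))   # a crude stand-in "augmentation"
print("loss for matched views:  ", round(info_nce(z, noisy_view), 3))
print("loss for unrelated views:", round(info_nce(z, rng.normal(size=(batch, dim))), 3))
```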


Author(s):  
Ekaterina Kochmar
Dung Do Vu
Robert Belfer
Varun Gupta
Iulian Vlad Serban
...  

Abstract Intelligent tutoring systems (ITS) have been shown to be highly effective at promoting learning compared to other computer-based instructional approaches. However, many ITS rely heavily on expert design and hand-crafted rules. This makes them difficult to build and transfer across domains and limits their potential efficacy. In this paper, we investigate how feedback in a large-scale ITS can be automatically generated in a data-driven way, and more specifically how personalization of feedback can lead to improvements in student performance outcomes. First, we propose a machine learning approach to generate personalized feedback in an automated way, which takes the individual needs of students into account while alleviating the need for expert intervention and the design of hand-crafted rules. We leverage state-of-the-art machine learning and natural language processing techniques to provide students with personalized feedback using hints and Wikipedia-based explanations. Second, we demonstrate that personalized feedback leads to improved success rates at solving exercises in practice: our personalized feedback model is used in a large-scale dialogue-based ITS with around 20,000 students, launched in 2019. We present the results of experiments with students and show that the automated, data-driven, personalized feedback leads to a significant overall improvement of 22.95% in student performance outcomes and substantial improvements in the subjective evaluation of the feedback.
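Purely as a schematic of data-driven feedback personalization (not the system's actual model), the sketch below trains a classifier on invented past interactions and then ranks candidate hints for a new student by the predicted probability that the student solves the exercise after seeing each hint.

```python
# Schematic only: a classifier over invented interaction features ranks
# candidate hints by predicted probability of subsequent success.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Past interactions: [student_skill, attempts_so_far, hint_specificity] -> solved?
X = np.array([[0.2, 3, 0.9], [0.8, 1, 0.2], [0.4, 2, 0.7],
              [0.9, 1, 0.8], [0.1, 4, 0.3], [0.6, 2, 0.5]])
y = np.array([1, 1, 1, 1, 0, 0])
model = LogisticRegression().fit(X, y)

# At runtime: score every candidate hint for the current student
student = {"skill": 0.3, "attempts": 2}
candidate_hints = {"generic nudge": 0.2, "targeted hint": 0.6, "worked example": 0.9}
scores = {
    hint: model.predict_proba([[student["skill"], student["attempts"], spec]])[0, 1]
    for hint, spec in candidate_hints.items()
}
print(max(scores, key=scores.get), scores)
```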

