The impact of differences in text segmentation on the automated quantitative evaluation of song-lyrics

PLoS ONE
2020
Vol 15 (11)
pp. e0241979
Author(s):
Friederike Tegge
Katharina Parry

Text-evaluation applications such as Coh-Metrix and other natural language processing tools rely on the sentence as the unit of text segmentation and analysis, and frequently detect sentence boundaries by means of punctuation. Problems arise when target texts such as pop song lyrics do not follow formal standards of written text composition and lack punctuation in the original. In such cases it is common for human transcribers to prepare texts for analysis, often following unspecified or at least unreported rules of text normalization and relying on an assumed shared understanding of the sentence as a text-structural unit. This study investigated whether the use of different transcribers to insert typographical symbols into song lyrics during the pre-processing of textual data can result in significant differences in sentence delineation. Results indicate that different transcribers (following commonly agreed-upon rules of punctuation based on their extensive experience with language and writing as language professionals) can produce differences in sentence segmentation. This affects the analysis results for at least some Coh-Metrix measures and highlights the problem of transcription, with potential consequences for quantification at and above the sentence level. It is argued that when analyzing non-traditional written texts or transcripts of spoken language, uniform text interpretation and segmentation during pre-processing cannot be assumed. It is advisable to provide clear rules for text normalization at the pre-processing stage, and to make these explicit in documentation and publication.
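The segmentation sensitivity the study describes is easy to reproduce. In this minimal sketch, the lyric transcriptions and the punctuation-based splitting rule are invented for illustration (they are not the study's data or Coh-Metrix's actual segmenter); two transcribers' punctuation choices for the same line yield different sentence counts:

```python
import re

def segment_sentences(text):
    """Split text into sentences at ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# The same lyric, punctuated by two hypothetical transcribers.
transcriber_a = "I see you. You see me. We dance all night."
transcriber_b = "I see you, you see me, we dance all night."

count_a = len(segment_sentences(transcriber_a))  # 3 sentences
count_b = len(segment_sentences(transcriber_b))  # 1 sentence
```

Any sentence-level measure (mean sentence length, connectives per sentence, etc.) computed downstream of this split will differ between the two transcriptions, which is precisely the effect the study quantifies.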

Author(s):  
Andrei Mikheev

This article discusses electronic text as, at bottom, a sequence of characters. Text needs to be segmented at least into linguistic units such as words, punctuation, numbers, and alphanumerics; this process is called tokenization. Most natural language processing techniques also require text to be segmented into sentences. The article briefly reviews evaluation metrics and standard resources commonly used for text segmentation tasks. Writing systems in which tokens are attached directly to each other, such as those using pictographic characters, present substantial challenges for computational analysis, and the article outlines various computational approaches to tackling them in different languages. Its focus is on low-level tasks such as tokenization and sentence segmentation.
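As a minimal illustration of the tokenization task (the regex rule and example are ours, not the article's), even a simple tokenizer that separates runs of word characters from punctuation exposes the ambiguities involved, such as abbreviation periods and decimal points:

```python
import re

def tokenize(text):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Dr. Smith paid $12.50 for it.")
# The abbreviation period and the decimal point are both split off,
# so "Dr." and "12.50" are not kept as single tokens:
# ['Dr', '.', 'Smith', 'paid', '$', '12', '.', '50', 'for', 'it', '.']
```

A practical tokenizer needs additional rules (abbreviation lists, number patterns) to decide which periods end sentences and which do not, which is exactly why tokenization and sentence segmentation interact.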


Author(s):  
Mohammad Al Smadi
Islam Obaidat
Mahmoud Al-Ayyoub
Rami Mohawesh
Yaser Jararweh

Sentiment Analysis (SA) is the process of determining whether the sentiment of a text written in a natural language is positive, negative or neutral. It is one of the most interesting subfields of natural language processing (NLP) and Web mining due to its diverse applications and the challenges associated with applying it to the massive amounts of textual data available online (especially on social networks). Most current work on SA focuses on the English language and works at the sentence level or the document level. This work focuses on a less studied variant of SA, aspect-based SA (ABSA), for the Arabic language. Specifically, it considers two ABSA tasks: aspect category determination and aspect category polarity determination, and makes use of the publicly available human annotated Arabic dataset (HAAD) along with the baseline experiments conducted by the HAAD providers. Several lexicon-based approaches are presented for the two tasks at hand, and some of the presented approaches significantly outperform the best-known results on the given dataset. Enhancements of 9% and 46% were achieved on the aspect category determination and aspect category polarity determination tasks, respectively.
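A lexicon-based approach of the kind the paper evaluates can be sketched as follows. The toy English lexicon and aspect keyword sets here are invented for illustration (the paper works on Arabic with the HAAD dataset): aspect categories are detected by keyword match, and polarity is scored from sentiment-word counts.

```python
# Toy lexicons, invented for illustration only.
POSITIVE = {"good", "great", "delicious", "friendly"}
NEGATIVE = {"bad", "slow", "rude", "cold"}
ASPECT_KEYWORDS = {
    "food": {"meal", "dish", "delicious", "cold"},
    "service": {"waiter", "staff", "friendly", "slow", "rude"},
}

def aspect_polarity(sentence):
    """Detect aspect categories by keyword overlap, then score polarity."""
    words = set(sentence.lower().split())
    result = {}
    for aspect, keys in ASPECT_KEYWORDS.items():
        if words & keys:  # aspect category detected in this sentence
            score = len(words & POSITIVE) - len(words & NEGATIVE)
            result[aspect] = ("positive" if score > 0
                              else "negative" if score < 0 else "neutral")
    return result

food = aspect_polarity("the dish was delicious")      # {"food": "positive"}
service = aspect_polarity("the staff was rude")       # {"service": "negative"}
```

Real lexicon-based ABSA adds morphological normalization (essential for Arabic) and scope rules so that sentiment words are attributed to the right aspect within a sentence.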


AERA Open
2021
Vol 7
pp. 233285842110286
Author(s):
Kylie L. Anglin
Vivian C. Wong
Arielle Boguslav

Though there is widespread recognition of the importance of implementation research, evaluators often face intense logistical, budgetary, and methodological challenges in their efforts to assess intervention implementation in the field. This article proposes a set of natural language processing techniques called semantic similarity as an innovative and scalable method of measuring implementation constructs. Semantic similarity methods are an automated approach to quantifying the similarity between texts. By applying semantic similarity to transcripts of intervention sessions, researchers can use the method to determine whether an intervention was delivered with adherence to a structured protocol, and the extent to which an intervention was replicated with consistency across sessions, sites, and studies. This article provides an overview of semantic similarity methods, describes their application within the context of educational evaluations, and provides a proof of concept using an experimental study of the impact of a standardized teacher coaching intervention.
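At its core, semantic similarity means comparing vector representations of two texts. The article's methods use richer representations, but a bag-of-words cosine similarity (the protocol and transcript strings below are invented for illustration) shows the basic idea of scoring a session transcript against a structured protocol:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Hypothetical protocol text and two session transcripts.
protocol = "praise the student then model the strategy"
session = "praise the student and then model the strategy together"
off_script = "discuss the weather and collect homework"

# A session adhering to the protocol scores higher than one that drifted.
adherent = cosine_similarity(protocol, session) > cosine_similarity(protocol, off_script)
```

In practice, embedding-based similarity (rather than raw word counts) lets the score credit paraphrases of the protocol, which is what makes the method useful for measuring adherence at scale.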


2021
Vol 13 (7)
pp. 4043
Author(s):
Jesús López Baeza
Jens Bley
Kay Hartkopf
Martin Niggemann
James Arias
...  

The research presented in this paper describes an evaluation of the impact of spatial interventions in public spaces, measured by social media data. This contribution aims at observing the way a spatial intervention in an urban location can affect what people talk about on social media. The test site for our research is Domplatz in the center of Hamburg, Germany. In recent years, several actions have taken place there, intending to attract social activity and spotlight the square as a landmark of cultural discourse in the city of Hamburg. To evaluate the impact of this strategy, textual data from the social networks Twitter and Instagram (i.e., tweets and image captions) are collected and analyzed using Natural Language Processing techniques. These analyses identify and track the cultural topic or “people talking about culture” in the city of Hamburg. We observe the evolution of the cultural topic, and its potential correspondence in levels of activity, with certain intervention actions carried out in Domplatz. Two analytic methods of topic clustering and tracking are tested. The results show a successful topic identification and tracking with both methods, the second one being more accurate. This means that it is possible to isolate and observe the evolution of the city’s cultural discourse using NLP. However, it is shown that the effects of spatial interventions in our small test square have a limited local scale, rather than a city-wide relevance.


Author(s):  
Irina Wedel
Michael Palk
Stefan Voß

Abstract: Social media enable companies to assess consumers’ opinions, complaints and needs. The systematic and data-driven analysis of social media to generate business value is summarized under the term Social Media Analytics, which includes statistical, network-based and language-based approaches. We focus on textual data and investigate which conversation topics arise on Twitter during the introduction of a new product, and how the overall sentiment develops during and after the event. The analysis via Natural Language Processing tools is conducted in two languages and four different countries, such that cultural differences in tonality and customer needs can be identified for the product. Different methods of sentiment analysis and topic modeling are compared to assess their usability on social media data and in the respective languages, English and German. Furthermore, we illustrate the importance of preprocessing steps when applying these methods and identify relevant product insights.
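The preprocessing the authors emphasize typically covers lowercasing, URL removal, hashtag and mention handling, and punctuation stripping before any sentiment or topic model is applied. A minimal sketch (the cleanup rules and example tweet are ours, not the paper's pipeline):

```python
import re

def preprocess_tweet(text):
    """Typical cleanup applied before sentiment analysis or topic modeling."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # strip URLs
    text = re.sub(r"[@#](\w+)", r"\1", text)   # unwrap @mentions and #hashtags
    text = re.sub(r"[^\w\s]", "", text)        # drop remaining punctuation
    return " ".join(text.split())              # normalize whitespace

cleaned = preprocess_tweet("Loving the new #ProductX from @BrandCo!! https://t.co/abc")
# -> "loving the new productx from brandco"
```

The order of steps matters: hashtags must be unwrapped before the punctuation pass, or the topic-bearing tag words would be lost, which is one concrete way preprocessing choices change downstream topic and sentiment results.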


Proceedings
2021
Vol 77 (1)
pp. 17
Author(s):  
Andrea Giussani

In the last decade, advances in statistical modeling and computer science have boosted the production of machine-generated content in different fields: from language to image generation, the quality of the generated outputs is remarkably high, sometimes better than what a human being produces. Modern technological advances such as OpenAI’s GPT-2 (and recently GPT-3) permit automated systems to dramatically alter reality with synthetic outputs, such that humans are not able to distinguish genuine content from its synthetic counterpart. An example is given by an article entirely written by GPT-2, but many other examples exist. In the field of computer vision, Nvidia’s Generative Adversarial Network, commonly known as StyleGAN (Karras et al. 2018), has become the de facto reference point for the production of a huge number of fake human face portraits; additionally, recent algorithms have been developed to create both musical scores and mathematical formulas. This presentation aims to acquaint participants with the state-of-the-art results in this field: we will cover both GANs and language modeling with recent applications. The novelty here is that we apply a transformer-based machine learning technique, namely RoBERTa (Liu et al. 2019), to the detection of human-produced versus machine-produced text in the context of fake news detection. RoBERTa is a recent algorithm based on the well-known Bidirectional Encoder Representations from Transformers algorithm, known as BERT (Devlin et al. 2018); this is a bi-directional transformer for natural language processing developed by Google and pre-trained over a huge amount of unlabeled textual data to learn embeddings. We then use these representations as input to our classifier to detect real versus machine-produced text. The application is demonstrated in the presentation.
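The pipeline described (a pretrained encoder producing embeddings, followed by a classifier) can be caricatured without any transformer at all. Here a toy "embedding" of crude surface statistics and hypothetical class centroids stand in for RoBERTa representations and a trained classifier; the feature choices and centroid values are invented, not from the presentation:

```python
def embed(text):
    """Toy stand-in for a transformer embedding: crude surface statistics."""
    words = text.split()
    avg_len = sum(len(w) for w in words) / len(words)  # mean word length
    ttr = len(set(words)) / len(words)                 # type-token ratio
    return (avg_len, ttr)

def nearest_centroid(vec, centroids):
    """Classify by squared distance to the closest class centroid."""
    return min(centroids,
               key=lambda lbl: sum((a - b) ** 2 for a, b in zip(vec, centroids[lbl])))

# Hypothetical centroids, as if learned from labeled human/machine text.
centroids = {"human": (4.2, 0.85), "machine": (5.1, 0.60)}
label = nearest_centroid(embed("the quick brown fox jumps over the lazy dog"), centroids)
```

The actual approach replaces both halves with learned components: contextual embeddings from the pretrained RoBERTa encoder, and a classification head fine-tuned on labeled human versus machine text.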


Author(s):  
Dang Van Thin
Ngan Luu-Thuy Nguyen
Tri Minh Truong
Lac Si Le
Duy Tin Vo

Aspect-based sentiment analysis has been studied in both the research and industrial communities over recent years. For low-resource languages, standard benchmark corpora play an important role in the development of methods. In this article, we introduce the two largest sentence-level benchmark corpora for two tasks in Vietnamese: Aspect Category Detection and Aspect Polarity Classification. Our corpora are annotated with high inter-annotator agreement for the restaurant and hotel domains. The release of our corpora should push forward the low-resource language processing community. In addition, we deploy and compare the effectiveness of supervised learning methods with single-task and multi-task approaches based on deep learning architectures. Experimental results on our corpora show that the multi-task approach based on the BERT architecture outperforms the neural network architectures and the single-task approach. Our corpora and source code are publicly released.


Author(s):  
Necva Bölücü
Burcu Can

Part-of-speech (PoS) tagging is one of the fundamental syntactic tasks in Natural Language Processing: it assigns a syntactic category (such as noun, verb, or adjective) to each word within a given sentence or context. Those syntactic categories can be used to further analyze sentence-level syntax (e.g., dependency parsing) and thereby extract the meaning of the sentence (e.g., semantic parsing). Various methods have been proposed for learning PoS tags in an unsupervised setting, without using any annotated corpora. One widely used family of methods for the tagging problem is log-linear models, for which parameter initialization is crucial for inference; different initialization techniques have been used so far. In this work, we present a log-linear model for PoS tagging that uses another, fully unsupervised Bayesian model to initialize its parameters in a cascaded framework. We thereby transfer knowledge between two different unsupervised models to improve the PoS tagging results, with the log-linear model benefiting from the Bayesian model’s expertise. We present results for Turkish, a morphologically rich language, and for English, a comparatively morphologically poor language, in a fully unsupervised framework. The results show that our framework outperforms other unsupervised models proposed for PoS tagging.
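The cascaded initialization can be sketched as follows. Here the tag-word counts stand in for the output of a hypothetical first-stage unsupervised Bayesian tagger (the tag names and numbers are invented); they are converted to log relative frequencies that seed the log-linear model's emission feature weights instead of a random or uniform start:

```python
import math
from collections import defaultdict

# Invented counts of (induced tag, word) pairs from a first-stage Bayesian model.
bayes_counts = {
    ("T1", "the"): 90, ("T1", "a"): 60,
    ("T2", "dog"): 40, ("T2", "cat"): 35,
    ("T2", "the"): 5,
}

def init_loglinear_weights(counts):
    """Seed word-given-tag feature weights with log relative frequencies."""
    tag_totals = defaultdict(int)
    for (tag, _), c in counts.items():
        tag_totals[tag] += c
    return {(tag, word): math.log(c / tag_totals[tag])
            for (tag, word), c in counts.items()}

weights = init_loglinear_weights(bayes_counts)
# e.g. weights[("T1", "the")] == log(90 / 150)
```

Starting the log-linear model's search from these informed weights, rather than from an arbitrary point, is what lets it inherit the Bayesian model's induced structure before its own training refines it.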


2021
Author(s):
Christopher Marshall
Kate Lanyi
Rhiannon Green
Georgie Wilkins
Fiona Pearson
...  

BACKGROUND There is increasing need to explore the value of soft intelligence, leveraged using the latest artificial intelligence (AI) and natural language processing (NLP) techniques, as a source of analysed evidence to support public health research activity and decision-making. OBJECTIVE The aim of this study was to further explore the value of soft intelligence analysed using AI through a case study that examined a large collection of UK tweets relating to mental health during the COVID-19 pandemic. METHODS A search strategy comprising a list of terms related to mental health, COVID-19, and lockdown restrictions was developed to prospectively collate relevant tweets via Twitter’s advanced search application programming interface over a 24-week period. We deployed a specialist NLP platform to explore tweet frequency and sentiment across the UK and to identify key topics of discussion. A series of keyword filters was used to clean the initially retrieved data and to track specific mental health problems. Qualitative document analysis was carried out to further explore and expand upon the results generated by the NLP platform. All collated tweets were anonymised. RESULTS We identified and analysed 286,902 tweets posted from UK user accounts from 23 July 2020 to 6 January 2021. The average sentiment score was 50%, suggesting overall neutral sentiment across all tweets over the study period. Major fluctuations in volume and sentiment appeared to coincide with key changes to local and/or national social-distancing measures. Tweets around mental health were polarising, discussed with both positive and negative sentiment. Key topics of consistent discussion over the study period included the impact of the pandemic on people’s mental health (both positive and negative), fear and anxiety over lockdowns, and anger and mistrust toward the government.
CONCLUSIONS Through the primary use of an AI-based NLP platform, we were able to rapidly mine and analyse emerging health-related insights from UK tweets into how the pandemic may be impacting people’s mental health and well-being. This type of real-time analysed evidence could act as a useful intelligence source that agencies, local leaders, and health care decision makers can potentially draw from, particularly during a health crisis.
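The keyword-filtering step in the methods can be illustrated with a toy version; the tracked terms and example tweets below are invented (the study itself used a specialist NLP platform fed by Twitter's search API):

```python
# Invented tracked terms and tweets, for illustration only.
MENTAL_HEALTH_TERMS = {"anxiety", "depression", "lockdown", "wellbeing"}

tweets = [
    "lockdown is really hurting my anxiety",
    "great weather today",
    "meditation is helping my wellbeing during lockdown",
]

def filter_tweets(tweets, terms):
    """Keep only tweets that mention at least one tracked term."""
    return [t for t in tweets if terms & set(t.lower().split())]

relevant = filter_tweets(tweets, MENTAL_HEALTH_TERMS)  # drops the weather tweet
```

Filters like this both clean the collated stream (discarding off-topic posts) and partition it into per-condition subsets whose volume and sentiment can be tracked over time.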


Circulation
2017
Vol 135 (suppl_1)
Author(s):
Elizabeth J Bell
Jennifer L St. Sauver
Veronique L Roger
Nicholas B Larson
Hongfang Liu
...  

Introduction: Proton pump inhibitors (PPIs) are used by an estimated 29 million Americans. PPIs increase the levels of asymmetrical dimethylarginine, a known risk factor for cardiovascular disease (CVD). Data from a select population of patients with CVD suggest that PPI use is associated with an increased risk of stroke, heart failure, and coronary heart disease. The impact of PPI use on incident CVD is largely unknown in the general population. Hypothesis: We hypothesized that PPI users have a higher risk of incident total CVD, coronary heart disease, stroke, and heart failure compared to nonusers. To demonstrate specificity of association, we additionally hypothesized that there is not an association between use of H2-blockers (another commonly used class of medications with similar indications as PPIs) and CVD. Methods: We used the Rochester Epidemiology Project’s medical records-linkage system to identify all residents of Olmsted County, MN on our baseline date of January 1, 2004 (N=140,217). We excluded persons who did not grant permission for their records to be used for research, were <18 years old, had a history of CVD, had missing data for any variable included in our model, or had evidence of PPI use within the previous year. We followed our final cohort (N=58,175) for up to 12 years. The administrative censoring date was 1/20/2014 for CVD, 8/3/2016 for coronary heart disease, 9/9/2016 for stroke, and 1/20/2014 for heart failure. Time-varying PPI ever-use was ascertained using 1) natural language processing to capture unstructured text from the electronic health record, and 2) outpatient prescriptions. An incident CVD event was defined as the first occurrence of 1) validated heart failure, 2) validated coronary heart disease, or 3) stroke, defined using diagnostic codes only. As a secondary analysis, we calculated the association between time-varying H2-blocker ever-use and CVD among persons not using H2-blockers at baseline.
Results: After adjustment for age, sex, race, education, hypertension, hyperlipidemia, diabetes, and body mass index, PPI use was associated with an approximately 50% higher risk of CVD (hazard ratio [95% CI]: 1.51 [1.37-1.67]; 2187 CVD events), stroke (hazard ratio [95% CI]: 1.49 [1.35-1.65]; 1928 stroke events), and heart failure (hazard ratio [95% CI]: 1.56 [1.23-1.97]; 353 heart failure events) compared to nonusers. Users of PPIs had a 35% greater risk of coronary heart disease than nonusers (95% CI: 1.13-1.61; 626 coronary heart disease events). Use of H2-blockers was also associated with a higher risk of CVD (adjusted hazard ratio [95% CI]: 1.23 [1.08-1.41]; 2331 CVD events). Conclusions: PPI use is associated with a higher risk of CVD, coronary heart disease, stroke and heart failure. Use of a drug with no known cardiac toxicity, H2-blockers, was also associated with a greater risk of CVD, warranting further study.

