Automated Text Classification of News Articles: A Practical Guide

2020 ◽  
Vol 29 (1) ◽  
pp. 19-42 ◽  
Author(s):  
Pablo Barberá ◽  
Amber E. Boydstun ◽  
Suzanna Linn ◽  
Ryan McMahon ◽  
Jonathan Nagler

Automated text analysis methods have made possible the classification of large corpora of text by measures such as topic and tone. Here, we provide a guide to help researchers navigate the consequential decisions they need to make before any measure can be produced from the text. We consider, both theoretically and empirically, the effects of such choices using as a running example efforts to measure the tone of New York Times coverage of the economy. We show that two reasonable approaches to corpus selection yield radically different corpora and we advocate for the use of keyword searches rather than predefined subject categories provided by news archives. We demonstrate the benefits of coding using article segments instead of sentences as units of analysis. We show that, given a fixed number of codings, it is better to increase the number of unique documents coded rather than the number of coders for each document. Finally, we find that supervised machine learning algorithms outperform dictionaries on a number of criteria. Overall, we intend this guide to serve as a reminder to analysts that thoughtfulness and human validation are key to text-as-data methods, particularly in an age when it is all too easy to computationally classify texts without attending to the methodological choices therein.
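The dictionary-versus-supervised comparison described above can be sketched in miniature. This is an illustrative toy, not the authors' pipeline: the word lists, headlines, and labels below are invented stand-ins for a real tone lexicon and a hand-coded training set.

```python
# Sketch: contrast a simple tone dictionary with a supervised classifier.
# Word lists and headlines are illustrative, not the authors' lexicons.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

POSITIVE = {"growth", "gains", "recovery", "surge"}
NEGATIVE = {"recession", "losses", "slump", "layoffs"}

def dictionary_tone(text):
    """Score tone as (# positive hits) - (# negative hits)."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Tiny hand-labeled corpus (1 = positive tone, 0 = negative tone).
docs = [
    "economy shows strong growth and gains",
    "markets surge on recovery hopes",
    "recession fears deepen as losses mount",
    "layoffs rise amid prolonged slump",
]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)

new_doc = "gains continue as recovery strengthens"
print(dictionary_tone(new_doc))               # dictionary tone score
print(clf.predict(vec.transform([new_doc])))  # supervised prediction
```

The supervised model learns tone cues from the labeled examples rather than a fixed word list, which is one reason such classifiers can outperform dictionaries on out-of-lexicon vocabulary.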

Author(s):  
Miklos Sebők ◽  
Zoltán Kacsuk ◽  
Ákos Máté

The classification of the items of ever-increasing textual databases has become an important goal for a number of research groups active in the field of computational social science. Due to the increased amount of text data, there is a growing number of use cases where the initial effort of human classifiers was successfully augmented using supervised machine learning (SML). In this paper, we investigate such a hybrid workflow solution, classifying the lead paragraphs of New York Times front-page articles from 1996 to 2006 according to policy topic categories (such as education or defense) of the Comparative Agendas Project (CAP). The SML classification is conducted in multiple rounds and, within each round, we run the SML algorithm on n samples, and n times per sample if the given algorithm is non-deterministic (e.g., SVM). If all the SML predictions point towards a single label for a document, then it is classified as such (an approach also called a "voting ensemble"). In the second step, we explore several scenarios, ranging from using the SML ensemble without human validation to incorporating active learning. Using these scenarios, we can quantify the gains from the various workflow versions. We find that human coding and validation combined with an ensemble SML hybrid approach can reduce the need for human coding while maintaining very high precision rates and offering a modest to good level of recall. The modularity of this hybrid workflow allows for various setups to address the idiosyncratic resource bottlenecks that a large-scale text classification project might face.


2020 ◽  
pp. 1-26
Author(s):  
Joshua Eykens ◽  
Raf Guns ◽  
Tim C.E. Engels

We compare two supervised machine learning algorithms—Multinomial Naïve Bayes and Gradient Boosting—to classify social science articles using textual data. The high level of granularity of the classification scheme used and the possibility that multiple categories are assigned to a document make this task challenging. To collect the training data, we query three discipline-specific thesauri to retrieve articles corresponding to specialties in the classification. The resulting dataset consists of 113,909 records and covers 245 specialties, aggregated into 31 subdisciplines across three disciplines. Experts were consulted to validate the thesauri-based classification. The resulting multi-label dataset is used to train the machine learning algorithms in different configurations. We deploy a multi-label classifier chain model, allowing an arbitrary number of categories to be assigned to each document. The best results are obtained with Gradient Boosting. The approach does not rely on citation data, so it can be applied in settings where such information is not available. We conclude that fine-grained text-based classification of social science publications at a subdisciplinary level is a hard task, for humans and machines alike. A combination of human expertise and machine learning is suggested as a way forward to improve the classification of social science documents.
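The classifier-chain setup with Gradient Boosting base learners can be sketched on toy data. The abstracts and the three labels below are invented stand-ins for the 245-specialty scheme; scikit-learn's `ClassifierChain` is used as one standard implementation of the technique.

```python
# Sketch: a classifier chain with Gradient Boosting base learners,
# letting each document carry any number of labels. Toy data only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import ClassifierChain

docs = ["survey experiment on voter turnout and party identification",
        "ethnographic fieldwork on urban migration networks",
        "panel regression of unemployment and electoral behaviour",
        "interviews on migration, identity and labour markets"]
# Label columns: political_science, sociology, economics (multi-label).
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

chain = ClassifierChain(GradientBoostingClassifier(n_estimators=20))
chain.fit(X, Y)

pred = chain.predict(vec.transform(
    ["regression analysis of turnout and unemployment"]).toarray())
print(pred)  # one row of 0/1 flags, one per label
```

Each classifier in the chain sees the predictions of the previous ones as extra features, which is how the chain model captures dependencies between co-assigned categories.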


Sensors ◽  
2021 ◽  
Vol 21 (3) ◽  
pp. 778
Author(s):  
Nitsa J. Herzog ◽  
George D. Magoulas

Early identification of degenerative processes in the human brain is considered essential for providing proper care and treatment. This may involve detecting structural and functional cerebral changes, such as changes in the degree of asymmetry between the left and right hemispheres. Such changes can be detected by computational algorithms, used for the early diagnosis of dementia and its stages (amnestic early mild cognitive impairment (EMCI), Alzheimer’s Disease (AD)), and can help to monitor the progress of the disease. In this vein, the paper proposes a data processing pipeline that can be implemented on commodity hardware. It uses features of brain asymmetry, extracted from MRI scans in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database, for the analysis of structural changes and machine learning classification of the pathology. The experiments provide promising results, distinguishing between subjects with normal cognition (NC) and patients with early or progressive dementia. The supervised machine learning algorithms and convolutional neural networks tested reach accuracies of 92.5% and 75.0% for NC vs. EMCI, and 93.0% and 90.5% for NC vs. AD, respectively. The proposed pipeline offers a promising low-cost alternative for the classification of dementia and can potentially be useful for other degenerative brain disorders that are accompanied by changes in brain asymmetry.
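The asymmetry-feature idea can be sketched without real imaging data. Everything below is synthetic: random arrays stand in for registered MRI slices, a unilateral intensity bump stands in for pathology, and registration, skull stripping, and the paper's CNN variant are omitted.

```python
# Sketch: mirror each (synthetic) midline-aligned slice, take left-right
# intensity differences as features, and classify with an SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)

def asymmetry_features(slice_2d):
    """Mean absolute left-right difference per row of a midline-aligned slice."""
    half = slice_2d.shape[1] // 2
    left = slice_2d[:, :half]
    right = np.fliplr(slice_2d)[:, :half]  # mirrored right hemisphere
    return np.abs(left - right).mean(axis=1)

def make_slice(asymmetric):
    """Simulated 32x32 slice; 'patients' get a unilateral intensity bump."""
    s = rng.normal(0, 1, (32, 32))
    if asymmetric:
        s[8:16, 2:10] += 3.0  # lesion-like change in one hemisphere
    return s

X = np.array([asymmetry_features(make_slice(a)) for a in [0] * 20 + [1] * 20])
y = [0] * 20 + [1] * 20

clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))  # training accuracy on the synthetic data
```

The point of the feature map is dimensionality reduction: a 32x32 slice collapses to 32 row-wise asymmetry scores, which is what makes a commodity-hardware pipeline of this kind cheap to train.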


2020 ◽  
Author(s):  
Eoin Carley

Solar flares are often associated with high-intensity radio emission known as 'solar radio bursts' (SRBs). SRBs are generally observed in dynamic spectra and fall into five major spectral classes, labelled type I to type V depending on their shape and extent in frequency and time. Due to their morphological complexity, a challenge in solar radio physics is the automatic detection and classification of such radio bursts. Classification of SRBs has become necessary in recent years due to the large data rates (3 Gb/s) generated by advanced radio telescopes such as the Low Frequency Array (LOFAR). Here we test the ability of several supervised machine learning algorithms to automatically classify type II and type III solar radio bursts. We test the detection accuracy of support vector machines (SVM) and random forests (RF), as well as an implementation of transfer learning with the Inception and YOLO convolutional neural networks (CNNs). The training data was assembled from type II and III bursts observed by the Radio Solar Telescope Network (RSTN) from 1996 to 2018, supplemented by type II and III radio burst simulations. The CNNs were the best performers, often exceeding 90% accuracy on the validation set, with YOLO also able to perform radio burst localisation in dynamic spectra. This shows that machine learning algorithms (in particular CNNs) are capable of SRB classification, and we conclude by discussing future plans for the implementation of a CNN in the LOFAR for Space Weather (LOFAR4SW) data-stream pipelines.
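The non-CNN baseline can be sketched on simulated dynamic spectra. The toy images below mimic one distinguishing property only: type III bursts drift rapidly in frequency while type II bursts drift slowly. Real RSTN spectra and the Inception/YOLO transfer-learning models are beyond this sketch.

```python
# Sketch: classify toy dynamic spectra (time-frequency images) with a
# random forest; fast vs. slow frequency drift separates the two classes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def burst(drift):
    """32x32 time-frequency image: a drifting bright track plus noise."""
    img = rng.normal(0, 0.3, (32, 32))
    for t in range(32):
        f = int(min(31, drift * t))  # frequency channel lit at time t
        img[f, t] += 2.0
    return img.ravel()

X = np.array([burst(drift=0.25) for _ in range(30)] +   # slow drift: "type II"
              [burst(drift=1.0) for _ in range(30)])    # fast drift: "type III"
y = [2] * 30 + [3] * 30

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33,
                                      random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
print(clf.score(Xte, yte))  # held-out accuracy on the synthetic bursts
```

A CNN would replace the flattened-pixel features with learned convolutional ones, which is what gives the Inception and YOLO models their edge on real, morphologically complex spectra.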


2017 ◽  
Vol 4 (1) ◽  
pp. 56-74 ◽  
Author(s):  
Abinash Tripathy ◽  
Santanu Kumar Rath

Sentiment analysis helps determine the hidden intention of the author of a document on any topic and provides an evaluation of the document's polarity, which may be positive, negative, or neutral. Very often, the data associated with sentiment analysis consists of feedback given by various specialists on a topic or product. Such reviews can be categorized into classes based on polarity in order to gain good knowledge about the product. This article proposes an approach to classify a review dataset into different polarity groups on the basis of sentiment analysis. Four machine learning algorithms, viz. Naive Bayes (NB), Support Vector Machine (SVM), Random Forest, and Linear Discriminant Analysis (LDA), are considered for the classification process. The accuracy values obtained by the algorithms are critically examined using different performance parameters, applied to two different datasets.
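Two of the four classifiers named above can be compared in a minimal sketch. The review texts and polarity labels are invented; the real datasets and the Random Forest and LDA variants are omitted.

```python
# Sketch: compare Naive Bayes and SVM polarity classifiers on a tiny
# invented review set (1 = positive, 0 = negative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

reviews = ["excellent product works perfectly",
           "terrible quality broke immediately",
           "great value highly recommend",
           "awful experience waste of money",
           "love it excellent and reliable",
           "poor build terrible support"]
polarity = [1, 0, 1, 0, 1, 0]

preds = {}
for name, clf in [("NB", MultinomialNB()), ("SVM", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(reviews, polarity)
    preds[name] = model.predict(["excellent quality highly recommend"])[0]
    print(name, preds[name])
```

On real data the two models would be scored against held-out labels with accuracy, precision, and recall rather than eyeballed on a single review.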

