Automated Text Classification of News Articles: A Practical Guide

2020 ◽  
Vol 29 (1) ◽  
pp. 19-42 ◽  
Author(s):  
Pablo Barberá ◽  
Amber E. Boydstun ◽  
Suzanna Linn ◽  
Ryan McMahon ◽  
Jonathan Nagler

Automated text analysis methods have made possible the classification of large corpora of text by measures such as topic and tone. Here, we provide a guide to help researchers navigate the consequential decisions they need to make before any measure can be produced from the text. We consider, both theoretically and empirically, the effects of such choices using as a running example efforts to measure the tone of New York Times coverage of the economy. We show that two reasonable approaches to corpus selection yield radically different corpora and we advocate for the use of keyword searches rather than predefined subject categories provided by news archives. We demonstrate the benefits of coding using article segments instead of sentences as units of analysis. We show that, given a fixed number of codings, it is better to increase the number of unique documents coded rather than the number of coders for each document. Finally, we find that supervised machine learning algorithms outperform dictionaries on a number of criteria. Overall, we intend this guide to serve as a reminder to analysts that thoughtfulness and human validation are key to text-as-data methods, particularly in an age when it is all too easy to computationally classify texts without attending to the methodological choices therein.
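The dictionary-versus-supervised comparison described above can be sketched in miniature. This is an illustrative toy, not the authors' pipeline: the word lists, headlines, and labels below are invented stand-ins for a real tone lexicon and a hand-coded training set.

```python
# Sketch: contrast a simple tone dictionary with a supervised classifier.
# Word lists and headlines are illustrative, not the authors' lexicons.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

POSITIVE = {"growth", "gains", "recovery", "surge"}
NEGATIVE = {"recession", "losses", "slump", "layoffs"}

def dictionary_tone(text):
    """Score tone as (# positive hits) - (# negative hits)."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Tiny hand-labeled corpus (1 = positive tone, 0 = negative tone).
docs = [
    "economy shows strong growth and gains",
    "markets surge on recovery hopes",
    "recession fears deepen as losses mount",
    "layoffs rise amid prolonged slump",
]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)

new_doc = "gains continue as recovery strengthens"
print(dictionary_tone(new_doc))               # dictionary tone score
print(clf.predict(vec.transform([new_doc])))  # supervised prediction
```

The supervised model learns tone cues from the labeled examples rather than a fixed word list, which is one reason such classifiers can outperform dictionaries on out-of-lexicon vocabulary.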

Author(s):  
Miklos Sebők ◽  
Zoltán Kacsuk ◽  
Ákos Máté

The classification of the items of ever-increasing textual databases has become an important goal for a number of research groups active in the field of computational social science. Due to the increased amount of text data, there is a growing number of use cases where the initial effort of human classifiers was successfully augmented using supervised machine learning (SML). In this paper, we investigate such a hybrid workflow solution, classifying the lead paragraphs of New York Times front-page articles from 1996 to 2006 according to policy topic categories (such as education or defense) of the Comparative Agendas Project (CAP). The SML classification is conducted in multiple rounds and, within each round, we run the SML algorithm on n samples, and n times per sample if the given algorithm is non-deterministic (e.g., SVM). If all the SML predictions point towards a single label for a document, then it is classified as such (an approach also called a "voting ensemble"). In the second step, we explore several scenarios, ranging from using the SML ensemble without human validation to incorporating active learning. Using these scenarios, we can quantify the gains from the various workflow versions. We find that human coding and validation combined with an ensemble SML hybrid approach can reduce the need for human coding while maintaining very high precision rates and offering a modest to good level of recall. The modularity of this hybrid workflow allows for various setups to address the idiosyncratic resource bottlenecks that a large-scale text classification project might face.


2020 ◽  
pp. 1-26
Author(s):  
Joshua Eykens ◽  
Raf Guns ◽  
Tim C.E. Engels

We compare two supervised machine learning algorithms—Multinomial Naïve Bayes and Gradient Boosting—to classify social science articles using textual data. The high level of granularity of the classification scheme used and the possibility that multiple categories are assigned to a document make this task challenging. To collect the training data, we query three discipline-specific thesauri to retrieve articles corresponding to specialties in the classification. The resulting dataset consists of 113,909 records and covers 245 specialties, aggregated into 31 subdisciplines across three disciplines. Experts were consulted to validate the thesauri-based classification. The resulting multi-label dataset is used to train the machine learning algorithms in different configurations. We deploy a multi-label classifier chain model, allowing an arbitrary number of categories to be assigned to each document. The best results are obtained with Gradient Boosting. The approach does not rely on citation data, so it can be applied in settings where such information is not available. We conclude that fine-grained text-based classification of social science publications at a subdisciplinary level is a hard task, for humans and machines alike. A combination of human expertise and machine learning is suggested as a way forward to improve the classification of social science documents.
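The classifier-chain setup with Gradient Boosting base learners can be sketched on toy data. The abstracts and the three labels below are invented stand-ins for the 245-specialty scheme; scikit-learn's `ClassifierChain` is used as one standard implementation of the technique.

```python
# Sketch: a classifier chain with Gradient Boosting base learners,
# letting each document carry any number of labels. Toy data only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import ClassifierChain

docs = ["survey experiment on voter turnout and party identification",
        "ethnographic fieldwork on urban migration networks",
        "panel regression of unemployment and electoral behaviour",
        "interviews on migration, identity and labour markets"]
# Label columns: political_science, sociology, economics (multi-label).
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

chain = ClassifierChain(GradientBoostingClassifier(n_estimators=20))
chain.fit(X, Y)

pred = chain.predict(vec.transform(
    ["regression analysis of turnout and unemployment"]).toarray())
print(pred)  # one row of 0/1 flags, one per label
```

Each classifier in the chain sees the predictions of the previous ones as extra features, which is how the chain model captures dependencies between co-assigned categories.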


Sensors ◽  
2021 ◽  
Vol 21 (3) ◽  
pp. 778
Author(s):  
Nitsa J. Herzog ◽  
George D. Magoulas

Early identification of degenerative processes in the human brain is considered essential for providing proper care and treatment. This may involve detecting structural and functional cerebral changes, such as changes in the degree of asymmetry between the left and right hemispheres. Such changes can be detected by computational algorithms, used for the early diagnosis of dementia and its stages (amnestic early mild cognitive impairment (EMCI), Alzheimer’s Disease (AD)), and can help to monitor the progress of the disease. In this vein, the paper proposes a data processing pipeline that can be implemented on commodity hardware. It uses features of brain asymmetry, extracted from MRI scans in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database, for the analysis of structural changes and machine learning classification of the pathology. The experiments provide promising results, distinguishing between subjects with normal cognition (NC) and patients with early or progressive dementia. The supervised machine learning algorithms and convolutional neural networks tested reach accuracies of 92.5% and 75.0% for NC vs. EMCI, and 93.0% and 90.5% for NC vs. AD, respectively. The proposed pipeline offers a promising low-cost alternative for the classification of dementia and can potentially be useful for other degenerative brain disorders that are accompanied by changes in brain asymmetry.
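The asymmetry-feature idea can be sketched without real imaging data. Everything below is synthetic: random arrays stand in for registered MRI slices, a unilateral intensity bump stands in for pathology, and registration, skull stripping, and the paper's CNN variant are omitted.

```python
# Sketch: mirror each (synthetic) midline-aligned slice, take left-right
# intensity differences as features, and classify with an SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)

def asymmetry_features(slice_2d):
    """Mean absolute left-right difference per row of a midline-aligned slice."""
    half = slice_2d.shape[1] // 2
    left = slice_2d[:, :half]
    right = np.fliplr(slice_2d)[:, :half]  # mirrored right hemisphere
    return np.abs(left - right).mean(axis=1)

def make_slice(asymmetric):
    """Simulated 32x32 slice; 'patients' get a unilateral intensity bump."""
    s = rng.normal(0, 1, (32, 32))
    if asymmetric:
        s[8:16, 2:10] += 3.0  # lesion-like change in one hemisphere
    return s

X = np.array([asymmetry_features(make_slice(a)) for a in [0] * 20 + [1] * 20])
y = [0] * 20 + [1] * 20

clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))  # training accuracy on the synthetic data
```

The point of the feature map is dimensionality reduction: a 32x32 slice collapses to 32 row-wise asymmetry scores, which is what makes a commodity-hardware pipeline of this kind cheap to train.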


2020 ◽  
Author(s):  
Eoin Carley

Solar flares are often associated with high-intensity radio emission known as 'solar radio bursts' (SRBs). SRBs are generally observed in dynamic spectra and fall into five major spectral classes, labelled type I to type V depending on their shape and extent in frequency and time. Due to their morphological complexity, a challenge in solar radio physics is the automatic detection and classification of such radio bursts. Classification of SRBs has become necessary in recent years due to the large data rates (3 Gb/s) generated by advanced radio telescopes such as the Low Frequency Array (LOFAR). Here we test the ability of several supervised machine learning algorithms to automatically classify type II and type III solar radio bursts. We test the detection accuracy of support vector machines (SVM) and random forests (RF), as well as an implementation of transfer learning with the Inception and YOLO convolutional neural networks (CNNs). The training data was assembled from type II and III bursts observed by the Radio Solar Telescope Network (RSTN) from 1996 to 2018, supplemented by type II and III radio burst simulations. The CNNs were the best performers, often exceeding 90% accuracy on the validation set, with YOLO also able to perform radio burst localisation in dynamic spectra. This shows that machine learning algorithms (in particular CNNs) are capable of SRB classification, and we conclude by discussing future plans for the implementation of a CNN in the LOFAR for Space Weather (LOFAR4SW) data-stream pipelines.
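The non-CNN baseline can be sketched on simulated dynamic spectra. The toy images below mimic one distinguishing property only: type III bursts drift rapidly in frequency while type II bursts drift slowly. Real RSTN spectra and the Inception/YOLO transfer-learning models are beyond this sketch.

```python
# Sketch: classify toy dynamic spectra (time-frequency images) with a
# random forest; fast vs. slow frequency drift separates the two classes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def burst(drift):
    """32x32 time-frequency image: a drifting bright track plus noise."""
    img = rng.normal(0, 0.3, (32, 32))
    for t in range(32):
        f = int(min(31, drift * t))  # frequency channel lit at time t
        img[f, t] += 2.0
    return img.ravel()

X = np.array([burst(drift=0.25) for _ in range(30)] +   # slow drift: "type II"
              [burst(drift=1.0) for _ in range(30)])    # fast drift: "type III"
y = [2] * 30 + [3] * 30

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33,
                                      random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
print(clf.score(Xte, yte))  # held-out accuracy on the synthetic bursts
```

A CNN would replace the flattened-pixel features with learned convolutional ones, which is what gives the Inception and YOLO models their edge on real, morphologically complex spectra.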


2017 ◽  
Vol 4 (1) ◽  
pp. 56-74 ◽  
Author(s):  
Abinash Tripathy ◽  
Santanu Kumar Rath

Sentiment analysis helps determine the hidden intention of the author of a document on any topic and provides an evaluation of the document's polarity, which may be positive, negative, or neutral. Very often, the data associated with sentiment analysis consists of feedback given by various specialists on a topic or product. Such reviews can be categorized into classes based on polarity in order to gain good knowledge about the product. This article proposes an approach to classify a review dataset into different polarity groups on the basis of sentiment analysis. Four machine learning algorithms, viz. Naive Bayes (NB), Support Vector Machine (SVM), Random Forest, and Linear Discriminant Analysis (LDA), are considered for the classification process. The accuracy values obtained by the algorithms are critically examined using different performance parameters, applied to two different datasets.
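Two of the four classifiers named above can be compared in a minimal sketch. The review texts and polarity labels are invented; the real datasets and the Random Forest and LDA variants are omitted.

```python
# Sketch: compare Naive Bayes and SVM polarity classifiers on a tiny
# invented review set (1 = positive, 0 = negative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

reviews = ["excellent product works perfectly",
           "terrible quality broke immediately",
           "great value highly recommend",
           "awful experience waste of money",
           "love it excellent and reliable",
           "poor build terrible support"]
polarity = [1, 0, 1, 0, 1, 0]

preds = {}
for name, clf in [("NB", MultinomialNB()), ("SVM", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(reviews, polarity)
    preds[name] = model.predict(["excellent quality highly recommend"])[0]
    print(name, preds[name])
```

On real data the two models would be scored against held-out labels with accuracy, precision, and recall rather than eyeballed on a single review.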

