Adapting for Informal Language in Arabic Twitter Improves Monitoring of COVID-19 Pandemic and Influenza Epidemic (Preprint)
BACKGROUND Twitter is a real-time messaging platform widely used by people and organisations to share information on many topics. Analysing tweets could be useful for infectious disease monitoring: it could reduce reporting lag time and provide an independent, complementary source of data alongside traditional approaches. However, such analysis is currently not possible in the Arabic-speaking world due to a lack of basic building blocks for research.
OBJECTIVE We collect around 4,000 Arabic tweets related to COVID-19 and Influenza. We clean and label the tweets against the Arabic Infectious Diseases Ontology, which includes non-standard terminology, 11 core concepts, and 21 relations. The aim of this study is to analyse Arabic tweets to estimate their usefulness for health surveillance, understand the impact of informal terms on the analysis, assess the effect of deep learning methods on classification, and identify the locations where the infection is spreading.
METHODS We apply multi-label classification techniques to identify infected people: Binary Relevance, Classifier Chains, Label Powerset, Adapted Algorithm (MLKNN), NBSVM, BERT, and AraBERT. We also use Named Entity Recognition to predict the affected locations.
RESULTS We achieve an F1-score of up to 88% in the Influenza case study and up to 94% in the COVID-19 case study. Adapting for non-standard terminology and informal language improves accuracy by as much as 15%, with an average improvement of 8%. Deep learning methods achieve a Hamming loss of around 5% in classification (lower is better). Our geo-location detection algorithm predicts users' locations from tweet content with an average accuracy of 54%.
CONCLUSIONS This study identifies two Arabic social media datasets for monitoring tweets related to Influenza and COVID-19. It demonstrates the importance of including informal terms, which are regularly used by social media users, in the analysis. It also shows that BERT achieves good results when used with new terms in COVID-19 tweets. Finally, tweet content may contain useful information for determining where the disease is spreading.
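As an illustration of the multi-label setup named in the methods, the following minimal Python sketch shows how Hamming loss is computed and how the Binary Relevance and Label Powerset transformations reshape a label matrix. It is not the authors' implementation: the three label names, the toy label matrices, and the function names are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical label set, for illustration only (not taken from the study).
LABELS = ["infected", "mentions_symptom", "mentions_location"]


def hamming_loss(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of individual label decisions that are wrong (lower is better)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of rows")
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("rows must have the same number of labels")
        total += len(true_row)
        wrong += sum(1 for t, p in zip(true_row, pred_row) if t != p)
    if total == 0:
        raise ValueError("no label decisions to score")
    return wrong / total


def binary_relevance_targets(y: Sequence[Sequence[int]]) -> List[List[int]]:
    """Binary Relevance: one independent binary target column per label."""
    return [list(column) for column in zip(*y)]


def label_powerset_targets(
    y: Sequence[Sequence[int]],
) -> Tuple[List[int], Dict[Tuple[int, ...], int]]:
    """Label Powerset: each distinct label combination becomes one class id."""
    class_ids: Dict[Tuple[int, ...], int] = {}
    targets: List[int] = []
    for row in y:
        key = tuple(row)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # ids assigned in order of first appearance
        targets.append(class_ids[key])
    return targets, class_ids


# Toy data: 4 tweets x 3 labels (made up for this sketch).
y_true = [[1, 1, 0], [0, 1, 0], [1, 0, 1], [0, 0, 0]]
y_pred = [[1, 1, 0], [0, 0, 0], [1, 0, 1], [0, 0, 1]]

loss = hamming_loss(y_true, y_pred)                # 2 wrong decisions out of 12
per_label = binary_relevance_targets(y_true)       # 3 columns, one per label
powerset, mapping = label_powerset_targets(y_true)
```

Under this definition, a Hamming loss of 5% means that roughly 5 of every 100 individual label decisions are incorrect. Binary Relevance trains one classifier per column of `per_label`, whereas Label Powerset trains a single multi-class classifier over the ids in `powerset`; Classifier Chains sit between the two by feeding earlier label predictions into later classifiers.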