A Fuzzy Technique for On-Line Aggregation of POIs from Social Media: Definition and Comparison with Off-Line Random-Forest Classifiers

Giuseppe Psaila; Maurizio Toccu

doi:10.3390/info10120388

Information ◽

10.3390/info10120388 ◽

2019 ◽

Vol 10 (12) ◽

pp. 388

Author(s):

Giuseppe Psaila ◽

Maurizio Toccu

Keyword(s):

Social Media ◽

Random Forest ◽

Data Sets ◽

Public Place ◽

Public Places ◽

Machine Learning Classification ◽

Classification Technique ◽

Points Of Interest ◽

On Line ◽

Source Of Information

Social media represent an inexhaustible source of user-provided information concerning public places (also called points of interest (POIs)). Several social media own and publish huge, independently built corpora of data about public places, which are not linked to each other. An aggregated view of information concerning the same public place could be extremely useful, but social media are not immutable sources; thus, the off-line approach adopted in all previous research works cannot provide up-to-date information in real time. In this work, we address the problem of aggregating, on-line, geo-located descriptors of public places provided by social media. The on-line approach makes it impossible to adopt machine-learning (classification) techniques trained on previously gathered data sets. We overcome the problem by adopting an approach based on fuzzy logic: we define a binary fuzzy relation, whose on-line evaluation allows for deciding if two public-place descriptors coming from different social media actually describe the same public place. We tested our technique on three data sets, describing public places in Manchester (UK), Genoa (Italy) and Stuttgart (Germany); the comparison with the off-line classification technique called “random forest” showed that our on-line technique obtains comparable results.
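The abstract above does not give the definition of the binary fuzzy relation, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each descriptor carries a name and coordinates, and it combines two hypothetical membership degrees (geographic closeness and name similarity) with the minimum t-norm. The field names, the 20 m and 200 m breakpoints, and the 0.7 threshold are all illustrative assumptions.

```python
import math
from difflib import SequenceMatcher


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def closeness(dist_m, full=20.0, zero=200.0):
    """Hypothetical membership: 1 up to `full` metres, falling linearly to 0 at `zero`."""
    if dist_m <= full:
        return 1.0
    if dist_m >= zero:
        return 0.0
    return (zero - dist_m) / (zero - full)


def name_similarity(a, b):
    """Degree in [0, 1] to which two names match, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def same_place(d1, d2):
    """Fuzzy AND (minimum t-norm) of closeness and name similarity."""
    dist = haversine_m(d1["lat"], d1["lon"], d2["lat"], d2["lon"])
    return min(closeness(dist), name_similarity(d1["name"], d2["name"]))


def is_same_place(d1, d2, threshold=0.7):
    """Crisp decision taken on-line, pair by pair; no training set is required."""
    return same_place(d1, d2) >= threshold
```

Because the relation is evaluated directly on a pair of descriptors, it can be applied as soon as the two descriptors are retrieved, which is the property the on-line setting needs.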

Protein Classification using Machine Learning and Statistical Techniques

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813666190925163758 ◽

2019 ◽

Vol 13 ◽

Author(s):

Chhote Lal Prasad Gupta ◽

Anand Bihari ◽

Sudhakar Tripathi

Keyword(s):

Machine Learning ◽

Random Forest ◽

Protein Classification ◽

Classification Techniques ◽

Random Forest Classification ◽

Machine Learning Classification ◽

Classification Technique ◽

Human Enzyme ◽

Clinical Verification ◽

Enzyme Class

Background: Predicting the enzyme class of an unknown protein is one of the challenging tasks in bioinformatics. The number of known proteins grows daily, which makes clinical verification and classification difficult; as a result, the prediction of enzyme class offers a new opportunity to bioinformatics scholars. Machine learning classification techniques help in protein classification and prediction, but it is imperative to know which technique is best suited to protein classification. This study used human protein data extracted from the UniProtKB databank. In total, 4368 proteins with 45 identified features were used for the experimental analysis. Objective: The prime objective of this article is to find an appropriate classification technique for classifying the reviewed as well as un-reviewed human enzyme classes of protein data, and to determine the significance of different features in protein classification and prediction. Method: Ten of the most significant classification techniques (CRT, QUEST, CHAID, C5.0, ANN, SVM, Bayesian, Random Forest, XgBoost and CatBoost) were used to classify the data and to assess the importance of features. To validate the results of the different classification techniques, accuracy, precision, recall, F-measure, sensitivity, specificity, MCC, ROC and AUROC were used. All experiments were done with the help of SPSS Clementine and Python. Result: The classification techniques give different results, and the data were found to be imbalanced for classes C4, C5, and C6. As a result, all of the classification techniques give acceptable accuracy, above 60%, for these classes, but their precision is very low or negligible. The experimental results highlight that Random Forest gives the highest accuracy as well as AUROC among all techniques, i.e., 96.84% and 0.945 respectively, and also has high precision and recall. Conclusion: The experiments conducted and analyzed in this article highlight that the Random Forest classification technique can be used for the classification and prediction of human enzyme proteins.
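The observation that imbalanced classes can show acceptable accuracy but negligible precision follows from how the metrics are defined. The sketch below computes several of the listed metrics from one-vs-rest confusion counts. The counts in the usage example are invented for illustration and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 and MCC from one-vs-rest confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Made-up rare class: 20 positives among 1000 samples, most of them missed.
rare = binary_metrics(tp=2, fp=8, fn=18, tn=972)
print(rare)
```

With these invented counts the true negatives dominate the accuracy, so it stays high even though only 2 of the 10 positive predictions are correct.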

Impact of the COVID-19 pandemic on the expression of emotions in social media

Multiple Criteria Decision Making ◽

10.22367/mcdm.2020.15.02 ◽

2020 ◽

Vol 15 ◽

pp. 23-35

Author(s):

Debabrata Ghosh ◽

Keyword(s):

Social Media ◽

Logistic Regression ◽

Random Forest ◽

Naive Bayes ◽

Naïve Bayes ◽

Support Vector ◽

Emotion Classification ◽

Machine Learning Classification ◽

Expression Of Emotions ◽

The Mind

In the age of social media, thousands of messages are exchanged every second. Analyzing those unstructured data to identify specific emotions is a challenging task. Analysis of emotions involves evaluation and classification of text into emotion classes such as Happy, Sad, Anger, Disgust, Fear, Surprise, as defined by emotion dimensional models which are described in the theory of psychology (www 1; Russell, 2005). The main goal of this paper is to cover the COVID-19 pandemic situation in India and its impact on human emotions. As people very often express their state of mind through social media, analyzing and tracking their emotions can be very effective in helping government and local authorities take the required measures. We analyzed different machine learning classification models, namely Naïve Bayes, Support Vector Machine, Random Forest Classifier, Decision Tree and Logistic Regression, with 10-fold cross validation to find the top ML models for emotion classification. After hyperparameter tuning, Logistic Regression emerged as the best-suited model, with an accuracy of 77% on the given datasets. We used an algorithm-based supervised ML technique to obtain the expected result. Although multiple studies were conducted earlier along the same lines, none of them performed a comparative study among different ML techniques or hyperparameter tuning to optimize the results. Besides, this study has been done on a dataset from the most recent COVID-19 pandemic situation, which is itself unique. We captured Twitter data for a duration of 45 days with the hashtags #COVID19India OR #COVID19 and analyzed the data using Logistic Regression to find out how emotions changed over time based on certain social factors. Keywords: classification, COVID-19, emotion, emotion analysis, Naïve Bayes, Pandemic, Random Forest, SVM.
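The 10-fold cross validation mentioned above can be sketched in plain Python as follows. This is a generic illustration, not the study's code: the `fit` and `predict` callables, the majority-class baseline, and the fixed seed are assumptions introduced here, and any of the compared classifiers could be plugged in instead.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    if n_samples < k:
        raise ValueError("need at least k samples")
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_val_accuracy(fit, predict, X, y, k=10, seed=0):
    """Mean accuracy over k folds for a model given as fit/predict callables."""
    scores = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(preds, test))
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Hypothetical baseline: always predict the most frequent training label.
def majority_fit(X, y):
    return max(set(y), key=y.count)


def majority_predict(model, X):
    return [model] * len(X)
```

Comparing each candidate model's mean fold accuracy against such a baseline is one simple way to rank models before tuning hyperparameters.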

A Study on Estimating Land Value Distribution for the Talingchan District, Bangkok Using Points-of-Interest Data and Machine Learning Classification

Applied Sciences ◽

10.3390/app112211029 ◽

2021 ◽

Vol 11 (22) ◽

pp. 11029

Author(s):

Morakot Worachairungreung ◽

Kunyaphat Thanakunwutthirot ◽

Sarawut Ninsawat

Keyword(s):

Machine Learning ◽

Random Forest ◽

Real Estate ◽

Value Distribution ◽

Grocery Store ◽

Support Vector ◽

Land Value ◽

Factors Affecting ◽

Machine Learning Classification ◽

Points Of Interest

Land is an essential factor in real estate developments, and each location has its unique characteristics. Land value is a vital cost of real estate developments. Higher land costs mean that project developers must create higher valued products to cover the higher land costs and to maintain a profit level from their developments. Land values vary according to surrounding factors, such as environment, social, and economic situations. Machine learning is a popular data estimation technique that enables a system to learn from sample data; however, there are few studies on its use for estimating land value distribution. Therefore, we aim to apply the technique of machine learning to estimate land value and to investigate the factors affecting the land value in the Talingchan district, Bangkok. We used land value level as the dependent variable, with other factors affecting land value levels as the independent variables. Ten points of interest were chosen from Google Places API. Then, three machine learning algorithms, namely CART, random forest, and support vector machine, were applied. For this study, we selected 45,032 land parcels as the experimental data and randomly divided them into two groups. The first 70% of the land parcels were used to create the training area. The other 30% of the land parcels were used to create the testing area to verify the accuracy of the land value estimation from the applied machine learning techniques. The most accurate machine learning results were produced by random forest, which was then used to measure the factor importance. The academic group factor was school, and the commercial group factors were clothing store, pharmacy, convenience store, hawker stall, grocery store, automatic teller machine, supermarket, restaurant, and company.
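As a rough illustration of the random 70/30 division described above, the sketch below shuffles parcel identifiers and cuts the list at 70%. The integer identifiers, the seed, and the choice to round the cut point down are assumptions, since the abstract does not state how the fractional boundary was handled.

```python
import random


def split_70_30(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of `items` and cut it into training and testing groups."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)  # assumption: round down
    return shuffled[:cut], shuffled[cut:]


# Hypothetical parcel identifiers standing in for the 45,032 land parcels.
train, test = split_70_30(range(45_032))
print(len(train), len(test))  # 31522 13510
```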

Confidentiality of Statistical Records: A Threat-Monitoring Scheme for On Line Dialogue

Methods of Information in Medicine ◽

10.1055/s-0038-1635718 ◽

1976 ◽

Vol 15 (01) ◽

pp. 36-42 ◽

Cited By ~ 14

Author(s):

J. Schlörer

Keyword(s):

Statistical Data ◽

Cost Benefit ◽

Data Bank ◽

High Ratio ◽

Point Of View ◽

Data Sets ◽

Monitoring Scheme ◽

Access Controls ◽

On Line ◽

Bona Fide

From a statistical data bank containing only anonymous records, the records sometimes may be identified and then retrieved, as personal records, by on line dialogue. The risk mainly applies to statistical data sets representing populations, or samples with a high ratio n/N. On the other hand, access controls are unsatisfactory as a general means of protection for statistical data banks, which should be open to large user communities. A threat monitoring scheme is proposed, which will largely block the techniques for retrieval of complete records. If combined with additional measures (e.g., slight modifications of output), it may be expected to render, from a cost-benefit point of view, intrusion attempts by dialogue valueless, if not absolutely impossible. The bona fide user has to pay by some loss of information, but considerable flexibility in evaluation is retained. The proposal of controlled classification included in the scheme may also be useful for off line dialogue systems.

Selection of one-dimensional sedimentation: models for on-line use

Water Science & Technology ◽

10.2166/wst.1995.0100 ◽

1995 ◽

Vol 31 (2) ◽

pp. 193-204 ◽

Cited By ~ 7

Author(s):

Koen Grijspeerdt ◽

Peter Vanrolleghem ◽

Willy Verstraete

Keyword(s):

Steady State ◽

Selection Criteria ◽

Data Sets ◽

Concentration Profiles ◽

A Posteriori ◽

One Dimensional ◽

On Line ◽

Dynamic Concentration ◽

Selection Of ◽

Modelling Task

A comparative study of several recently proposed one-dimensional sedimentation models has been made. This has been achieved by fitting these models to steady-state and dynamic concentration profiles obtained in a down-scaled secondary decanter. The models were evaluated with several a posteriori model selection criteria. Since the purpose of the modelling task is to do on-line simulations, the calculation time was used as one of the selection criteria. Finally, the practical identifiability of the models for the available data sets was also investigated. It could be concluded that the model of Takács et al. (1991) gave the most reliable results.

COVID-19: knowledge, awareness and perceived stress among Jordanian healthcare providers (Preprint)

10.2196/preprints.22978 ◽

2020 ◽

Author(s):

Emad Aborajooh ◽

Mohammed Qussay Al-Sabbagh ◽

Baraa Mafrachi ◽

Muhammad Yassin ◽

Rami Dwairi ◽

...

Keyword(s):

Social Media ◽

Psychological Stress ◽

Health Care Providers ◽

Best Practice ◽

Healthcare Providers ◽

Cross Sectional Study ◽

Care Providers ◽

Cross Sectional ◽

Ordinal Logistic Regression Analysis ◽

Source Of Information

We aimed to measure levels of knowledge, awareness, and stress about COVID-19 among health care providers (HCPs) in Jordan. This was a cross-sectional study of 397 HCPs that used an internet-based questionnaire to evaluate knowledge about COVID-19, availability of personal protective equipment (PPE), future perception, and psychological distress. Ordinal logistic regression analysis was used to evaluate factors associated with knowledge and psychological stress. Overall, 24.4% and 21.2% of the participants showed excellent knowledge and poor knowledge, respectively. Social media (61.7%) was the most commonly used source of information. Being female (β = 0.521, 95% CI 0.049 to 0.992), being a physician (β = 1.421, 95% CI 0.849 to 1.992), or using literature to gain knowledge (β = 1.161, 95% CI 0.657 to 1.664) were positive predictors of higher knowledge, while having higher stress (β = -0.854, 95% CI -1.488 to -0.221) and using social media (β = -0.434, 95% CI -0.865 to -0.003) or conventional media (β = -0.884, 95% CI -1.358 to -0.409) for information were negative predictors of knowledge levels. HCPs are advised to use the literature as a source of information about the virus, its transmission, and best practice. PPE should be secured for HCPs to reduce the psychological stress associated with treating COVID-19 patients.
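Coefficients from an ordinal logistic regression are on the log-odds scale, so exponentiating a β and its confidence limits gives an odds ratio that is often easier to read. The sketch below applies this to two of the reported coefficients. It assumes the usual proportional-odds interpretation, which the abstract does not state explicitly, and the helper name is hypothetical.

```python
import math


def odds_ratio(beta, ci_low, ci_high):
    """Convert a log-odds coefficient and its CI limits to odds-ratio scale."""
    return tuple(round(math.exp(v), 2) for v in (beta, ci_low, ci_high))


# Reported coefficient for being female: beta = 0.521, 95% CI 0.049 to 0.992.
print(odds_ratio(0.521, 0.049, 0.992))

# Reported coefficient for using social media: beta = -0.434, 95% CI -0.865 to -0.003.
print(odds_ratio(-0.434, -0.865, -0.003))
```

A confidence interval that excludes 0 on the β scale corresponds to an odds-ratio interval that excludes 1, which is why the sign of each interval above matches the direction of the reported association.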

Women’s voices on social media: the advent of feminist epidemiology?

Emerging Themes in Epidemiology ◽

10.1186/s12982-021-00097-1 ◽

2021 ◽

Vol 18 (1) ◽

Author(s):

Céline Miani ◽

Yudit Namer

Keyword(s):

Social Media ◽

Third Space ◽

Peer Groups ◽

Research Priorities ◽

Research Process ◽

Power Dynamics ◽

Women's Voices ◽

New Public Health ◽

New Public ◽

On Line

Background: Social media have in recent years challenged the way in which research questions are formulated in epidemiology and medicine, and in particular when it comes to women’s health. They have contributed to the emergence of ‘new’ public health topics (e.g. gynaecological and obstetric violence, long-Covid), the unearthing of testimonials of medical injustice, and in some cases, the creation of new evidence and changes in medical practice. Main text: From a theoretical and methodological perspective, we observe two powerful mechanisms at play on social media, which can facilitate the implementation of feminist epidemiological research and address so-called anti-feminist bias: social media as a ‘third’ space and the power of groups. Social media posts can be seen as inhabiting a third space, akin to what is said off the record or in-between doors, at the end of a therapy session. Researchers somehow miss the opportunity to use the third spaces that people occupy. Similarly, another existing space that researchers are seldom interested in are peer-groups. Peer-groups are the ideal terrain to generate bottom-up research priorities. To some extent, their on-line versions provide a safe and emancipatory space, accessible, transnational, and inclusive. We would argue that this could bring feminist epidemiology to scale. Conclusion: Given the emancipatory power of social media, we propose recommendations and practical implications for leveraging the potential of online-sourced feminist epidemiology at different stages of the research process (from design to dissemination), and for increasing synergies between researchers and the community. We emphasise that attention should be paid to patriarchal sociocultural contexts and power dynamics, the mitigation of risks for political recuperation and stigmatisation, and the co-production of respectful discourse on studied populations.

Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

Journal Of Big Data ◽

10.1186/s40537-021-00488-w ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Yahya Albalawi ◽

Jim Buckley ◽

Nikola S. Nikolov

Keyword(s):

Social Media ◽

Deep Learning ◽

Comprehensive Evaluation ◽

Classification Problem ◽

Data Sets ◽

Word Embeddings ◽

Data Set ◽

Lower Accuracy ◽

Health Related ◽

The Impact

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processings applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processings improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier with F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model with F1 score of 75.2% and accuracy of 90.7% compared to F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second data set.

Benchmarking Crisis in Social Media Analytics: A Solution for the Data-Sharing Problem

Social Science Computer Review ◽

10.1177/08944393211012268 ◽

2021 ◽

pp. 089443932110122

Author(s):

Dennis Assenmacher ◽

Derek Weber ◽

Mike Preuss ◽

André Calero Valdez ◽

Alison Bradshaw ◽

...

Keyword(s):

Social Media ◽

Data Sharing ◽

Algorithm Design ◽

Computational Social Science ◽

Evaluation Framework ◽

Social Media Analytics ◽

Data Sets ◽

The Public ◽

Research Areas ◽

Media Data

Computational social science uses computational and statistical methods in order to evaluate social interaction. The public availability of data sets is thus a necessary precondition for reliable and replicable research. These data allow researchers to benchmark the computational methods they develop, test the generalizability of their findings, and build confidence in their results. When social media data are concerned, data sharing is often restricted for legal or privacy reasons, which makes the comparison of methods and the replicability of research results infeasible. Social media analytics research, consequently, faces an integrity crisis. How is it possible to create trust in computational or statistical analyses, when they cannot be validated by third parties? In this work, we explore this well-known, yet little discussed, problem for social media analytics. We investigate how this problem can be solved by looking at related computational research areas. Moreover, we propose and implement a prototype to address the problem in the form of a new evaluation framework that enables the comparison of algorithms without the need to exchange data directly, while maintaining flexibility for the algorithm design.

A novel framework for designing a multi-DoF prosthetic wrist control using machine learning

Scientific Reports ◽

10.1038/s41598-021-94449-1 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Chinmay P. Swami ◽

Nicholas Lenhard ◽

Jiyeon Kang

Keyword(s):

Machine Learning ◽

Random Forest ◽

Upper Limb ◽

Daily Living ◽

Machine Learning Algorithms ◽

Data Sets ◽

Random Forest Regression ◽

Prosthetic Devices ◽

Upper Limb Function ◽

The Neural Network

Prosthetic arms can significantly increase the upper limb function of individuals with upper limb loss; however, despite the development of various multi-DoF prosthetic arms, the rate of prosthesis abandonment is still high. One of the major challenges is to design a multi-DoF controller that has high precision, robustness, and intuitiveness for daily use. The present study demonstrates a novel framework for developing a controller leveraging machine learning algorithms and movement synergies to implement natural control of a 2-DoF prosthetic wrist for activities of daily living (ADL). The data were collected during ADL tasks of ten individuals with a wrist brace emulating the absence of wrist function. Using these data, the neural network classifies the movement and then random forest regression computes the desired velocity of the prosthetic wrist. The models were trained and tested with ADLs, and their robustness was tested using cross-validation and holdout data sets. The proposed framework demonstrated high accuracy (F-1 score of 99% for the classifier and Pearson’s correlation of 0.98 for the regression). Additionally, the interpretable nature of random forest regression was used to verify the targeted movement synergies. The present work provides a novel and effective framework to develop an intuitive control for multi-DoF prosthetic devices.
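The regression stage above is scored with Pearson's correlation between predicted and measured wrist velocity. A minimal sketch of that statistic is given below. The function name and the sample values are illustrative assumptions and are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Invented example: measured vs. predicted velocities (arbitrary units).
measured = [0.0, 0.5, 1.0, 1.5, 2.0]
predicted = [0.1, 0.4, 1.1, 1.4, 2.1]
print(pearson_r(measured, predicted))
```

A value close to 1 means the predicted velocity rises and falls with the measured one, although it does not by itself show that the two agree in scale or offset.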
