Who is Tweeting? A Scoping Review of Methods to Establish Race and Ethnicity from Twitter Datasets (Preprint)

2021 ◽  
Author(s):  
Su Golder ◽  
Robin Stevens ◽  
Karen O'Connor ◽  
Richard James ◽  
Graciela Gonzalez-Hernandez

BACKGROUND A growing amount of health research uses social media data. Critics of social media research often argue that it may be unrepresentative of the population, but the suitability of social media data in digital epidemiology is more nuanced. Identifying the demographics of social media users can help establish representativeness. OBJECTIVE We sought to identify the different approaches, or combinations of approaches, used to extract race or ethnicity from social media and to report on the challenges of using these methods. METHODS We present a scoping review to identify the methods used to extract race or ethnicity from Twitter datasets. We searched 17 electronic databases and carried out reference checking and handsearching in order to identify relevant articles. Sifting of each record was undertaken independently by at least two researchers, with any disagreement discussed. The included studies could be categorized by the methods the authors applied to extract race or ethnicity. RESULTS From 1249 records we identified 67 that met our inclusion criteria. The majority focus on US-based users and English-language tweets. Several types of data were used, including Twitter profile pictures, information from bios (such as names or self-declarations), location, and/or the content of the tweets themselves. Methodologies included manual inference, linkage to census data, commercial software, language/dialect recognition and machine learning. Not all studies evaluated their methods. Those that did found accuracy to vary from 45% to 93%, with significantly lower accuracy in identifying non-white race categories. The inference of race/ethnicity raises important ethical questions, which can be exacerbated by the data and methods used. The comparative accuracy of different methods is also largely unknown. 
CONCLUSIONS There is no standard accepted approach or current guidelines for extracting or inferring the race or ethnicity of Twitter users. Social media researchers must interpret race or ethnicity carefully and not over-promise what can be achieved, as even manual screening is a subjective, imperfect method. Future research should establish the accuracy of methods to inform evidence-based best practice guidelines for social media researchers, and be guided by concerns of equity and social justice.
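The point about lower accuracy for some categories can be illustrated with a minimal sketch, assuming a labelled evaluation set is available. The function name `per_category_accuracy`, the placeholder category labels "A" and "B", and the toy data are all hypothetical and do not come from any of the reviewed studies; the sketch only shows how a high overall accuracy can mask a much lower accuracy within a smaller category.

```python
from collections import defaultdict


def per_category_accuracy(true_labels, predicted_labels):
    """Return (overall accuracy, {category: accuracy within that category})."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    totals = defaultdict(int)   # number of users whose true label is the category
    correct = defaultdict(int)  # number of those users labelled correctly
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == guess:
            correct[truth] += 1
    overall = sum(correct.values()) / len(true_labels)
    by_category = {cat: correct[cat] / totals[cat] for cat in totals}
    return overall, by_category


# Hypothetical toy data: 8 users in placeholder category "A", 2 in "B".
truth = ["A"] * 8 + ["B"] * 2
guess = ["A"] * 8 + ["A", "B"]  # one of the two "B" users is mislabelled
overall, by_category = per_category_accuracy(truth, guess)
print(overall, by_category)
```

Reporting the per-category figures alongside the overall figure is one way to make such disparities visible; it is a design suggestion here, not a method taken from the review.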

2021 ◽  
Author(s):  
Su Golder ◽  
Robin Stevens ◽  
Karen O'Connor ◽  
Richard James ◽  
Graciela Gonzalez-Hernandez

Background: A growing amount of health research uses social media data. Critics of social media research often argue that it may be unrepresentative of the population. Identifying the demographics of social media users enables us to measure representativeness. Extracting race or ethnicity from social media data can be difficult, and researchers may choose from a multitude of different approaches. Methods: We present a scoping review to identify the methods used to extract race or ethnicity from Twitter datasets. We searched 16 electronic databases and carried out reference checking in order to identify relevant articles. Sifting of each record was undertaken independently by at least two researchers, with any disagreement discussed. The research could be grouped by the methods applied to extract race or ethnicity. Results: From 1093 records we identified 56 that met our inclusion criteria. The majority focus on Twitter users based in the US. Several types of data were used, including Twitter profile pictures, bios, and/or location, and the content of the tweets themselves. The methods used were wide ranging and included manual inference, linkage to census data, commercial software, language/dialect recognition and machine learning. Not all studies evaluated their methods. Those that did found accuracy to vary from 45% to 93%, with significantly lower accuracy in identifying non-white race categories. Some of the methods used, particularly those relying on photos or dialect, may raise ethical questions, as well as questions surrounding accuracy. Conclusion: There is no standard approach or guidelines for extracting race or ethnicity from Twitter or other social media. Social media researchers must interpret race or ethnicity carefully and not over-promise what can be achieved, as even manual screening is a subjective, imperfect method. 
Future research should establish the accuracy of methods to inform evidence-based best practice guidelines for social media researchers, and be guided by concerns of equity and social justice.


2015 ◽  
Vol 23 (3) ◽  
pp. 644-648 ◽  
Author(s):  
Hopin Lee ◽  
James H McAuley ◽  
Markus Hübscher ◽  
Heidi G Allen ◽  
Steven J Kamper ◽  
...  

Background Back pain is a global health problem. Recent research has shown that risk factors that are proximal to the onset of back pain might be important targets for preventive interventions. Rapid communication through social media might be useful for delivering timely interventions that target proximal risk factors. Identifying individuals who are likely to discuss back pain on Twitter could provide useful information to guide online interventions. Methods We used a case-crossover study design for a sample of 742 028 tweets about back pain to quantify the risks associated with a new tweet about back pain. Results The odds of tweeting about back pain just after tweeting about selected physical, psychological, and general health factors were 1.83 (95% confidence interval [CI], 1.80-1.85), 1.85 (95% CI: 1.83-1.88), and 1.29 (95% CI, 1.27-1.30), respectively. Conclusion These findings give directions for future research that could use social media for innovative public health interventions.
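As a rough illustration of how an odds ratio with a 95% confidence interval can be obtained from self-matched data, the following sketch uses the standard estimator for 1:1 matched pairs (ratio of discordant pairs, with a Wald interval on the log scale). The abstract does not state which estimator the authors used, so this is an assumption; the function name and the counts are hypothetical.

```python
import math


def matched_pair_odds_ratio(hazard_only, control_only, z=1.96):
    """Odds ratio and Wald confidence interval from discordant matched pairs.

    hazard_only:  pairs exposed in the hazard window but not the control window
    control_only: pairs exposed in the control window but not the hazard window
    z:            normal quantile (1.96 gives an approximate 95% interval)
    """
    if hazard_only <= 0 or control_only <= 0:
        raise ValueError("both discordant counts must be positive")
    odds_ratio = hazard_only / control_only
    # Standard error of log(OR) for matched pairs.
    se_log = math.sqrt(1 / hazard_only + 1 / control_only)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only for illustration (not the study's data).
print(matched_pair_odds_ratio(hazard_only=150, control_only=100))
```

Concordant pairs (exposed in both windows or in neither) do not enter this estimator, which is why only the two discordant counts are passed in.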


BMJ Open ◽  
2018 ◽  
Vol 8 (12) ◽  
pp. e022931 ◽  
Author(s):  
Joanna Taylor ◽  
Claudia Pagliari

Introduction The rising popularity of social media, since their inception around 20 years ago, has been echoed in the growth of health-related research using data derived from them. This has created a demand for literature reviews to synthesise this emerging evidence base and inform future activities. Existing reviews tend to be narrow in scope, with limited consideration of the different types of data, analytical methods and ethical issues involved. There has also been a tendency for research to be siloed within different academic communities (eg, computer science, public health), hindering knowledge translation. To address these limitations, we will undertake a comprehensive scoping review, to systematically capture the broad corpus of published, health-related research based on social media data. Here, we present the review protocol and the pilot analyses used to inform it. Methods A version of Arksey and O’Malley’s five-stage scoping review framework will be followed: (1) identifying the research question; (2) identifying the relevant literature; (3) selecting the studies; (4) charting the data and (5) collating, summarising and reporting the results. To inform the search strategy, we developed an inclusive list of keyword combinations related to social media, health and relevant methodologies. The frequency and variability of terms were charted over time and cross-referenced with significant events, such as the advent of Twitter. Five leading health, informatics, business and cross-disciplinary databases will be searched: PubMed, Scopus, Association for Computing Machinery, Institute of Electrical and Electronics Engineers and Applied Social Sciences Index and Abstracts, alongside the Google search engine. There will be no restriction by date. Ethics and dissemination The review focuses on published research in the public domain; therefore, no ethics approval is required. 
The completed review will be submitted for publication to a peer-reviewed, interdisciplinary open access journal, and conferences on public health and digital research.


2021 ◽  
Author(s):  
Nick Boettcher

BACKGROUND The study of depression and anxiety using publicly available social media data is a research activity that has grown considerably over the last decade. The discussion platform Reddit has become a popular social media data source in this nascent area of study, in part because of the unique ways in which the platform facilitates research. To date, no work has been done to synthesize existing studies of depression and anxiety using Reddit. OBJECTIVE The objective of this review is to understand the scope and nature of research using Reddit as a primary data source for studying depression and anxiety. METHODS A scoping review was conducted using the Arksey and O’Malley framework. Academic databases searched include MEDLINE/PubMed, EMBASE, CINAHL, PsycINFO, PsycARTICLES, Scopus, ScienceDirect, IEEE Xplore, and ACM database. Inclusion criteria were developed using the Participants/Concept/Context framework outlined by the Joanna Briggs Institute Scoping Review Methodology Group. Eligible studies featured a methodological focus on analyzing depression and/or anxiety using naturalistic written expressions from Reddit users as the primary data source. RESULTS A total of 54 studies were included for review. Tables and corresponding analysis delineate key methodological features, including a comparatively larger focus on depression versus anxiety, an even split of original and premade datasets, a favored analytic focus on classifying the mental health states of Reddit users, and practical implications often recommending new methods of professionally driven mental health monitoring and outreach for Reddit users. CONCLUSIONS Studies of depression and anxiety using Reddit data are currently driven by a prevailing methodology which favors a technical, solution-based orientation. Researchers interested in advancing this research area will benefit from further consideration of conceptual issues surrounding interpretation of Reddit data with the medical model of mental health. 
Further efforts are also needed to locate accountability and autonomy within practice implications suggesting new forms of engagement with Reddit users.


Author(s):  
Yonghong Tong ◽  
Muhammet Bakan

With the increasing use of mobile devices and social media, a large amount of continuous information about human behaviors is available. Data visualization provides an insightful presentation of large-scale social media datasets. The focus of this paper is on the development of a mobile-device-based visualization and analysis platform for social media data, for the purpose of retrieving and visualizing visitors’ information for a specific region. The developed platform allows users to view the “big picture” of visitors’ location information. The results show that the developed platform 1) performs satisfactory data collection and data visualization on a mobile device, 2) helps users to understand the variety of human behaviors while visiting a place, and 3) offers a feasible means of imaging immediate information from social media and informing further policy-making in related sectors and areas. Future research opportunities and challenges for social media data visualization are discussed. Keywords: Social media, data visualization, mobile device


2021 ◽  
Vol 8 (1) ◽  
pp. 205395172110103
Author(s):  
Sabina Leonelli ◽  
Rebecca Lovell ◽  
Benedict W Wheeler ◽  
Lora Fleming ◽  
Hywel Williams

The paper problematises the reliability and ethics of using social media data, such as sourced from Twitter or Instagram, to carry out health-related research. As in many other domains, the opportunity to mine social media for information has been hailed as transformative for research on well-being and disease. Considerations around the fairness, responsibilities and accountabilities relating to using such data have often been set aside, on the understanding that as long as data were anonymised, no real ethical or scientific issue would arise. We first counter this perception by emphasising that the use of social media data in health research can yield problematic and unethical results. We then provide a conceptualisation of methodological data fairness that can complement data management principles such as FAIR by enhancing the actionability of social media data for future research. We highlight the forms that methodological data fairness can take at different stages of the research process and identify practical steps through which researchers can ensure that their practices and outcomes are scientifically sound as well as fair to society at large. We conclude that making research data fair as well as FAIR is inextricably linked to concerns around the adequacy of data practices. The failure to act on those concerns raises serious ethical, methodological and epistemic issues with the knowledge and evidence that are being produced.


2021 ◽  
Author(s):  
Ming Yi Tan ◽  
Charlene Enhui Goh ◽  
Hee Hon Tan

BACKGROUND Pain description is fundamental to health care. The McGill Pain Questionnaire (MPQ) has been validated as a tool for the multidimensional measurement of pain; however, its use relies heavily on language proficiency. Although the MPQ has remained unchanged since its inception, the English language has evolved significantly since then. The advent of the internet and social media has allowed for the generation of a staggering amount of publicly available data, allowing linguistic analysis at a scale never seen before. OBJECTIVE The aim of this study is to use social media data to examine the relevance of pain descriptors from the existing MPQ, identify novel contemporary English descriptors for pain among users of social media, and suggest a modification for a new MPQ for future validation and testing. METHODS All posts from social media platforms from January 1, 2019, to December 31, 2019, were extracted. Artificial intelligence and emotion analytics algorithms (Crystalace and CrystalFeel) were used to measure the emotional properties of the text, including sarcasm, anger, fear, sadness, joy, and valence. Word2Vec was used to identify new pain descriptors associated with the original descriptors from the MPQ. Analysis of count and pain intensity formed the basis for proposing new pain descriptors and determining the order of pain descriptors within each subclass. RESULTS A total of 118 new associated words were found via Word2Vec. Of these 118 words, 49 (41.5%) words had a count of at least 110, which corresponded to the count of the bottom 10% (8/78) of the original MPQ pain descriptors. The count and intensity of pain descriptors were used to formulate the inclusion criteria for a new pain questionnaire. 
For the suggested new pain questionnaire, 11 existing pain descriptors were removed, 13 new descriptors were added to existing subclasses, and a new Psychological subclass comprising 9 descriptors was added. CONCLUSIONS This study presents a novel methodology using social media data to identify new pain descriptors and can be repeated at regular intervals to ensure the relevance of pain questionnaires. The original MPQ contains several potentially outdated pain descriptors and is inadequate for reporting the psychological aspects of pain. Further research is needed to examine the reliability and validity of the revised MPQ.
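The count-based part of the inclusion rule reported above can be sketched as follows. The proportions (49 of 118, 8 of 78) and the threshold of 110 are taken from the abstract; the helper names `percent` and `filter_by_count`, and the candidate words with their counts, are hypothetical and serve only to show the filtering step. The intensity criterion is not modelled, since the abstract does not describe it in detail.

```python
def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


def filter_by_count(candidate_counts, min_count):
    """Keep candidate descriptors whose corpus count is at least `min_count`."""
    return {word: n for word, n in candidate_counts.items() if n >= min_count}


# Proportions reported in the abstract.
print(percent(49, 118))  # 41.5
print(percent(8, 78))    # 10.3, described in the abstract as the bottom 10%

# Hypothetical candidate words and counts, for illustration only.
candidates = {"draining": 110, "zappy": 40, "crushing": 250}
kept = filter_by_count(candidates, min_count=110)
print(kept)
```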


2020 ◽  
Vol 5 ◽  
pp. 44
Author(s):  
Nina H. Di Cara ◽  
Andy Boyd ◽  
Alastair R. Tanner ◽  
Tarek Al Baghal ◽  
Lisa Calderwood ◽  
...  

Background: Cohort studies gather huge volumes of information about a range of phenotypes, but new sources of information such as social media data are yet to be integrated. Participants’ long-term engagement with cohort studies, as well as the potential for their social media data to be linked to other longitudinal data, could provide novel advances but may also give participants a unique perspective on the acceptability of this growing research area. Methods: Two focus groups explored participant views towards the acceptability and best practice for the collection of social media data for research purposes. Participants were drawn from the Avon Longitudinal Study of Parents and Children cohort; individuals from the index cohort of young people (N=9) and from the parent generation (N=5) took part in two separate 90-minute focus groups. The discussions were audio recorded and subjected to qualitative analysis. Results: Participants were generally supportive of the collection of social media data to facilitate health and social research. They felt that their trust in the cohort study would encourage them to do so. Concern was expressed about the collection of data from friends or connections who had not consented. In terms of best practice for collecting the data, participants generally preferred the use of anonymous data derived from social media to be shared with researchers. Conclusion: Cohort studies have trusting relationships with their participants; for this relationship to extend to linking their social media data with longitudinal information, procedural safeguards are needed. Participants understand the goals and potential of research integrating social media data into cohort studies, but further research is required on the acquisition of their friends’ data. The views gathered from participants provide important guidance for future work seeking to integrate social media in cohort studies.


2020 ◽  
Author(s):  
Benjamin Lucas ◽  
Liana Bravo-Balsa ◽  
Vicky Brotherton ◽  
Nicola Wright ◽  
Todd Landman

In this working paper, we investigate high-level changes in the online strategic communications of organizations engaged with SDG 8.7 (ending modern slavery) during the COVID-19 crisis. We present preliminary evidence of important semantic and thematic shifts based on data from Twitter during this time, with an emphasis on developing the SOLACE (Social Listening and Communications Engagement) dashboard, and with recommendations for important future research involving the use of social media data as a basis for distilling organizational-agenda proxies based on digital campaigns and activism during times of crisis.


2019 ◽  
Vol 5 (1) ◽  
pp. 205630511983458
Author(s):  
Yan Wang ◽  
Wenchao Yu ◽  
Sam Liu ◽  
Sean D. Young

Crime monitoring tools are needed for public health and law enforcement officials to deploy appropriate resources and develop targeted interventions. Social media, such as Twitter, has been shown to be a feasible tool for monitoring and predicting public health events such as disease outbreaks. Social media might also serve as a feasible tool for crime surveillance. In this study, we collected Twitter data between May and December 2012 and crime data for the years 2012 and 2013 in the United States. We examined the association between crime data and drug-related tweets. We found that tweets from 2012 were strongly associated with county-level crime data in both 2012 and 2013. This study presents preliminary evidence that social media data can be used to help predict future crimes. We discuss how future research can build upon this initial study to further examine the feasibility and effectiveness of this approach.

