Large-scale Analysis of Counseling Conversations: An Application of Natural Language Processing to Mental Health

Data Lake Ecosystem Workflow

10.21079/11681/40203 ◽

2021 ◽

Author(s):

R. Salter ◽

Quyen Dong ◽

Cody Coleman ◽

Maria Seale ◽

Alicia Ruvinsky ◽

...

Keyword(s):

Big Data ◽

Language Processing ◽

Data Analytics ◽

Large Scale ◽

Big Data Analytics ◽

Lake Ecosystem ◽

Data Governance ◽

Government Organizations ◽

Large Scale Data ◽

Scale Data

The Engineer Research and Development Center, Information Technology Laboratory’s (ERDC-ITL’s) Big Data Analytics team specializes in the analysis of large-scale datasets with capabilities across four research areas that require vast amounts of data to inform and drive analysis: large-scale data governance, deep learning and machine learning, natural language processing, and automated data labeling. Unfortunately, data transfer between government organizations is a complex and time-consuming process requiring coordination of multiple parties across multiple offices and organizations. Past successes in large-scale data analytics have placed a significant demand on ERDC-ITL researchers, highlighting that few individuals fully understand how to successfully transfer data between government organizations; future project success therefore depends on a small group of individuals to efficiently execute a complicated process. The Big Data Analytics team set out to develop a standardized workflow for the transfer of large-scale datasets to ERDC-ITL, in part to educate peers and future collaborators on the process required to transfer datasets between government organizations. Researchers also aim to increase workflow efficiency while protecting data integrity. This report provides an overview of the created Data Lake Ecosystem Workflow by focusing on the six phases required to efficiently transfer large datasets to supercomputing resources located at ERDC-ITL.

Download Full-text

A Natural Language Processing Tool for Large-Scale Data Extraction from Echocardiography Reports

PLoS ONE ◽

10.1371/journal.pone.0153749 ◽

2016 ◽

Vol 11 (4) ◽

pp. e0153749 ◽

Cited By ~ 20

Author(s):

Chinmoy Nath ◽

Mazen S. Albaghdadi ◽

Siddhartha R. Jonnalagadda

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Large Scale ◽

Data Extraction ◽

Large Scale Data ◽

Natural Language Processing Tool ◽

Scale Data

Download Full-text

Natural Language Processing of Medical Alert Service Notes Reveals Reasons for Emergency Admissions

Iproceedings ◽

10.2196/15225 ◽

2019 ◽

Vol 5 (1) ◽

pp. e15225

Author(s):

Felipe Masculo ◽

Jorn op den Buijs ◽

Mariana Simons ◽

Aki Harma

Keyword(s):

Deep Learning ◽

Chest Pain ◽

Language Processing ◽

Large Scale ◽

Emergency Situation ◽

Scale Analysis ◽

Vector Representation ◽

Large Scale Analysis ◽

Structured Information ◽

Insight Into

Background A Personal Emergency Response Service (PERS) enables an aging population to receive help quickly when an emergency situation occurs. The reasons that trigger a PERS alert are varied, including a sudden worsening of a chronic condition, a fall, or other injury. Every PERS case is documented by the response center using a combination of structured variables and free text notes. The text notes, in particular, contain a wealth of information in case of an incident such as contextual information, details about the situation, symptoms and more. Analysis of these notes at a population level could provide insight into the various situations that cause PERS medical alerts. Objective The objectives of this study were to (1) develop methods to enable the large-scale analysis of text notes from a PERS response center, and (2) to apply these methods to a large dataset and gain insight into the different situations that cause medical alerts. Methods More than 2.5 million deidentified PERS case text notes were used to train a document embedding model (ie, a deep learning Recurrent Neural Network [RNN] that takes the medical alert text note as input and produces a corresponding fixed length vector representation as output). We applied this model to 100,000 PERS text notes related to medical incidents that resulted in emergency department admission. Finally, we used t-SNE, a nonlinear dimensionality reduction method, to visualize the vector representation of the text notes in 2D as part of a graphical user interface that enabled interactive exploration of the dataset and visual analytics. Results Visual analysis of the vectors revealed the existence of several well-separated clusters of incidents such as fall, stroke/numbness, seizure, breathing problems, chest pain, and nausea, each of them related to the emergency situation encountered by the patient as recorded in an existing structured variable. In addition, subclusters were identified within each cluster which grouped cases based on additional features extracted from the PERS text notes and not available in the existing structured variables. For example, the incidents labeled as falls (n=37,842) were split into several subclusters corresponding to falls with bone fracture (n=1437), falls with bleeding (n=4137), falls caused by dizziness (n=519), etc. Conclusions The combination of state-of-the-art natural language processing, deep learning, and visualization techniques enables the large-scale analysis of medical alert text notes. This analysis demonstrates that, in addition to falls alerts, the PERS service is broadly used to signal for help in situations often related to underlying chronic conditions and acute symptoms such as respiratory distress, chest pain, diabetic reaction, etc. Moreover, the proposed techniques enable the extraction of structured information related to the medical alert from unstructured text with minimal human supervision. This structured information could be used, for example, to track trends over time, to generate concise medical alert summaries, and to create predictive models for desired outcomes.

Download Full-text

Comparing serotyping with whole-genome sequencing for subtyping of non-typhoidal Salmonella enterica: a large-scale analysis of 37 serotypes with a public health impact in the USA

Microbial Genomics ◽

10.1099/mgen.0.000425 ◽

2020 ◽

Vol 6 (9) ◽

Cited By ~ 1

Author(s):

Ehud Elnekave ◽

Samuel L. Hong ◽

Seunghyun Lim ◽

Timothy J. Johnson ◽

Andres Perez ◽

...

Keyword(s):

Public Health ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Health Impact ◽

Public Health Impact ◽

Whole Genome ◽

Scale Analysis ◽

Large Scale Analysis ◽

The Usa

Serotyping has traditionally been used for subtyping of non-typhoidal Salmonella (NTS) isolates. However, its discriminatory power is limited, which impairs its use for epidemiological investigations of source attribution. Whole-genome sequencing (WGS) analysis allows more accurate subtyping of strains. However, because of the relative newness and cost of routine WGS, large-scale studies involving NTS WGS are still rare. We aimed to revisit the big picture of subtyping NTS with a public health impact by using traditional serotyping (i.e. reaction between antisera and surface antigens) and comparing the results with those obtained using WGS. For this purpose, we analysed 18 282 sequences of isolates belonging to 37 serotypes with a public health impact that were recovered in the USA between 2006 and 2017 from multiple sources, and were available at the National Center for Biotechnology Information (NCBI). Phylogenetic trees were reconstructed for each serotype using the core genome for the identification of genetic subpopulations. We demonstrated that WGS-based subtyping allows better identification of sources potentially linked with human infection and emerging subpopulations, along with providing information on the risk of dissemination of plasmids and acquired antimicrobial resistance genes (AARGs). In addition, by reconstructing a phylogenetic tree with representative isolates from all serotypes (n=370), we demonstrated genetic variability within and between serotypes, which formed monophyletic, polyphyletic and paraphyletic clades. Moreover, we found (in the entire data set) an increased detection rate for AARGs linked to key antimicrobials (such as quinolones and extended-spectrum cephalosporins) over time. The outputs of this large-scale analysis reveal new insights into the genetic diversity within and between serotypes; the polyphyly and paraphyly of certain serotypes may suggest that the subtyping of NTS to serotypes may not be sufficient. Moreover, the results and the methods presented here, leading to differentiation between genetic subpopulations based on their potential risk to public health, as well as narrowing down the possible sources of these infections, may be used as a baseline for subtyping of future NTS infections and help efforts to mitigate and prevent infections in the USA and globally.

Download Full-text

Evaluating the healthiness of chain-restaurant menu items using crowdsourcing: a new method

Public Health Nutrition ◽

10.1017/s1368980016001804 ◽

2016 ◽

Vol 20 (1) ◽

pp. 18-24 ◽

Cited By ~ 3

Author(s):

Lenard I Lesser ◽

Leslie Wu ◽

Timothy B Matthiessen ◽

Harold S Luft

Keyword(s):

Public Health ◽

Nutritional Quality ◽

Large Scale ◽

Registered Dietitian ◽

Food Items ◽

Large Scale Data ◽

Nutrient Profiling ◽

The Cost ◽

Scale Data

AbstractObjectiveTo develop a technology-based method for evaluating the nutritional quality of chain-restaurant menus to increase the efficiency and lower the cost of large-scale data analysis of food items.DesignUsing a Modified Nutrient Profiling Index (MNPI), we assessed chain-restaurant items from the MenuStat database with a process involving three steps: (i) testing ‘extreme’ scores; (ii) crowdsourcing to analyse fruit, nut and vegetable (FNV) amounts; and (iii) analysis of the ambiguous items by a registered dietitian.ResultsIn applying the approach to assess 22 422 foods, only 3566 could not be scored automatically based on MenuStat data and required further evaluation to determine healthiness. Items for which there was low agreement between trusted crowd workers, or where the FNV amount was estimated to be >40 %, were sent to a registered dietitian. Crowdsourcing was able to evaluate 3199, leaving only 367 to be reviewed by the registered dietitian. Overall, 7 % of items were categorized as healthy. The healthiest category was soups (26 % healthy), while desserts were the least healthy (2 % healthy).ConclusionsAn algorithm incorporating crowdsourcing and a dietitian can quickly and efficiently analyse restaurant menus, allowing public health researchers to analyse the healthiness of menu items.

Download Full-text

Best Paper Selection

Yearbook of Medical Informatics ◽

10.1055/s-0038-1641129 ◽

2017 ◽

Vol 26 (01) ◽

pp. e21-e22

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Large Scale ◽

Clinical Evidence ◽

Scale Analysis ◽

Clinical Text ◽

Large Scale Analysis ◽

Open Source Framework

Althoff, T, Clark K, Leskovec, J. Large-scale Analysis of Counseling Conversations: An Application of Natural Language Processing to Mental Health. Trans Assoc Comput Linguist 2016(4):463-76 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5361062/ Kilicoglu, H, Demner-Fushman, D. Bio-SCoRes: A Smorgasbord Architecture for Coreference Resolution in Biomedical Text. PLoS One. 2016 Mar 2;11(3):e0148538 http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148538 Morid, MA, Fiszman, M, Raja, K, Jonnalagadda, SR, Del Fiol, G. Classification of clinically useful sentences in clinical evidence resources. J Biomed Inform. 2016 Apr;60:14-22 http://www.sciencedirect.com/science/article/pii/S1532046416000046?via%3Dihub Shivade C, de Marneffe MC, Fosler-Lussier E, Lai AM. Identification, characterization, and grounding of gradable terms in clinical text. Proceedings of the 15th Workshop on Biomedical Natural Language Processing. 2016:17-26 https://www.semanticscholar.org/paper/Identification-characterization-and-grounding-of-g-Shivade-Marneffe/c00ba120de1964b444807255030741d199ba6e04 Wu, Y, Denny, JC, Rosenbloom, ST, Miller, RA, Giuse, DA, Wang, L, Blanquicett, C, Soysal, E, Xu, J, Xu, H. A long journey to short abbreviations: developing an open-source framework for clinical abbreviation recognition and disambiguation (CARD). J Am Med Inform Assoc 2017 Apr 1;24(e1):e79-e86 https://academic.oup.com/jamia/article-abstract/24/e1/e79/2631496/A-long-journey-to-short-abbreviations-developing?redirectedFrom=fulltext

Download Full-text

Glyfn: A Glyph-Aware Fusion Network for Distributed Chinese Event Detection

10.5121/csit.2021.110114 ◽

2021 ◽

Author(s):

Qi Zhai ◽

Zhigang Kan ◽

Linhui Feng ◽

Linbo Qiao ◽

Feng Liu

Keyword(s):

Event Detection ◽

Large Scale ◽

State Of The Art ◽

Language Model ◽

Special Kind ◽

Detection Task ◽

Experimental Results ◽

Large Scale Data ◽

Unstructured Text ◽

Scale Data

Recently, Chinese event detection has attracted more and more attention. As a special kind of hieroglyphics, Chinese glyphs are semantically useful but still unexplored in this task. In this paper, we propose a novel Glyph-Aware Fusion Network, named GlyFN. It introduces the glyphs' information into the pre-trained language model representation. To obtain a better representation, we design a Vector Linear Fusion mechanism to fuse them. Specifically, it first utilizes a max-pooling to capture salient information. Then, we use the linear operation of vectors to retain unique information. Moreover, for large-scale unstructured text, we distribute the data into different clusters parallelly. Finally, we conduct extensive experiments on ACE2005 and large-scale data. Experimental results show that GlyFN obtains increases of 7.48(10.18%) and 6.17(8.7%) in the F1-score for trigger identification and classification over the state-of-the-art methods, respectively. Furthermore, the event detection task for large-scale unstructured text can be efficiently accomplished through distribution.

Download Full-text

Deep Context: A Neural Language Model for Large-scale Networked Documents

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/431 ◽

2017 ◽

Cited By ~ 1

Author(s):

Hao Wu ◽

Kristina Lerman

Keyword(s):

Word Order ◽

Link Prediction ◽

Large Scale ◽

Language Model ◽

Distributed Representations ◽

Large Scale Data ◽

Context Vector ◽

Effectiveness And Efficiency ◽

Data Collections ◽

Scale Data

We propose a scalable neural language model that leverages the links between documents to learn the deep context of documents. Our model, Deep Context Vector, takes advantage of distributed representations to exploit the word order in document sentences, as well as the semantic connections among linked documents in a document network. We evaluate our model on large-scale data collections that include Wikipedia pages, and scientific and legal citations networks. We demonstrate its effectiveness and efficiency on document classification and link prediction tasks.

Download Full-text

Best Paper Selection

Yearbook of Medical Informatics ◽

10.1055/s-0037-1606508 ◽

2017 ◽

Vol 26 (01) ◽

pp. 233-234

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Large Scale ◽

Clinical Evidence ◽

Scale Analysis ◽

Clinical Text ◽

Large Scale Analysis ◽

Open Source Framework

Althoff, T, Clark K, Leskovec, J. Large-scale Analysis of Counseling Conversations: An Application of Natural Language Processing to Mental Health. Trans Assoc Comput Linguist 2016(4):463-76 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5361062/ Kilicoglu, H, Demner-Fushman, D. Bio-SCoRes: A Smorgasbord Architecture for Coreference Resolution in Biomedical Text. PLoS One. 2016 Mar 2;11(3):e0148538 http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148538 Morid, MA, Fiszman, M, Raja, K, Jonnalagadda, SR, Del Fiol, G. Classification of clinically useful sentences in clinical evidence resources. J Biomed Inform. 2016 Apr;60:14-22 http://www.sciencedirect.com/science/article/pii/S1532046416000046?via%3Dihub Shivade C, de Marneffe MC, Fosler-Lussier E, Lai AM. Identification, characterization, and grounding of gradable terms in clinical text. Proceedings of the 15th Workshop on Biomedical Natural Language Processing. 2016:17-26 https://www.semanticscholar.org/paper/Identification-characterization-and-grounding-of-g-Shivade-Marneffe/c00ba120de1964b444807255030741d199ba6e04 Wu, Y, Denny, JC, Rosenbloom, ST, Miller, RA, Giuse, DA, Wang, L, Blanquicett, C, Soysal, E, Xu, J, Xu, H. A long journey to short abbreviations: developing an open-source framework for clinical abbreviation recognition and disambiguation (CARD). J Am Med Inform Assoc 2017 Apr 1;24(e1):e79-e86 https://academic.oup.com/jamia/article-abstract/24/e1/e79/2631496/A-long-journey-to-short-abbreviations-developing?redirectedFrom=fulltext

Download Full-text

Measuring happiness of large-scaled online Turkish unstructured data (Preprint)

10.2196/preprints.24037 ◽

2020 ◽

Author(s):

Esra Kahya Özyirmidokuz ◽

Kumru Uyar ◽

Raian Ali ◽

Eduard Alexandru Stoica ◽

Betül Karakaş

Keyword(s):

Social Networks ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Large Scale ◽

Well Being ◽

Emotional Awareness ◽

Large Scale Data ◽

Processing Algorithms ◽

Scale Data

BACKGROUND Measuring online Turkish happiness requires a Turkish happiness dictionary which could reflect norms and social values more culturally and linguistically instead of using a translation-oriented method. Analyzing data without neglecting cultural characteristics will not be reliable. Turkish translation of an English word in the Affective Norms of English Words (ANEW) dictionary does not express the same feeling of a Turkish word. In addition, existing emotional dictionaries are not developed for specifically for the social networks with emoticons. OBJECTIVE This research presents the Turkish Happiness Index (THI) which is a set of psychological normative happiness scores to measure an average level of happiness of Turkish online unstructured large-scale data. A well-being informatics analytics research is also done by using THI. METHODS Turkish Happiness Index was completely generated on social networks. 20000 words were extracted with web text mining from social networks. Natural Language Processing algorithms were applied. After data reduction quantitative research methodology is applied. The happiness scores were based detected based on 667 participants’ subjective happiness levels and their thoughts about the 1874 Turkish words. Alexithymia scale was also used to identify the emotional awareness of the participants. The evaluations of the words were done in the dimension of valence using the Self-Assessment Manikin in an online platform. NLP was used to measure online Turkish happiness of data. Data was collected from Facebook with negative #war and positive #family hashtags in a duration of one month using a 3rd party software tool. Natural language processing algorithms including tokenization, transformation, filtering and stemming after converting data to documents. The happiness levels of the documents based on hashtags were determined using the Turkish Happiness Index dictionary. RESULTS THI which contains 345 words and their happiness scores in the Turkish language was developed. The THI is given in Appendix 1. We also put a comparison between words of dictionaries to understand the cultural differences. CONCLUSIONS THI provide researchers with standard materials through which they can automatically measure online happiness of Turkish large-scale data. THI can be used in in real-time big data analytics.

Download Full-text