A Framework Based on K-Means Clustering and Topic Modeling for Analyzing Unstructured Manufacturing Capability Data

Ramin Sabbagh; Farhad Ameri

doi:10.1115/1.4044506

A Framework Based on K-Means Clustering and Topic Modeling for Analyzing Unstructured Manufacturing Capability Data

Journal of Computing and Information Science in Engineering ◽

10.1115/1.4044506 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 3

Author(s):

Ramin Sabbagh ◽

Farhad Ameri

Keyword(s):

Natural Language ◽

Language Processing ◽

Topic Modeling ◽

Quantitative Methods ◽

Ad Hoc ◽

Semantic Analysis ◽

Reduction Process ◽

Manufacturing Companies ◽

Manufacturing Capability ◽

N Gram

Abstract The natural language descriptions of the capabilities of manufacturing companies can be found in multiple locations including company websites, legacy system databases, and ad hoc documents and spreadsheets. To unlock the value of unstructured capability data and learn from it, there is a need for developing advanced quantitative methods supported by machine learning and natural language processing techniques. This research proposes a hybrid unsupervised learning methodology using K-means clustering and topic modeling techniques in order to build clusters of suppliers based on their capabilities, automatically infer topics from the created clusters, and discover nontrivial patterns in manufacturing capability corpora. The capability data is extracted either directly from the website of manufacturing firms or from their profiles in e-sourcing portals and directories. Feature extraction and dimensionality reduction process in this work are supported by N-gram extraction and latent semantic analysis (LSA) methods. The proposed clustering method is validated experimentally based on a dataset composed of 150 capability descriptions collected from web-based sourcing directories such as the Thomas Net directory for manufacturing companies. The results of the experiment show that the proposed method creates supplier cluster with high accuracy. Two example applications of the proposed framework, related to supplier similarity measurement and automated thesaurus creation, are introduced in this paper.

Download Full-text

Supplier Clustering Based on Unstructured Manufacturing Capability Data

Volume 1B: 38th Computers and Information in Engineering Conference ◽

10.1115/detc2018-85865 ◽

2018 ◽

Cited By ~ 1

Author(s):

Ramin Sabbagh ◽

Farhad Ameri

Keyword(s):

Natural Language ◽

Language Processing ◽

Quantitative Methods ◽

Ad Hoc ◽

Semantic Analysis ◽

Reduction Process ◽

Manufacturing Companies ◽

Web Based ◽

Manufacturing Capability ◽

Processing Techniques

The descriptions of capabilities of manufacturing companies can be found in multiple locations including company websites, legacy system databases, and ad hoc documents and spreadsheets. The capability descriptions are often represented using natural language. To unlock the value of unstructured capability information and learn from it, there is a need for developing advanced quantitative methods supported by machine learning and natural language processing techniques. This research proposes a multi-step unsupervised learning methodology using K-means clustering and topic modeling techniques in order to build clusters of suppliers based on their capabilities, extract and organize the manufacturing capability terminology, and discover nontrivial patterns in manufacturing capability corpora. The capability data is extracted either directly from the website of manufacturing firms or from their profiles in e-sourcing portals and directories. Feature extraction and dimensionality reduction process in this work in supported by Ngram extraction and Latent Semantic Analysis (LSA) methods. The proposed clustering method is validated experimentally based a dataset composed of 150 capability descriptions collected from web-based sourcing directories such as the Thomas Net directory for manufacturing companies. The results of the experiment show that the proposed method creates supplier cluster with high accuracy.

Download Full-text

Spatio-temporal Semantic Analysis of Safety Production Accidents in Grain Depot based on Natural Language Processing

2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT) ◽

10.1109/wiiat50758.2020.00142 ◽

2020 ◽

Author(s):

Xie Wang ◽

Yun Cao ◽

Bo Mao

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Semantic Analysis ◽

Safety Production ◽

Spatio Temporal

Download Full-text

An Evaluation of Patient Safety Event Report Categories Using Unsupervised Topic Modeling

Methods of Information in Medicine ◽

10.3414/me15-01-0010 ◽

2015 ◽

Vol 54 (04) ◽

pp. 338-345 ◽

Cited By ~ 10

Author(s):

A. Fong ◽

R. Ratwani

Keyword(s):

Patient Safety ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Topic Modeling ◽

Free Text ◽

Event Data ◽

Event Type ◽

Modeling Approach ◽

Safety Event

SummaryObjective: Patient safety event data repositories have the potential to dramatically improve safety if analyzed and leveraged appropriately. These safety event reports often consist of both structured data, such as general event type categories, and unstructured data, such as free text descriptions of the event. Analyzing these data, particularly the rich free text narratives, can be challenging, especially with tens of thousands of reports. To overcome the resource intensive manual review process of the free text descriptions, we demonstrate the effectiveness of using an unsupervised natural language processing approach.Methods: An unsupervised natural language processing technique, called topic modeling, was applied to a large repository of patient safety event data to identify topics, or themes, from the free text descriptions of the data. Entropy measures were used to evaluate and compare these topics to the general event type categories that were originally assigned by the event reporter.Results: Measures of entropy demonstrated that some topics generated from the un-supervised modeling approach aligned with the clinical general event type categories that were originally selected by the individual entering the report. Importantly, several new latent topics emerged that were not originally identified. The new topics provide additional insights into the patient safety event data that would not otherwise easily be detected.Conclusion: The topic modeling approach provides a method to identify topics or themes that may not be immediately apparent and has the potential to allow for automatic reclassification of events that are ambiguously classified by the event reporter.

Download Full-text

Developing and Analyzing a Spanish Corpus for Forensic Purposes

Linguistic Evidence in Security Law and Intelligence ◽

10.5195/lesli.2019.19 ◽

2019 ◽

Vol 3 ◽

Author(s):

Ángela Almela ◽

Gema Alcaraz-Mármol ◽

Arancha García-Pinar ◽

Clara Pallejá

Keyword(s):

Data Collection ◽

Language Processing ◽

Ad Hoc ◽

Semantic Analysis ◽

Linguistic Features ◽

Natural Language Processing Tool ◽

Check Method ◽

Spanish Universities ◽

Gender Based ◽

Main Instrument

In this paper, the methods for developing a database of Spanish writing that can be used for forensic linguistic research are presented, including our data collection procedures. Specifically, the main instrument used for data collection has been translated into Spanish and adapted from Chaski (2001). It consists of ten tasks, by means of which the subjects are asked to write formal and informal texts about different topics. To date, 93 undergraduates from Spanish universities have already participated in the study and prisoners convicted of gender-based abuse have participated. A twofold analysis has been performed, since the data collected have been approached from a semantic and a morphosyntactic perspective. Regarding the semantic analysis, psycholinguistic categories have been used, many of them taken from the LIWC dictionary (Pennebaker et al., 2001). In order to obtain a more comprehensive depiction of the linguistic data, some other ad-hoc categories have been created, based on the corpus itself, using a double-check method for their validation so as to ensure inter-rater reliability. Furthermore, as regards morphosyntactic analysis, the natural language processing tool ALIAS TATTLER is being developed for Spanish. Results shows that is it possible to differentiate non-abusers from abusers with strong accuracy based on linguistic features.

Download Full-text

BUILD KNOWLEDGE GRAPH FROM HETEROGENEOUS DOCUMENTS

Journal of Science and Technology - IUH ◽

10.46242/jst-iuh.v47i05.761 ◽

2021 ◽

Vol 47 (05) ◽

Author(s):

NGUYỄN CHÍ HIẾU

Keyword(s):

Information Retrieval ◽

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Question Answering ◽

Semantic Analysis ◽

Knowledge Graph ◽

Question Answering Systems ◽

Knowledge Graphs

Knowledge Graphs are applied in many fields such as search engines, semantic analysis, and question answering in recent years. However, there are many obstacles for building knowledge graphs as methodologies, data and tools. This paper introduces a novel methodology to build knowledge graph from heterogeneous documents. We use the methodologies of Natural Language Processing and deep learning to build this graph. The knowledge graph can use in Question answering systems and Information retrieval especially in Computing domain

Download Full-text

Forty-two Million Ways to Describe Pain: Topic Modeling of 200,000 PubMed Pain-Related Abstracts Using Natural Language Processing and Deep Learning–Based Text Generation

Pain Medicine ◽

10.1093/pm/pnaa061 ◽

2020 ◽

Vol 21 (11) ◽

pp. 3133-3160

Author(s):

Patrick J Tighe ◽

Bharadwaj Sannapaneni ◽

Roger B Fillingim ◽

Charlie Doyle ◽

Michael Kent ◽

...

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Topic Modeling ◽

Taxonomic Structure ◽

Generation Model ◽

Pain Research ◽

Embedded Model ◽

Gated Recurrent Units

Abstract Objective Recent efforts to update the definitions and taxonomic structure of concepts related to pain have revealed opportunities to better quantify topics of existing pain research subject areas. Methods Here, we apply basic natural language processing (NLP) analyses on a corpus of >200,000 abstracts published on PubMed under the medical subject heading (MeSH) of “pain” to quantify the topics, content, and themes on pain-related research dating back to the 1940s. Results The most common stemmed terms included “pain” (601,122 occurrences), “patient” (508,064 occurrences), and “studi-” (208,839 occurrences). Contrarily, terms with the highest term frequency–inverse document frequency included “tmd” (6.21), “qol” (6.01), and “endometriosis” (5.94). Using the vector-embedded model of term definitions available via the “word2vec” technique, the most similar terms to “pain” included “discomfort,” “symptom,” and “pain-related.” For the term “acute,” the most similar terms in the word2vec vector space included “nonspecific,” “vaso-occlusive,” and “subacute”; for the term “chronic,” the most similar terms included “persistent,” “longstanding,” and “long-standing.” Topic modeling via Latent Dirichlet analysis identified peak coherence (0.49) at 40 topics. Network analysis of these topic models identified three topics that were outliers from the core cluster, two of which pertained to women’s health and obstetrics and were closely connected to one another, yet considered distant from the third outlier pertaining to age. A deep learning–based gated recurrent units abstract generation model successfully synthesized several unique abstracts with varying levels of believability, with special attention and some confusion at lower temperatures to the roles of placebo in randomized controlled trials. Conclusions Quantitative NLP models of published abstracts pertaining to pain may point to trends and gaps within pain research communities.

Download Full-text

Clinical Report Classification Using Natural Language Processing and Topic Modeling

2012 11th International Conference on Machine Learning and Applications ◽

10.1109/icmla.2012.173 ◽

2012 ◽

Cited By ~ 6

Author(s):

Efsun Sarioglu ◽

Hyeong-Ah Choi ◽

Kabir Yadav

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Topic Modeling ◽

Clinical Report

Download Full-text

Predicting Verification Methods from Natural Language Requirements

10.31224/osf.io/wxv9e ◽

2020 ◽

Author(s):

Michael Prendergast

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Semantic Analysis ◽

Nearest Neighbor ◽

Large Set ◽

Technical Requirements ◽

Verification Methods ◽

Natural Language Requirements ◽

Processing Algorithms

Abstract – A Verification Cross-Reference Matrix (VCRM) is a table that depicts the verification methods for requirements in a specification. Usually requirement labels are rows, available test methods are columns, and an “X” in a cell indicates usage of a verification method for that requirement. Verification methods include Demonstration, Inspection, Analysis and Test, and sometimes Certification, Similarity and/or Analogy. VCRMs enable acquirers and stakeholders to quickly understand how a product’s requirements will be tested.Maintaining consistency of very large VCRMs can be challenging, and inconsistent verification methods can result in a large set of uncoordinated “spaghetti tests”. Natural language processing algorithms that can identify similarities between requirements offer promise in addressing this challenge.This paper applies and compares compares four natural language processing algorithms to the problem of automatically populating VCRMs from natural language requirements: Naïve Bayesian inference, (b) Nearest Neighbor by weighted Dice similarity, (c) Nearest Neighbor with Latent Semantic Analysis similarity, and (d) an ensemble method combining the first three approaches. The VCRMs used for this study are for slot machine technical requirements derived from gaming regulations from the countries of Australia and New Zealand, the province of Nova Scotia (Canada), the state of Michigan (United States) and recommendations from the International Association of Gaming Regulators (IAGR).

Download Full-text

Mining Numbers in Text: A Survey

10.5772/intechopen.98540 ◽

2021 ◽

Author(s):

Minoru Yoshida ◽

Kenji Kita

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Ad Hoc ◽

Text Data ◽

Research Areas ◽

Recent Growth ◽

Recent Advances ◽

Almost All

Both words and numerals are tokens found in almost all documents but they have different properties. However, relatively little attention has been paid in numerals found in texts and many systems treated the numbers found in the document in ad-hoc ways, such as regarded them as mere strings in the same way as words, normalized them to zeros, or simply ignored them. Recent growth of natural language processing (NLP) research areas has change this situations and more and more attentions have been paid to the numeracy in documents. In this survey, we provide a quick overview of the history and recent advances of the research of mining such relations between numerals and words found in text data.

Download Full-text

Topic Modeling for Keyword Extraction: using Natural Language Processing methods for keyword extraction in Portal Min@s

Revista de Estudos da Linguagem ◽

10.17851/2237-2083.23.3.695-726 ◽

2015 ◽

Vol 23 (3) ◽

pp. 695 ◽

Cited By ~ 1

Author(s):

Arnaldo Candido Junior ◽

Célia Magalhães ◽

Helena Caseli ◽

Régis Zangirolami

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Topic Modeling ◽

Latent Dirichlet Allocation ◽

Keyword Extraction ◽

Processing Methods ◽

Dirichlet Allocation

Este artigo tem o objetivo da avaliar a aplicação de dois métodos automáticos eficientes na extração de palavras-chave, usados pelas comunidades da Linguística de Corpus e do Processamento da Língua Natural para gerar palavras-chave de textos literários: o WordSmith Tools e o Latent Dirichlet Allocation (LDA). As duas ferramentas escolhidas para este trabalho têm suas especificidades e técnicas diferentes de extração, o que nos levou a uma análise orientada para a sua performance. Objetivamos entender, então, como cada método funciona e avaliar sua aplicação em textos literários. Para esse fim, usamos análise humana, com conhecimento do campo dos textos usados. O método LDA foi usado para extrair palavras-chave por meio de sua integração com o Portal Min@s: Corpora de Fala e Escrita, um sistema geral de processamento de corpora, concebido para diferentes pesquisas de Linguística de Corpus. Os resultados do experimento confirmam a eficácia do WordSmith Tools e do LDA na extração de palavras-chave de um corpus literário, além de apontar que é necessária a análise humana das listas em um estágio anterior aos experimentos para complementar a lista gerada automaticamente, cruzando os resultados do WordSmith Tools e do LDA. Também indicam que a intuição linguística do analista humano sobre as listas geradas separadamente pelos dois métodos usados neste estudo foi mais favorável ao uso da lista de palavras-chave do WordSmith Tools.

Download Full-text