Unsupervised Text Feature Extraction for Academic Chatbot using Constrained FP-Growth

2021 ◽  
Vol 14 ◽  
pp. 1-11
Author(s):  
Suraya Alias

In an age where conversation mostly involves online chatting and texting, automated conversational agents are needed to support repetitive tasks such as answering FAQs, customer service, and product recommendations. One of the key challenges is to identify and discover the user’s intention in a social conversation; our work focuses on the academic domain. Our unsupervised text feature extraction method for Intent Pattern Discovery is developed by applying text feature constraints to the FP-Growth technique. The academic corpus was built from a chat message dataset in which conversations between students and academicians regarding undergraduate and postgraduate queries were extracted as text features for our model. We compared our new Constrained Frequent Intent Pattern (cFIP) model with the N-gram model in terms of feature-vector size reduction, descriptive intent discovery, and analysis of cFIP rules. Our findings show that significant and descriptive intent patterns were discovered, with a rule confidence value of 0.9 for cFIP of 3-sequence. We report an average feature-vector size reduction of 76% compared to the Bigram model using both the undergraduate and postgraduate conversation datasets. Usability testing showed an overall average user satisfaction score of 4.30 out of 5 for the Academic chatbot, which supports our cFIP intent discovery approach.
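The rule confidence mentioned in this abstract can be illustrated with a minimal sketch. It is not the cFIP model: it uses unordered itemsets rather than constrained sequences, and the toy messages and the names `support` and `confidence` are assumptions made for illustration only. It uses the standard definition confidence(A -> B) = support(A and B) / support(A).

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    base = support(antecedent, transactions)
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / base if base else 0.0


# Hypothetical tokenized chat messages (not taken from the paper's corpus).
messages = [
    ["apply", "postgraduate", "deadline"],
    ["apply", "postgraduate", "fee"],
    ["apply", "undergraduate", "deadline"],
    ["postgraduate", "deadline"],
]

rule_conf = confidence({"apply", "postgraduate"}, {"deadline"}, messages)
print(rule_conf)  # 0.5: one of the two "apply postgraduate" messages mentions a deadline
```

A rule with confidence 0.9, as reported in the abstract, would mean that 90% of the messages matching the antecedent pattern also contain the consequent.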

2014 ◽  
Vol 1046 ◽  
pp. 444-448 ◽  
Author(s):  
Lu Chen ◽  
Tao Zhang ◽  
Yuan Yuan Ma ◽  
Cheng Zhou

With the rapid development of Internet and information technology, large volumes of document data have emerged, and text classification techniques for handling massive amounts of data are becoming increasingly important. This paper presents a distributed text feature extraction method based on the MapReduce distributed computing model. The method addresses the text size limits and inadequate performance encountered in mass text processing, and offers a new line of thinking for research on text feature extraction.
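The MapReduce model named in this abstract can be sketched in a single process as follows. This is only an illustration of the map, shuffle, and reduce stages applied to term counting, not the paper's method; the documents and function names are hypothetical, and a real deployment would run the stages on a distributed framework.

```python
from collections import defaultdict


def map_phase(text):
    """Mapper: emit a (term, 1) pair for every token in one document."""
    return [(term, 1) for term in text.lower().split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducer: sum the counts collected for each term."""
    return {term: sum(counts) for term, counts in grouped.items()}


# Hypothetical documents used only for illustration.
docs = {
    "d1": "text feature extraction",
    "d2": "distributed text processing",
}

pairs = [pair for text in docs.values() for pair in map_phase(text)]
term_counts = reduce_phase(shuffle(pairs))
print(term_counts["text"])  # 2
```

Because each mapper only sees one document and each reducer only sees one term, both stages can be spread across machines, which is what removes the single-machine size limit.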


2014 ◽  
Vol 568-570 ◽  
pp. 668-671
Author(s):  
Yi Long ◽  
Fu Rong Liu ◽  
Guo Qing Qiu

To address the problems that the feature vector extracted by Local Binary Pattern (LBP) for face recognition has too high a dimension and that the features extracted by Principal Component Analysis (PCA) are not the best features for classification, this paper introduces an efficient feature extraction method using LBP, PCA and Maximum Scatter Difference (MSD). The original face image is first divided into sub-images, and the LBP operator is applied to extract the histogram feature. The feature dimensions are then further reduced using PCA. Finally, MSD is performed on the reduced PCA-based feature. Experimental results on the ORL and Yale databases demonstrate that the proposed method classifies more effectively and achieves a higher recognition rate than traditional recognition methods.
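The LBP histogram step can be illustrated with a minimal sketch of the basic 3x3 LBP operator. The neighbour ordering (clockwise from the top-left pixel) and the toy patch are assumptions; the paper may use a different LBP variant, and the PCA and MSD stages are not shown.

```python
# Neighbour offsets, clockwise from the top-left pixel (an assumed ordering).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """Basic 8-neighbour LBP code of the pixel at (r, c)."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        # A neighbour at least as bright as the centre sets its bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


# Hypothetical 3x3 grayscale patch.
patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]

print(lbp_code(patch, 1, 1))  # 241
```

Concatenating one 256-bin histogram per sub-image is what produces the high-dimensional vector that the abstract then reduces with PCA.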


2019 ◽  
Vol 2019 ◽  
pp. 1-8 ◽  
Author(s):  
Yuntao Zhao ◽  
Bo Bo ◽  
Yongxin Feng ◽  
ChunYu Xu ◽  
Bo Yu

With the explosive growth of malware, Internet users face enormous threats from cyberspace, known as the “fifth dimensional space.” Meanwhile, the increasingly sophisticated metamorphism of malware, through techniques such as polymorphism and obfuscation, makes it more difficult to detect malicious behavior. In this paper, based on dynamic feature analysis of malware, a novel hybrid gram (H-gram) feature extraction method using the cross entropy of continuous overlapping subsequences is proposed, which performs semantic segmentation of a sequence of API calls or instructions. The experimental results show that the H-gram method can distinguish malicious behaviors and is more effective than the fixed-length n-gram on all four performance indexes for classification algorithms such as ID3, Random Forest, AdaBoostM1, and Bagging.
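For context, the fixed-length n-gram baseline that H-gram is compared against can be sketched as below. This is the baseline only, not the H-gram method; the API call trace is a made-up example and `ngrams` is a hypothetical helper.

```python
from collections import Counter


def ngrams(sequence, n):
    """All contiguous, overlapping subsequences of length n."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


# Hypothetical API call trace used only for illustration.
api_calls = [
    "OpenFile", "ReadFile", "WriteFile",
    "OpenFile", "ReadFile", "CloseHandle",
]

bigram_counts = Counter(ngrams(api_calls, 2))
print(bigram_counts[("OpenFile", "ReadFile")])  # 2
```

The window length n is fixed in advance here, which is the limitation that a variable-length segmentation such as H-gram is meant to address.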


Entropy ◽  
2019 ◽  
Vol 21 (3) ◽  
pp. 235 ◽  
Author(s):  
Hong Yang ◽  
Ke Zhao ◽  
Guohui Li

The complexity of the sea environment and of underwater acoustic channels makes it hard to extract features of ship-radiated noise signals. This paper presents a novel feature extraction method that combines the advantages of variational mode decomposition (VMD), fluctuation-based dispersion entropy (FDE) and the self-organizing feature map (SOM). First, VMD is applied to the original signal to obtain a group of bandwidth-limited intrinsic mode functions (IMFs). Then, the difference between the FDE of each IMF and that of the original signal is calculated; the IMF with the smallest difference (SIMF) is selected, and its FDE is used as the feature vector. Finally, the feature vectors are sent to the SOM classifier to categorize the original signal. The proposed method is applied to feature extraction of real ship-radiated noise signals. The results show that this method extracts features of ship-radiated noise signals more precisely.


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Jing Li

Music is an abstract art form that uses sound as its means of expression, and it has deeply affected our lives. This paper proposes a method for extracting segment features from nonmultiple cluster music files. We divide each piece of music into multiple segments and extract the features of each segment. The process includes note extraction from nonmultiple cluster music files, main melody extraction, segment division, and segment feature extraction. A segment feature is extracted from one segment of a piece of music, contains the main melody and accompaniment information of that segment, and reflects the sequence relationship of the notes. This paper also proposes a performance style conversion network based on recurrent and convolutional neural networks. A bidirectional recurrent neural network based on the Gated Recurrent Unit (GRU) is used to extract note feature vector sequences for different styles. The extracted sequences are used to predict the intensity of a specific style, so that the intensity changes of different styles of nonmultiple cluster music are learned better. After comparison, the “one-to-the-rest” multiclassification strategy is selected, and a fuzzy recurrent neural network is applied to address the shortcoming of unrecognizable areas. Finally, based on the feature extraction method and the classifier algorithm studied in this paper, a music style classification system is implemented in the MATLAB environment. Experimental simulation shows that this system can effectively classify music performance styles.


This paper presents a feature extraction method for an optical Braille recognition (OBR) system that locates, extracts and converts the Braille cells in one-sided Indian language Braille documents. The Braille cells are located using a grid-box designed from the physical properties of a Braille cell. A Braille document image is a compilation of groups of six dots. The physical position of each dot, and its relation to the neighboring dots in the same cell, determines the Braille character. After the grid-box is mapped onto the Braille cells in the document, the mesh characters are extracted and then matched against an existing database to translate them into the required text. Mapping the Braille cells to the grid-box and separating characters and words in a Braille document were challenging tasks. Unwanted or degraded dots may result in incorrect mapping of characters. In this paper we use N-gram language models to predict the word sequence when characters are wrongly mapped during extraction and conversion of the Braille cells.
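The idea that the positions of the raised dots in a six-dot cell determine the character can be sketched as follows. This is a hypothetical illustration, not the paper's database lookup: it encodes a cell as a bit mask and maps it to the Unicode Braille Patterns block (base U+2800, where dot n sets bit n-1) instead of to Indian language text.

```python
def cell_to_mask(dots):
    """Encode the set of raised dots (numbered 1 to 6) as a 6-bit mask."""
    mask = 0
    for dot in dots:
        if not 1 <= dot <= 6:
            raise ValueError(f"dot number out of range: {dot}")
        mask |= 1 << (dot - 1)
    return mask


def cell_to_unicode(dots):
    """Map a six-dot cell to its Unicode Braille pattern character."""
    return chr(0x2800 + cell_to_mask(dots))


# Dots 1 and 4 raised: the top row of the cell.
print(hex(ord(cell_to_unicode({1, 4}))))  # 0x2809
```

In a recognition system, a lookup table keyed by such a mask could stand in for the database matching step; a degraded or spurious dot changes the mask, which is why the abstract adds a language model to correct mis-mapped characters.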


2014 ◽  
Vol 10 (1) ◽  
pp. 155
Author(s):  
La Ode Hasnuddin Sagala ◽  
Agus Harjoko

Abstract: In a recognition system, the choice of feature extraction method and the feature size used affect identification accuracy. This study compares three CBIR feature extraction methods: row mean image, full image, and blocks image. The three methods are used to identify speakers, with emphasis on the size of the selected feature vector. Voice data were obtained from recordings made with a mobile phone. The recordings come from 10 speakers, 5 men and 5 women. Each speaker uttered five sentences (Selamat Pagi, Selamat Siang, Selamat Sore, Selamat Malam, and Dengan Siapa), each repeated eight times. Because CBIR methods are applied, the recorded speech signal is converted into a spectrogram image using the STFT. The Kekre transform is then applied to the spectrogram, and its features are extracted. The Kekre transform is used to select the most promising features and to lighten the computational load. With 250 reference spectrogram images and 150 test spectrogram images, the full image feature extraction method achieved the highest identification rate, 93.3%, with a feature size of 32x32. Keywords: Speaker identification, Spectrogram, Kekre Transform, Full Image, Blocks Image, Row Mean Image


Author(s):  
Niloufar Shoeibi ◽  
Nastaran Shoeibi ◽  
Pablo Chamoso ◽  
Zakie AlizadehSani ◽  
Juan M. Corchado

Social media platforms have been an undeniable part of everyday life for the past decade. Analyzing the information being shared is a crucial step in understanding human behavior. Social media analysis aims to guarantee a better experience for the user and to increase user satisfaction. First, however, it is necessary to know how, and in which aspects, to compare users. In this paper, an intelligent system is proposed to measure the similarity of Twitter profiles. The timeline of each profile is first extracted using the official Twitter API, and all of this information is given to the proposed system. Three aspects of a profile are then derived in parallel. Behavioral ratios are time-series information showing the consistency and habits of the user; dynamic time warping is used to compare the behavioral ratios of two profiles. Next, the audience network is extracted for each user, and Jaccard similarity is used to estimate the similarity of the two sets. Finally, for content similarity, the tweets are preprocessed according to the feature extraction method; TF-IDF and DistilBERT are used for feature extraction, and the results are compared using cosine similarity. Results show that TF-IDF performs slightly better; therefore, this more straightforward solution is selected for the model. In the case study, a Random Forest classification model trained on almost 20,000 users achieved 97.24% accuracy. This comparison enables us to find duplicate profiles with nearly the same behavior and content.
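The three similarity measures named in this abstract can be sketched with textbook definitions. This is a minimal illustration, not the authors' system: the function names are hypothetical, the DTW uses absolute difference as the local cost with no window constraint, and the inputs are toy data.

```python
import math


def jaccard(a, b):
    """Jaccard similarity of two sets: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dtw_distance(s, t):
    """Dynamic time warping distance between two numeric series."""
    inf = float("inf")
    cost = [[inf] * (len(t) + 1) for _ in range(len(s) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            local = abs(s[i - 1] - t[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[len(s)][len(t)]


# Toy inputs: follower sets, term-weight vectors, and activity series.
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))      # 0.0
```

Jaccard suits the audience network because it compares sets of different sizes, cosine suits TF-IDF or embedding vectors because it ignores vector length, and DTW tolerates behavioral series that are shifted or stretched in time.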


2017 ◽  
Vol 17 (02) ◽  
pp. 1750012 ◽  
Author(s):  
Mohammad Javad Parseh ◽  
Mojtaba Meftahi

Feature extraction is one of the most important steps in Optical Character Recognition (OCR) systems and strongly affects recognition accuracy. In this paper, a suitable combination of different features, such as zoning, hole size, and crossing counts, is proposed for Persian handwritten digit recognition. Because of the large number of features, the feature vector has a high dimension, which increases training time exponentially. To solve this problem, Principal Component Analysis (PCA) is employed to reduce the feature vector dimensions. Finally, the data are classified using a Support Vector Machine (SVM). The proposed method was evaluated on the HODA dataset, one of the largest standard datasets of Persian handwritten digits, which includes 60,000 training and 20,000 test samples. The proposed method reaches 99.07% accuracy on this dataset, and the experimental results show a significant improvement in the accuracy of Persian handwritten OCR compared to previous methods.
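Zoning, one of the features listed in this abstract, can be sketched as below. This is a common formulation (pixel density per grid zone) and may differ from the paper's exact definition; the function name and the 4x4 toy image are assumptions, and the sketch assumes a square binary image whose side is divisible by the number of zones per side.

```python
def zoning_features(image, zones_per_side):
    """Fraction of foreground pixels in each zone of a square binary image."""
    size = len(image)
    step = size // zones_per_side
    features = []
    for zone_row in range(zones_per_side):
        for zone_col in range(zones_per_side):
            total = sum(
                image[r][c]
                for r in range(zone_row * step, (zone_row + 1) * step)
                for c in range(zone_col * step, (zone_col + 1) * step)
            )
            features.append(total / (step * step))
    return features


# Hypothetical 4x4 binary digit image (1 = ink).
digit = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
]

features = zoning_features(digit, 2)
print(features)  # [0.5, 0.5, 0.0, 0.5]
```

Finer grids give more zones and therefore longer feature vectors, which is the dimensionality growth that the abstract addresses with PCA.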

