scholarly journals Multi - Class Document Classification: Effective and Systematized Method to Categorize Documents

Author(s):  
Kaushika Pal ◽  
Biraj V. Patel

A large section of World Wide Web is full of Documents, content; Data, Big data, unformatted data, formatted data, unstructured and unorganized data and we need information infrastructure, which is useful and easily accessible as an when required. This research work is combining approach of Natural Language Processing and Machine Learning for content-based classification of documents. Natural Language Processing is used which will divide the problem of understanding entire document at once into smaller chucks and give us only with useful tokens responsible for Feature Extraction, which is machine learning technique to create Feature Set which helps to train classifier to predict label for new document and place it at appropriate location. Machine Learning subset of Artificial Intelligence is enriched with sophisticated algorithms like Support Vector Machine, K – Nearest Neighbor, Naïve Bayes, which works well with many Indian Languages and Foreign Language content’s for classification. This Model is successful in classifying documents with more than 70% of accuracy for major Indian Languages and more than 80% accuracy for English Language.

Detecting the author of the sentence in a collective document can be done by choosing a suitable set of features and implementing using Natural Language Processing in Machine Learning. Training our machine is the basic idea to identify the author name of a specific sentence. This can be done by using 8 different NLP steps like applying stemming algorithm, finding stop-list words, preprocessing the data, and then applying it to a machine learning classifier-Support vector machine (SVM) which classify the dataset into a number of classes specifying the author of the sentence and defines the name of author for each and every sentence with an accuracy of 82%.This paper helps the readers who are interested in knowing the names of the authors who have written some specific words


Author(s):  
Ayushi Mitra

Sentiment analysis or Opinion Mining or Emotion Artificial Intelligence is an on-going field which refers to the use of Natural Language Processing, analysis of text and is utilized to extract quantify and is used to study the emotional states from a given piece of information or text data set. It is an area that continues to be currently in progress in field of text mining. Sentiment analysis is utilized in many corporations for review of products, comments from social media and from a small amount of it is utilized to check whether or not the text is positive, negative or neutral. Throughout this research work we wish to adopt rule- based approaches which defines a set of rules and inputs like Classic Natural Language Processing techniques, stemming, tokenization, a region of speech tagging and parsing of machine learning for sentiment analysis which is going to be implemented by most advanced python language.


Author(s):  
Ardianne Luthfika Fairuz ◽  
Rima Dias Ramadhani ◽  
Nia Annisa Ferani Tanjung

Akhir tahun 2019 lalu dunia digemparkan oleh munculnya suatu penyakit yang disebabkan oleh virus SARS-CoV-2 yang merupakan jenis virus terbaru dari coronavirus. Penyakit ini dikenal dengan nama COVID-19. Penyebaran penyakit ini terbilang cukup luas dan cepat. Dalam waktu singkat penyakit ini mulai menyebar ke segala penjuru dunia tak terkecuali Indonesia. Dengan tingkat penyebaran yang begitu tinggi dan belum ditemukannya vaksin untuk COVID-19, menyebabkan kekacauan di tengah masyarakat. Hal ini mempengaruhi banyak sektor kehidupan masyarakat. Tak sedikit masyarakat yang aktif bersosial media dan menuliskan pendapat, opini serta pemikirannya di platform media sosial seperti Twitter. Terjadinya pandemi ini mendorong masyarakat untuk menuliskan opini, pemikiran serta pendapatnya terhadap COVID-19 pada media sosial Twitter. Dibutuhkan suatu model sentiment analysis untuk mengklasifikasi tweet masyarakat di Twitter menjadi positif dan negatif. Sentiment analysis merupakan bagian dari Natural Language Processing yang membuat sebuah sistem guna mengenali serta mengekstraksi opini dalam  bentuk teks. Pada penelitian ini digunakan algoritma Naive Bayes dan K-Nearest Neighbor untuk digunakan dalam membangun model sentiment analysis terhadap tweet pengguna Twitter terhadap COVID-19. Didapatkan akurasi sebesar 85% untuk algoritma Naïve Bayes dan 82% untuk algoritma K-Nearest Neighbor pada nilai k=6, 8, dan 14.


2014 ◽  
Vol 8 (3) ◽  
pp. 227-235 ◽  
Author(s):  
Cíntia Matsuda Toledo ◽  
Andre Cunha ◽  
Carolina Scarton ◽  
Sandra Aluísio

Discourse production is an important aspect in the evaluation of brain-injured individuals. We believe that studies comparing the performance of brain-injured subjects with that of healthy controls must use groups with compatible education. A pioneering application of machine learning methods using Brazilian Portuguese for clinical purposes is described, highlighting education as an important variable in the Brazilian scenario.OBJECTIVE: The aims were to describe how to: (i) develop machine learning classifiers using features generated by natural language processing tools to distinguish descriptions produced by healthy individuals into classes based on their years of education; and (ii) automatically identify the features that best distinguish the groups.METHODS: The approach proposed here extracts linguistic features automatically from the written descriptions with the aid of two Natural Language Processing tools: Coh-Metrix-Port and AIC. It also includes nine task-specific features (three new ones, two extracted manually, besides description time; type of scene described - simple or complex; presentation order - which type of picture was described first; and age). In this study, the descriptions by 144 of the subjects studied in Toledo18 were used, which included 200 healthy Brazilians of both genders.RESULTS AND CONCLUSION:A Support Vector Machine (SVM) with a radial basis function (RBF) kernel is the most recommended approach for the binary classification of our data, classifying three of the four initial classes. CfsSubsetEval (CFS) is a strong candidate to replace manual feature selection methods.


2020 ◽  
Vol 41 (Supplement_2) ◽  
Author(s):  
P Brekke ◽  
I Pilan ◽  
H Husby ◽  
T Gundersen ◽  
F.A Dahl ◽  
...  

Abstract Background Syncope is a commonly occurring presenting symptom in emergency departments. While the majority of episodes are benign, syncope is associated with worse prognosis in hypertrophic cardiomyopathy, arrhythmia syndromes, heart failure, aortic stenosis and coronary heart disease. Flagging documented syncope in these patients may be crucial to management decisions. Previous studies show that the International Classification of Diseases (ICD) codes for syncope have a sensitivity of around 0.63, leading to a large number of false negatives if patient identification is based on administrative codes. Thus, in order to provide data-driven, clinical decision support, and to improve identification of patient cohorts for research, better tools are needed. A recent study manually annotated more than 30.000 patient records in order to develop a natural language processing (NLP) tool, which achieved a sensitivity of 92.2%. Since access to medical records and annotation resources is limited, we aimed to investigate whether an unsupervised machine learning and NLP approach with no manual input could achieve similar performance. Methods Our data was admission notes for adult patients admitted between 2005 and 2016 at a large university hospital in Norway. 500 records from patients with, and 500 without a “R55 Syncope” ICD code at discharge were drawn at random. R55 code was considered “ground truth”. Headers containing information about tentative diagnoses were removed from the notes, when present, using regular expressions. The dataset was divided into 70%/15%/15% subsets for training, validation and testing. Baseline identification was calculated by a simple lexical matching using the term “synkope”. We evaluated two linear classifiers, a Support Vector Machine (SVM) and a Linear Regression (LR) model, with a term frequency–inverse document frequency vectorizer, using a bag-of-words approach. In addition, we evaluated a simple convolutional neural network (CNN) consisting of a convolutional layer concatenating filter sizes of 3–5, max pooling and a dropout of 0.5 with randomly initialised word embeddings of 300 dimensions. Results Even a baseline regular expression model achieved a sensitivity of 78% and a specificity of 91% when classifying admission notes as belonging to the syncope class or not. The SVM model and the LR model achieved a sensitivity of 91% and 89%, respectively, and a specificity of 89% and 91%. The CNN model had a sensitivity of 95% and a specificity of 84%. Conclusion With a limited non-English dataset, common NLP and machine learning approaches were able to achieve approximately 90–95% sensitivity for the identification of admission notes related to syncope. Linear classifiers outperformed a CNN model in terms of specificity, as expected in this small dataset. The study demonstrates the feasibility of training document classifiers based on diagnostic codes in order to detect important clinical events. ROC curves for SVM and LR models Funding Acknowledgement Type of funding source: Public grant(s) – National budget only. Main funding source(s): The Research Council of Norway


2020 ◽  
Vol 132 (4) ◽  
pp. 738-749 ◽  
Author(s):  
Michael L. Burns ◽  
Michael R. Mathis ◽  
John Vandervest ◽  
Xinyu Tan ◽  
Bo Lu ◽  
...  

Abstract Background Accurate anesthesiology procedure code data are essential to quality improvement, research, and reimbursement tasks within anesthesiology practices. Advanced data science techniques, including machine learning and natural language processing, offer opportunities to develop classification tools for Current Procedural Terminology codes across anesthesia procedures. Methods Models were created using a Train/Test dataset including 1,164,343 procedures from 16 academic and private hospitals. Five supervised machine learning models were created to classify anesthesiology Current Procedural Terminology codes, with accuracy defined as first choice classification matching the institutional-assigned code existing in the perioperative database. The two best performing models were further refined and tested on a Holdout dataset from a single institution distinct from Train/Test. A tunable confidence parameter was created to identify cases for which models were highly accurate, with the goal of at least 95% accuracy, above the reported 2018 Centers for Medicare and Medicaid Services (Baltimore, Maryland) fee-for-service accuracy. Actual submitted claim data from billing specialists were used as a reference standard. Results Support vector machine and neural network label-embedding attentive models were the best performing models, respectively, demonstrating overall accuracies of 87.9% and 84.2% (single best code), and 96.8% and 94.0% (within top three). Classification accuracy was 96.4% in 47.0% of cases using support vector machine and 94.4% in 62.2% of cases using label-embedding attentive model within the Train/Test dataset. In the Holdout dataset, respective classification accuracies were 93.1% in 58.0% of cases and 95.0% among 62.0%. The most important feature in model training was procedure text. Conclusions Through application of machine learning and natural language processing techniques, highly accurate real-time models were created for anesthesiology Current Procedural Terminology code classification. The increased processing speed and a priori targeted accuracy of this classification approach may provide performance optimization and cost reduction for quality improvement, research, and reimbursement tasks reliant on anesthesiology procedure codes. Editor’s Perspective What We Already Know about This Topic What This Article Tells Us That Is New


Author(s):  
Piotr Tynecki ◽  
Arkadiusz Guziński ◽  
Joanna Kazimierczak ◽  
Michał Jadczuk ◽  
Jarosław Dastych ◽  
...  

AbstractBackgroundAs antibiotic resistance is becoming a major problem nowadays in a treatment of infections, bacteriophages (also known as phages) seem to be an alternative. However, to be used in a therapy, their life cycle should be strictly lytic. With the growing popularity of Next Generation Sequencing (NGS) technology, it is possible to gain such information from the genome sequence. A number of tools are available which help to define phage life cycle. However, there is still no unanimous way to deal with this problem, especially in the absence of well-defined open reading frames. To overcome this limitation, a new tool is definitely needed.ResultsWe developed a novel tool, called PhageAI, that allows to access more than 10 000 publicly available bacteriophages and differentiate between their major types of life cycles: lytic and lysogenic. The tool included life cycle classifier which achieved 98.90% accuracy on a validation set and 97.18% average accuracy on a test set. We adopted nucleotide sequences embedding based on the Word2Vec with Ship-gram model and linear Support Vector Machine with 10-fold cross-validation for supervised classification. PhageAI is free of charge and it is available at https://phage.ai/. PhageAI is a REST web service and available as Python package.ConclusionsMachine learning and Natural Language Processing allows to extract information from bacteriophages nucleotide sequences for lifecycle prediction tasks. The PhageAI tool classifies phages into either virulent or temperate with a higher accuracy than any existing methods and shares interactive 3D visualization to help interpreting model classification results.


Classification is a form of data mining (regarding machine learning) approach that is helpful in the prediction of group membership for data instances, where the data input is used by the computer program for learning and thereafter this learning is used for classifying the fresh observation made. This data set might just be bi-class or it can be multi-class also. Few instances of the problems in classification include: speech identification, handwriting identification, bio metric detection, document classification etc. Many classification methods exist, which can be utilized for classification. In this research work, the fundamental classification approaches and few important kinds of classification approaches that include decision tree induction, Bayesian networks,k-nearest neighbor classifier and Support Vector Machines (SVM) and fuzzy learning classifiers with their merits, drawbacks, probable applications and challenges faced with the solution available. There are different problems that have an effect on the classification and prediction. The objective of this research work is to render an extensive review of various classification approaches in machine learning. At last, the future work intended on the best classification techniques for the input data are discussed.


2021 ◽  
Vol 11 (2) ◽  
pp. 15-23
Author(s):  
Sabrina Jahan Maisha ◽  
Nuren Nafisa ◽  
Abdul Kadar Muhammad Masum

We can state undoubtedly that Bangla language is rich enough to work with and implement various Natural Language Processing (NLP) tasks. Though it needs proper attention, hardly NLP field has been explored with it. In this age of digitalization, large amount of Bangla news contents are generated in online platforms. Some of the contents are inappropriate for the children or aged people. With the motivation to filter out news contents easily, the aim of this work is to perform document level sentiment analysis (SA) on Bangla online news. In this respect, the dataset is created by collecting news from online Bangla newspaper archive.  Further, the documents are manually annotated into positive and negative classes. Composite process technique of “Pipeline” class including Count Vectorizer, transformer (TF-IDF) and machine learning (ML) classifiers are employed to extract features and to train the dataset. Six supervised ML classifiers (i.e. Multinomial Naive Bayes (MNB), K-Nearest Neighbor (K-NN), Random Forest (RF), (C4.5) Decision Tree (DT), Logistic Regression (LR) and Linear Support Vector Machine (LSVM)) are used to analyze the best classifier for the proposed model. There has been very few works on SA of Bangla news. So, this work is a small attempt to contribute in this field. This model showed remarkable efficiency through better results in both the validation process of percentage split method and 10-fold cross validation. Among all six classifiers, RF has outperformed others by 99% accuracy. Even though LSVM has shown lowest accuracy of 80%, it is also considered as good output. However, this work has also exhibited surpassing outcome for recent and critical Bangla news indicating proper feature extraction to build up the model.


Sign in / Sign up

Export Citation Format

Share Document