Evolution Maps for Connected Components in Text Documents

Author(s):  
Ofer Biller ◽  
Klara Kedem ◽  
Itshak Dinstein ◽  
Jihad El-Sana
2016 ◽  
Vol 2 ◽  
pp. e39 ◽  
Author(s):  
Ofer Biller ◽  
Irina Rabaev ◽  
Klara Kedem ◽  
Its’hak Dinstein ◽  
Jihad J. El-Sana

Common tasks in document analysis, such as binarization and line extraction, are still considered difficult for highly degraded text documents. Having reliable fundamental information about the characters of the document, such as the distribution of character dimensions and stroke width, can significantly improve the performance of these tasks. We introduce a novel perspective on the image data that maps the evolution of connected components as the gray-scale threshold changes. The maps reveal significant information about the sets of elements in the document, such as characters, noise, stains, and words. This information is further employed to improve a state-of-the-art binarization algorithm and to achieve automatic character size estimation, line extraction, stroke width estimation, and feature distribution analysis, all of which are hard tasks for highly degraded documents.
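The idea of following connected components across gray-scale thresholds can be illustrated with a small sketch. This is a simplified illustration, not the authors' evolution map: it tracks only the number of components at each threshold, and the names `count_components`, `component_counts`, and the toy image are assumptions made for this example.

```python
from collections import deque


def count_components(image, threshold):
    """Count 8-connected components of pixels with gray value <= threshold."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] > threshold or seen[r][c]:
                continue
            # New component found: flood-fill it with a breadth-first search.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and not seen[ny][nx]
                                and image[ny][nx] <= threshold):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def component_counts(image, thresholds):
    """Map each threshold to the number of dark components it produces."""
    return {t: count_components(image, t) for t in thresholds}


# Hypothetical toy image: two dark strokes (10 and 20) joined by a faint stain (200).
toy = [
    [255, 255, 255, 255, 255],
    [255,  10, 200,  20, 255],
    [255,  10, 200,  20, 255],
    [255, 255, 255, 255, 255],
]
counts = component_counts(toy, [5, 50, 220, 255])
print(counts)  # {5: 0, 50: 2, 220: 1, 255: 1}
```

In the toy image the two strokes appear as separate components at a low threshold and merge once the threshold passes the gray value of the stain between them. Changes of this kind are what a threshold-indexed view of the components makes visible.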


2011 ◽  
Vol 341-342 ◽  
pp. 565-569 ◽  
Author(s):  
Ahmed Dena Rafaa ◽  
Jan Nordin

One of the most important applications of Pattern Recognition (PR) today is Optical Character Recognition (OCR), a system that converts scanned printed or handwritten image files into a machine-readable and editable format such as a text document. The main motivation behind this study is to build an OCR system for offline machine-printed Turkish characters that converts an image file into a readable and editable format. The system starts with a preprocessing step that converts the image file into a binary format with less noise, ready for recognition. The preprocessing step includes digitization, binarization, thresholding, and noise removal. Next, the horizontal projection method is used for line detection and word allocation, and an 8-connected neighbors schema is used to extract characters as a set of connected components. Then, template matching is used to match the segmented characters against the template set stored in the OCR database in order to recognize the text. Unlike other approaches, template matching takes less time and does not require sample training, but it cannot recognize some letters with similar shapes or combined letters. For this reason, the OCR system combines template matching with the size feature of the segmented characters to achieve accurate results. Finally, upon successful recognition, the recognized patterns are displayed in Notepad as readable and editable text. The Turkish machine-printed database consists of a list of 630 names of cities in Turkey written in the Arial font at different sizes, in uppercase, in lowercase, and with the first character of each word capitalized. The results show that the accuracy of the proposed OCR system ranges from 96% to 100%.
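Line detection by horizontal projection can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described in the abstract: the input is assumed to be a binary image with 1 for ink and 0 for background, and the function name `find_text_lines` and the sample page are hypothetical.

```python
def find_text_lines(binary):
    """Return inclusive (start_row, end_row) pairs for runs of rows that contain ink."""
    # Horizontal projection profile: amount of ink in each row.
    profile = [sum(row) for row in binary]
    lines, start = [], None
    for r, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = r                      # a text line begins
        elif ink == 0 and start is not None:
            lines.append((start, r - 1))   # an empty row ends the line
            start = None
    if start is not None:                  # the last line touches the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
lines = find_text_lines(page)
print(lines)  # [(1, 2), (4, 4)]
```

The same routine applied to the columns of a single line (a vertical projection) gives candidate word or character boundaries.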


Author(s):  
Alicia Fornés ◽  
Josep Lladós ◽  
Gemma Sánchez ◽  
Horst Bunke

Writer identification in handwritten text documents is an active area of study, whereas identifying the writer of graphical documents is still a challenge. The main objective of this work is the identification of the writer in old music scores, as an example of graphic documents. The proposed writer identification framework combines three different writer identification approaches. The first is based on two symbol recognition methods that are robust to hand-drawn distortions. The second generates music lines and extracts information about the slant, the width of the writing, connected components, contours, and fractals. The third generates music texture images and computes textural features. The high identification rates obtained demonstrate the suitability of the proposed ensemble architecture. To the best of our knowledge, this work is the first contribution on writer identification from images containing graphical languages.


Recognizing characters in scanned and ancient text documents is not easy because the characters may be broken and unclear. Much research has been carried out on recognizing such broken characters. In this paper we describe a new broken-character recognition method for English text documents only. The proposed method is a hybrid approach that uses connected component concepts and a convolutional neural network to identify the broken characters. The input is a scanned or ancient text document containing unclear text that is difficult to recognize; the proposed method recognizes these characters with greater accuracy and returns the recognized characters to the user. The proposed technique attains a precision of up to 92% in recognition.


Historical documents contain valuable heritage information. These documents are preserved in manuscript preservation centers and archaeological departments. They are mostly degraded, so their contents are hard to read and understand. There is therefore a need for text segmentation and feature extraction to convert these manuscripts into a machine-editable format. In this work, we present an effective way to segment historical document images into characters. The segmentation is challenging because of complex background images. Horizontal histograms, vertical histograms, and connected component analysis are used to segment the text document images. In this algorithm, the input image is converted to a gray-scale image, the gray-scale image is converted into a binary image (Otsu's method), and all objects containing fewer than the desired number of pixels are removed. Line segmentation and word segmentation are implemented using the horizontal and vertical histogram methods, respectively. The connected components are then labeled and properties are measured for the image regions. Connected component analysis is used to segment the characters, and the individual characters are extracted. The simulation results show that the proposed segmentation method achieves an average accuracy of 93.37% on the HDLAC 2011 dataset. Moreover, the method is more efficient and more suitable for real-time tasks.
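The binarization step named above, Otsu's method, picks the gray level that maximizes the between-class variance of the two resulting pixel classes. The sketch below is a plain-Python illustration of that criterion, not the paper's implementation; the function name `otsu_threshold` and the sample pixel list are assumptions for this example, and pixel values are assumed to be integers in 0..255.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t maximizing between-class variance (class 0: value <= t)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray values in the dark class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # bright class empty: no further split possible
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


# Hypothetical sample: 50 dark ink pixels and 50 bright background pixels.
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
ink = [p <= t for p in pixels]   # True marks a pixel classified as ink
```

With a clearly bimodal sample like this one, any threshold between the two modes separates ink from background. On degraded documents the histogram is rarely this clean, which is why small-object removal follows the binarization.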


2019 ◽  
Vol 8 (3) ◽  
pp. 6634-6643 ◽  

Opinion mining and sentiment analysis are valuable for extracting useful subjective information from text documents. Predicting customers' opinions on Amazon products has several benefits, such as reducing customer churn, agent monitoring, handling multiple customers, tracking overall customer satisfaction, quick escalations, and upselling opportunities. However, finding users' sentiments in large datasets is a challenging task because of the unstructured nature of the data, slang, misspellings, and abbreviations. To address this problem, a new system is developed in this research study. The proposed system comprises four major phases: data collection, pre-processing, keyword extraction, and classification. Initially, the input data were collected from the Amazon customer review dataset. After data collection, pre-processing was carried out to enhance the quality of the collected data. The pre-processing phase comprises three steps: lemmatization, review spam detection, and removal of stop-words and URLs. Then, an effective topic modelling approach, Latent Dirichlet Allocation (LDA), together with modified Possibilistic Fuzzy C-Means (PFCM), was applied to extract the keywords and to help identify the relevant topics. The extracted keywords were classified into three classes (positive, negative, and neutral) by applying an effective machine learning classifier, a Convolutional Neural Network (CNN). The experimental outcome showed that the proposed system improved the accuracy of sentiment analysis by 6-20% compared with existing systems.


The movement of an unmanned aerial vehicle along the glide path during landing on an aircraft carrier is investigated. The task is carried out under conditions of radio silence on the aircraft carrier. An algorithm for processing information from an optical landing system installed on the aircraft carrier is developed. The color signal recognition algorithm uses preliminary processing of the image frame with a downsample function that performs decimation, the HSV model, Otsu's method for calculating the binarization threshold of a halftone image, and the Two-Pass method for separating connected components.
Keywords: unmanned aerial vehicle; aircraft carrier; approach; glide path; optical landing system; color signal recognition algorithm; decimation; connected components; halftone image binarization
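The two-pass approach to connected components mentioned in this abstract can be sketched as below. This is a generic textbook-style version using 4-connectivity and a union-find table, not the authors' code; the function name `two_pass_label` and the sample grid are assumptions for this example.

```python
def two_pass_label(binary):
    """Label 4-connected foreground regions; return (label image, region count)."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # union-find table over provisional labels; index 0 is background

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # First pass: assign provisional labels and record equivalences.
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            neighbors = [lab for lab in (up, left) if lab]
            if not neighbors:
                parent.append(len(parent))       # create a new provisional label
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = min(neighbors)
                if len(neighbors) == 2:
                    union(up, left)              # both neighbors belong together

    # Second pass: replace each provisional label by a compact final label.
    remap = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = remap.setdefault(root, len(remap) + 1)
    return labels, len(remap)


grid = [
    [1, 0, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 0, 1],
]
labeled, n_regions = two_pass_label(grid)
print(n_regions)  # 2
```

The U-shaped region in the sample grid first receives two provisional labels, one per arm. They are merged when the scan reaches the pixel that joins the arms, which is the case the second pass exists to resolve.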


Author(s):  
Laith Mohammad Abualigah ◽  
Essam Said Hanandeh ◽  
Ahamad Tajudin Khader ◽  
Mohammed Abdallh Otair ◽  
Shishir Kumar Shandilya

Background: Given the increasing volume of text documents on Internet pages, dealing with such a tremendous amount of information becomes highly complex. Text clustering is a common optimization problem in which a large amount of text information is organized into a set of comparable and coherent clusters. Aims: This paper presents a novel local clustering technique, namely β-hill climbing, that solves the text document clustering problem by partitioning similar documents into the same cluster. Methods: The β parameter is the primary innovation in the β-hill climbing technique. It was introduced to strike a balance between local and global search. Local search methods, such as k-medoid and k-means, have been successfully applied to the text document clustering problem. Results: Experiments were conducted on eight standard benchmark text datasets with different characteristics taken from the Laboratory of Computational Intelligence (LABIC). The results showed that the proposed β-hill climbing achieved better results than the original hill climbing technique in solving the text clustering problem. Conclusion: Adding the β operator to hill climbing improves text clustering performance.
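A rough sketch of how a β operator can be added to hill climbing for a clustering objective follows. It rests on several assumptions that the abstract does not confirm: the β operator is taken to reset each decision variable to a random value with probability β, the documents are replaced by 1-D numbers as a stand-in, and the objective is the within-cluster sum of squared distances. The names `cost` and `beta_hill_climbing` and all parameter values are hypothetical.

```python
import random


def cost(points, assign, k):
    """Sum of squared distances of 1-D points to the mean of their cluster."""
    total = 0.0
    for j in range(k):
        members = [p for p, a in zip(points, assign) if a == j]
        if members:
            mean = sum(members) / len(members)
            total += sum((p - mean) ** 2 for p in members)
    return total


def beta_hill_climbing(points, k, beta=0.05, iterations=2000, seed=0):
    """Return (assignment, final cost, initial cost) after a fixed number of iterations."""
    rng = random.Random(seed)
    current = [rng.randrange(k) for _ in points]   # random initial clustering
    current_cost = cost(points, current, k)
    initial_cost = current_cost
    for _ in range(iterations):
        candidate = current[:]
        # Neighbourhood move (local search): reassign one randomly chosen item.
        candidate[rng.randrange(len(points))] = rng.randrange(k)
        # Assumed beta operator (global search): each variable is reset to a
        # random cluster with probability beta.
        for i in range(len(candidate)):
            if rng.random() < beta:
                candidate[i] = rng.randrange(k)
        candidate_cost = cost(points, candidate, k)
        # Greedy acceptance: keep the candidate only if it is not worse.
        if candidate_cost <= current_cost:
            current, current_cost = candidate, candidate_cost
    return current, current_cost, initial_cost


# Hypothetical stand-in data: two well-separated groups of 1-D values.
data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
assign, final_cost, initial_cost = beta_hill_climbing(data, k=2)
```

Because a candidate is accepted only when it is not worse, the final cost can never exceed the initial cost. The random resets give the search a way to leave a local optimum that single-item moves alone cannot escape.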

