Image Purification Technique for Myanmar OCR Applying Skew Angle Detection and Free Skew

Author(s):  
Chit San Lwin ◽  
Wu Xiangqian

Optical Character Recognition (OCR) is a technology widely adopted for the automatic conversion of hardcopy text into editable text. Because the technology is language dependent, it is far less developed for less widely used languages such as Myanmar. In addition, the uniqueness and complexity of the Myanmar writing system, such as touching and complex characters, continue to pose serious challenges to OCR researchers. In this paper, we propose a new technique for developing a Myanmar OCR system. Our technique implements skew angle detection and free skew (deskewing), noisy border correction, extra page elimination, and line segmentation on scanned images of Myanmar text. The performance of the proposed method is tested on 430 documents comprising different printed and handwritten Myanmar texts with various fonts, sizes, multi-column layouts, tables, stamps or photos, and background effects. Our method gives an accuracy of 100% for line segmentation and 99.92% for skew angle detection and free skew. The ability of our method to effectively perform global and local skew angle detection, free skew, and line segmentation on different handwritten and digital text images of the Myanmar character set with high accuracy confirms the robustness and reliability of the technique and its suitability for application to many other related languages.

In this paper we introduce a new Pashtu numerals dataset of handwritten scanned images and make it publicly available for scientific and research use. The Pashtu language is used by more than fifty million people for both oral and written communication, yet no effort has so far been devoted to an Optical Character Recognition (OCR) system for Pashtu. We introduce a new method for handwritten numeral recognition of the Pashtu language based on deep learning models, using convolutional neural networks (CNNs) for both feature extraction and classification. We assess the performance of the proposed CNN-based model and obtain a recognition accuracy of 91.45%.
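The abstract leaves the CNN architecture unspecified. Purely as an illustration of the convolution, activation, and pooling operations through which a CNN extracts features from a numeral image, here is a minimal NumPy forward pass (a sketch, not the authors' model):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear activation: zero out negative responses."""
    return np.maximum(x, 0)

def maxpool(x, size=2):
    """Non-overlapping max pooling: keep the strongest response per window."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    x = x[:h, :w].reshape(h // size, size, w // size, size)
    return x.max(axis=(1, 3))
```

A real model stacks several such layers with learned kernels and feeds the flattened output to a classifier; frameworks simply vectorize these loops.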


Author(s):  
Salam Ayad Hussein ◽  
Mohsin Raad Kareem

Sentiment analysis is studied within natural language processing. It is instrumental in finding the sentiment (feeling) or opinion (idea) hidden within a text. This research focuses on finding sentiments in a “text image” and then classifying them as desirable or not. These phrases and words reflect people’s perspectives on anything they think about, such as services, products, governments, and social media events. In this study, the optical character recognition (OCR) algorithm was used, which can be regarded as a classification procedure for visual patterns appearing in the form of a digital image. Moreover, the Naïve Bayes machine learning algorithm was employed to classify the extracted texts. Together, these two algorithms form a hybrid system that supports our needs, especially in this era of technological advances, frequent use of websites, and sharing of text images over the internet. Finally, the new contribution of this work involves dealing with Arabic-language texts transformed into images, which are extracted from a URL address and then classified into desirable and undesirable content.
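The Naïve Bayes classification step can be sketched generically. The multinomial classifier below, with Laplace smoothing, is an illustrative stand-in for the second stage of such a hybrid pipeline (the tokens would come from the OCR output); it is not the authors' implementation:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns a model for predict_nb."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Pick the label maximizing log P(label) + sum of log P(token | label)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            # Laplace (add-one) smoothing handles unseen tokens.
            lp += math.log((word_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

The labels "desirable"/"undesirable" from the abstract would simply replace the generic class labels here.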


Author(s):  
Janarthanan A ◽  
Pandiyarajan C ◽  
Sabarinathan M ◽  
Sudhan M ◽  
Kala R

Optical character recognition (OCR) is the process of recognizing the text in an image (here, one word). The input images are taken from the dataset, and the collected text images are first pre-processed. In pre-processing, we apply image resizing. Resizing is necessary when the total number of pixels must be increased or decreased; zooming, by contrast, remaps the image to a larger number of pixels so that the content remains clear when magnified. After that, we apply segmentation, in which each character of the word is segmented. We then extract feature values (the test features) from the image. In the classification stage, the text is classified from the image: classification is performed on the images to identify which of them contain text, using a trained classifier. The experimental results demonstrate the accuracy of the approach.
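As an illustration of the character segmentation step described above, the following sketch (not the paper's method) splits a binary word image into per-character column spans using a vertical projection profile, i.e. runs of ink-bearing columns separated by blank gaps:

```python
import numpy as np

def segment_characters(binary):
    """Split a binary word image (1 = ink) into per-character column spans.

    A column belongs to a character if it contains any ink; maximal runs
    of such columns, separated by blank gaps, become (start, end) spans.
    """
    col_ink = binary.sum(axis=0) > 0
    spans, start = [], None
    for x, has_ink in enumerate(col_ink):
        if has_ink and start is None:
            start = x                     # a character run begins
        elif not has_ink and start is not None:
            spans.append((start, x))      # a blank column ends the run
            start = None
    if start is not None:
        spans.append((start, len(col_ink)))  # run touching the right edge
    return spans
```

This works for well-separated characters; touching characters need more elaborate splitting, which is exactly where segmentation research gets hard.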


2019 ◽  
Vol 8 (04) ◽  
pp. 24586-24602
Author(s):  
Manpreet Kaur ◽  
Balwinder Singh

Text classification is a crucial step in optical character recognition. The output of a scanner is non-editable: one cannot make changes to a scanned text image even when required. This limitation motivates optical character recognition. Optical Character Recognition (OCR) is the process of converting scanned images of machine-printed or handwritten text into a computer-readable format. The OCR process involves several steps, including pre-processing after image acquisition, segmentation, feature extraction, and classification. Incorrect classification amounts to garbage in, garbage out. Existing methods focus only on the classification of unmixed characters in Arabic, English, Latin, Farsi, Bangla, and Devanagari scripts. The proposed hybrid technique solves the mixed (machine-printed and handwritten) character classification problem. Classification is carried out on different kinds of everyday forms, such as self-declaration forms, admission forms, verification forms, university forms, certificates, banking forms, dairy forms, and Punjab government forms. The proposed technique is capable of classifying handwritten and machine-printed text written in Gurmukhi script within mixed text, and it has been tested on 150 different kinds of forms in Gurmukhi and Roman scripts. It achieves 93% accuracy on mixed-character forms and 96% on unmixed-character forms, for an overall accuracy of 94.5%.


2021 ◽  
Vol 4 (1) ◽  
pp. 57-70
Author(s):  
Marina V. Polyakova ◽  
Alexandr G. Nesteryuk

Optical character recognition systems are used to convert books and documents into electronic form, to automate accounting systems in business, to recognize markers in augmented reality technologies, etc. When binarization is applied, the quality of optical character recognition is largely determined by how well the foreground pixels are separated from the background. Existing methods of text image binarization are analyzed and their insufficient quality is noted. As a research instrument, a minimum-distance classifier is used to improve an existing method for binarizing color text images. To improve binarization quality, it is advisable to divide image pixels into the two classes “Foreground” and “Background” using classification methods, namely a minimum-distance classifier, instead of heuristic threshold selection. To reduce the amount of information processed before applying the classifier, blocks of pixels are selected for subsequent processing by analyzing the connected components of the original image. An improved method for binarizing color text images using connected component analysis and a minimum-distance classifier has been elaborated. Experiments showed that the elaborated method is better than existing binarization methods in terms of robustness, but worse in terms of the error in determining object boundaries. Among the recognition errors, pixels of the class labeled “Foreground” were more often mistaken for the class labeled “Background”. The proposed binarization method with unique class prototypes is recommended for processing color images of printed text, where the error in determining character boundaries during binarization is compensated by the thickness of the letters.
With multiple class prototypes, the proposed binarization method is recommended for processing color images of handwritten text when high performance is not required. The improved method has shown its efficiency under slow changes in the color and illumination of the text and background; however, abrupt changes in color and illumination, as well as a textured background, do not allow the binarization quality required for practical problems to be achieved.
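The core minimum-distance classification of pixels into “Foreground” and “Background” reduces to assigning each pixel to the nearer of two prototype colors. The sketch below is an illustrative reduction of that idea, not the authors' implementation; the prototype colors are hypothetical inputs that the full method would derive from the image:

```python
import numpy as np

def binarize_min_distance(image, fg_proto, bg_proto):
    """Binarize an H x W x 3 color image with a minimum-distance classifier.

    Each pixel is labeled foreground (True) or background (False) by
    comparing its Euclidean distance in RGB space to the two class
    prototype colors.
    """
    d_fg = np.linalg.norm(image - np.asarray(fg_proto, float), axis=-1)
    d_bg = np.linalg.norm(image - np.asarray(bg_proto, float), axis=-1)
    return d_fg < d_bg
```

With a single prototype per class this is a linear decision boundary in color space; allowing multiple prototypes per class (as the abstract discusses for handwritten text) makes the boundary piecewise linear at extra cost.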


Author(s):  
Ruwanmini ◽  
Liyanage ◽  
Karunarathne ◽  
Dias ◽  
Nandasara

Sinhala inscriptions are one of the major sources of information about ancient Sri Lanka, and revealing the information they contain is a huge challenge for archaeologists. This research paper focuses on Sinhala character recognition in ancient Sri Lankan inscriptions. Our intention is to ease this process by developing a web-based application that enables recognition of inscription characters from scanned images and stores them in an inscription database. Using this system, people can track the geographical locations of inscriptions, and epigraphists can easily obtain Sinhala interpretations of Sri Lankan inscriptions via the optical character recognition feature of the system. Our work on this research project benefits researchers in the field of archaeology, epigraphists, and members of the general public who are interested in this subject. The inscription site tracking module presents a map that users can navigate easily by tracking the locations of inscriptions. This paper presents the architecture of this Sinhala epigraphy system.


The Sindhi language uses an extended and modified set of the Arabic and Persian alphabets; it is the largest extension of the Arabic alphabet. Thus, a character recognition system for Arabic or any other Arabic-script-based language cannot recognize all characters of Sindhi. From a character recognition point of view, Sindhi is a tough script: its cursive, context-sensitive characteristics, large character set, highly similar basic character shapes, font-type variations, and size variations pose major challenges for Sindhi character recognition research. In addition, line segmentation is a hard task because line heights are non-uniform. In this paper, we present an algorithm for segmenting a Sindhi text image into lines. The proposed algorithm solves the over-segmentation and under-segmentation problems in line segmentation for Sindhi documents. Tested on 100 text images from different Sindhi books, it successfully segmented 99.95% of the lines correctly.
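The abstract does not detail the algorithm. As a generic illustration of projection-based line segmentation, and of how bridging short blank gaps curbs over-segmentation (splitting one line into fragments) without causing under-segmentation (fusing distinct lines), consider this sketch, which is not the paper's algorithm:

```python
import numpy as np

def segment_lines(binary, min_gap=2):
    """Split a binary page image (1 = ink) into text-line row spans.

    Rows with any ink form candidate lines; interior blank gaps shorter
    than `min_gap` rows are bridged, merging fragments of one line while
    leaving the larger gaps between distinct lines intact.
    """
    row_ink = binary.sum(axis=1) > 0
    filled = row_ink.copy()
    gap_start = None
    for y, has_ink in enumerate(row_ink):
        if not has_ink and gap_start is None:
            gap_start = y
        elif has_ink and gap_start is not None:
            if 0 < y - gap_start < min_gap and gap_start > 0:
                filled[gap_start:y] = True  # bridge a short interior gap
            gap_start = None
    spans, start = [], None
    for y, has_ink in enumerate(filled):
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            spans.append((start, y))
            start = None
    if start is not None:
        spans.append((start, len(filled)))
    return spans
```

For scripts like Sindhi, where diacritics above and below the baseline create non-uniform line heights, tuning `min_gap` (or replacing it with an adaptive threshold) is precisely where the difficulty lies.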


2019 ◽  
Vol 16 (2) ◽  
pp. 0409
Author(s):  
Ali Et al.

Human Interactive Proofs (HIPs) are automated inverse Turing tests intended to differentiate between people and malicious computer programs. Building a good HIP system is a challenging task, since the resulting HIP must be secure against attacks while remaining practical for humans. Text-based HIPs are one of the most popular HIP types: they exploit the fact that humans can read text images better than Optical Character Recognition (OCR) software. However, current text-based HIPs have not kept pace with the rapid development of computer vision techniques, as they are either very easily passed or very hard to solve, which motivates continued efforts to improve text-based HIPs. In this paper, a new scheme is proposed for an animated text-based HIP; it exploits the gap between human perception and the computer's ability to mimic it in order to achieve a more secure and more usable HIP. The scheme can prevent attacks because it is hard for a machine to distinguish characters in an animated environment displayed as digital video, yet it remains easy and practical for humans, who are attuned to perceiving motion. The proposed scheme has been tested against many Optical Character Recognition applications; it passes all these tests successfully and achieves a high usability rate of 95%.


2021 ◽  
Vol 11 (6) ◽  
pp. 7968-7973
Author(s):  
M. Kazmi ◽  
F. Yasir ◽  
S. Habib ◽  
M. S. Hayat ◽  
S. A. Qazi

Urdu Optical Character Recognition (OCR) based on character-level recognition (the analytical approach) is less popular than ligature-level recognition (the holistic approach) due to its added complexity and the overlapping of characters and strokes. This paper presents a holistic Urdu ligature extraction technique. The proposed Photometric Ligature Extraction (PLE) technique is independent of font size and column layout and can handle non-overlapping as well as all inter- and intra-overlapping ligatures. It uses a customized photometric filter along with X-shearing, padding, and connected component analysis to extract complete ligatures instead of extracting primary and secondary ligatures separately. A total of approximately 267,800 ligatures were extracted from scanned Urdu Nastaliq printed text images with an accuracy of 99.4%. Thus, the proposed framework outperforms existing Urdu Nastaliq text extraction and segmentation algorithms. The proposed PLE framework can also be applied to other languages that use the Nastaliq script style, such as Arabic, Persian, Pashto, and Sindhi.
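Of the steps named in the abstract, only the connected component analysis is sketched here, in a generic breadth-first form; the photometric filter and X-shearing that make the actual PLE technique work are not reproduced:

```python
from collections import deque

import numpy as np

def connected_components(binary):
    """Label 8-connected ink blobs in a binary image (1 = ink).

    Returns (labels, count): labels is an array with 0 for background and
    1..count for each blob, found by breadth-first flood fill.
    """
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    count = 0
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and labels[sy, sx] == 0:
                count += 1
                labels[sy, sx] = count
                q = deque([(sy, sx)])
                while q:
                    y, x = q.popleft()
                    for dy in (-1, 0, 1):      # visit all 8 neighbours
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and binary[ny, nx]
                                    and labels[ny, nx] == 0):
                                labels[ny, nx] = count
                                q.append((ny, nx))
    return labels, count
```

In a Nastaliq pipeline, each labeled blob would correspond to a candidate ligature or diacritic; the paper's contribution is grouping these into complete ligatures despite overlaps, which plain labeling cannot do.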

