CNN-Based Page Segmentation and Object Classification for Counting Population in Ottoman Archival Documentation

2020 ◽  
Vol 6 (5) ◽  
pp. 32 ◽  
Author(s):  
Yekta Said Can ◽  
M. Erdem Kabadayı

Historical document analysis systems gain importance with the increasing efforts in the digitization of archives. Page segmentation and layout analysis are crucial steps for such systems. Errors in these steps will affect the outcome of handwritten text recognition and Optical Character Recognition (OCR) methods, which increases the importance of page segmentation and layout analysis. Degradation of documents, digitization errors, and varying layout styles are the issues that complicate the segmentation of historical documents. The properties of Arabic scripts, such as connected letters, ligatures, diacritics, and different writing styles, make it even more challenging to process historical documents in Arabic script. In this study, we developed an automatic system for counting registered individuals and assigning them to populated places by using a CNN-based architecture. To evaluate the performance of our system, we created a labeled dataset of registers obtained from the first wave of population registers of the Ottoman Empire, held between the 1840s and 1860s. We achieved promising results for classifying different types of objects, counting the individuals, and assigning them to populated places.
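
The abstract does not detail the CNN architecture used; as background, the core operation of any CNN, a 2D convolution, can be sketched in plain Python (an illustrative sketch only, not the authors' model):

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution (cross-correlation) of a grayscale image."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical edge-detector kernel applied to a tiny image with a dark/light boundary.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge = [[-1, 1],
        [-1, 1]]
print(conv2d(img, edge))  # strongest response at the 0 -> 1 boundary column
```

In a full segmentation network, stacks of such learned kernels (plus pooling and nonlinearities) produce the feature maps used to classify page regions.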

2018 ◽  
Vol 2 ◽  
pp. e27055
Author(s):  
Robert Cubey ◽  
Elspeth Haston ◽  
Sally King

The transcription of natural history collection labels is occurring via a variety of different methods – in-house curators, commercial operations, citizen scientists, visiting researchers, linked data, optical character recognition (OCR), handwritten text recognition (HTR), etc. – but what can a collections data manager do with this flood of data? There is a whole raft of questions around this incoming data stream: who values it, who needs it, where is it stored, where is it displayed, who has access to it, etc. This talk plans to address these topics with reference to the Royal Botanic Garden Edinburgh herbarium dataset.


Author(s):  
Rohan Modi

Handwriting Detection is the process or capability of a computer program to collect and analyze comprehensible handwritten input from various types of media such as photographs, newspapers, paper reports, etc. Handwritten Text Recognition is a sub-discipline of Pattern Recognition. Pattern Recognition refers to the classification of datasets or objects into various categories or classes. Handwriting Recognition is the process of transforming handwritten text in a specific language into its digitally expressible script, represented by a set of symbols known as letters or characters. Speech synthesis is the artificial production of human speech using machine-learning-based software and audio-output computer hardware. While there are many systems that convert normal language text into speech, the aim of this paper is to study Optical Character Recognition with speech synthesis technology and to develop a cost-effective, user-friendly, image-based offline text-to-speech conversion system using a CRNN neural network model and a Hidden Markov Model. The automated interpretation of handwritten text can be very useful wherever great amounts of handwritten data must be processed, such as signature verification, analysis of various types of documents, and recognition of amounts written by hand on bank cheques.
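
CRNN-style recognizers are commonly trained with a CTC objective, whose simplest decoding step (collapse repeated labels, drop blanks) can be sketched as below. The paper does not state its decoding scheme, so this is a generic illustration with a hypothetical alphabet:

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Greedy CTC decoding: take the argmax label per frame,
    collapse consecutive repeats, and drop blank symbols."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

# Index 0 is the CTC blank; indices 1.. map to characters.
alphabet = ["-", "c", "a", "t"]
probs = [
    [0.1, 0.8, 0.05, 0.05],   # 'c'
    [0.1, 0.7, 0.1, 0.1],     # 'c' (repeat, collapsed)
    [0.9, 0.03, 0.04, 0.03],  # blank
    [0.1, 0.1, 0.7, 0.1],     # 'a'
    [0.1, 0.1, 0.1, 0.7],     # 't'
]
print(ctc_greedy_decode(probs, alphabet))  # "cat"
```

The decoded string could then be passed to a speech synthesizer to complete the image-to-speech pipeline the paper describes.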


2019 ◽  
Vol 94 ◽  
pp. 122-134 ◽  
Author(s):  
Joan Andreu Sánchez ◽  
Verónica Romero ◽  
Alejandro H. Toselli ◽  
Mauricio Villegas ◽  
Enrique Vidal

2020 ◽  
Vol 13 (2) ◽  
pp. 200-214
Author(s):  
Rajib Ghosh ◽  
Prabhat Kumar

Background: The growing use of smart hand-held devices in people's daily lives drives the need for online handwritten text recognition. Online handwritten text recognition refers to the identification of handwritten text at the very moment it is written on a digitizing tablet using a pen-like stylus. Several techniques are available for online handwritten text recognition in English, Arabic, Latin, Chinese, Japanese, and Korean scripts. However, limited research is available for Indic scripts. Objective: This article presents a novel approach for online handwritten numeral and character (simple and compound) recognition of three popular Indic scripts - Devanagari, Bengali and Tamil. Methods: The proposed work employs the Zone-wise Slopes of Dominant Points (ZSDP) method for feature extraction from the individual characters. Support Vector Machine (SVM) and Hidden Markov Model (HMM) classifiers are used for the recognition process. Recognition efficiency is improved by combining the probabilistic outcomes of the SVM and HMM classifiers using Dempster-Shafer theory. The system is trained using separate as well as combined datasets of numerals, simple characters, and compound characters. Results: The performance of the present system is evaluated using large self-generated datasets as well as public datasets. Results obtained from the present work demonstrate that the proposed system outperforms the existing works in this regard. Conclusion: This work will be helpful for research on online recognition of handwritten characters in other Indic scripts, as well as recognition of isolated words in various Indic scripts, including the scripts used in the present work.
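
Combining SVM and HMM posteriors via Dempster-Shafer theory can be sketched with Dempster's rule of combination. Assuming (as a simplification; the paper's exact belief assignment is not given in the abstract) that each classifier assigns mass only to singleton classes, the rule reduces to a renormalized product:

```python
def dempster_combine(m1, m2):
    """Dempster's rule for mass functions over singleton classes only.
    Conflict mass K is discarded and the remainder renormalized."""
    classes = set(m1) | set(m2)
    agreement = {c: m1.get(c, 0.0) * m2.get(c, 0.0) for c in classes}
    total = sum(agreement.values())  # equals 1 - K, with K the conflict mass
    if total == 0:
        raise ValueError("total conflict: classifiers fully disagree")
    return {c: v / total for c, v in agreement.items()}

svm = {"ka": 0.6, "kha": 0.3, "ga": 0.1}  # hypothetical SVM posteriors
hmm = {"ka": 0.5, "kha": 0.4, "ga": 0.1}  # hypothetical HMM posteriors
fused = dempster_combine(svm, hmm)
best = max(fused, key=fused.get)
print(best)  # "ka"
```

The fused masses sharpen agreement between the two classifiers, which is the intuition behind using evidence combination to improve recognition efficiency.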


Author(s):  
Sameer M. Patel ◽  
Sarvesh S. Pai ◽  
Mittal B. Jain ◽  
Vaibhav P. Vasani

Optical Character Recognition is the mechanical or electronic conversion of printed or handwritten text into machine-understandable text. The challenge of Optical Character Recognition under varying conditions remains as relevant as it was in past years. Even in the present age of automation and innovation, keyboarding remains the most common way of feeding data into computers, and it is probably the most time-consuming and labor-intensive operation in the industry. Automating the recognition of documents, credit cards, electronic invoices, and car license plates could save time spent analyzing and processing data. With increased research and development in machine learning, the quality of text recognition continues to improve. Our paper provides a brief explanation of the different stages involved in the process of optical character recognition, and through the proposed application we aim to automate the extraction of important text from electronic invoices. The main goal of the project is to develop a real-time OCR web application with a microservice architecture, which would help in extracting necessary information from an invoice.
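
Once the OCR stage has produced raw text, the invoice-extraction step can be sketched with simple pattern matching. The field labels and regexes below are illustrative assumptions, not the authors' implementation; real invoices vary widely in layout:

```python
import re

def extract_invoice_fields(ocr_text):
    """Pull common fields out of OCR'd invoice text with regexes.
    Field labels here are hypothetical examples."""
    patterns = {
        "invoice_no": r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\w+)",
        "date":       r"Date\s*[:\-]?\s*(\d{2}[/-]\d{2}[/-]\d{4})",
        "total":      r"Total\s*[:\-]?\s*\$?([\d,]+\.\d{2})",
    }
    fields = {}
    for name, pat in patterns.items():
        m = re.search(pat, ocr_text, re.IGNORECASE)
        fields[name] = m.group(1) if m else None
    return fields

sample = "Invoice No: INV42\nDate: 03/11/2020\nTotal: $1,250.00"
print(extract_invoice_fields(sample))
```

In a microservice setup, a routine like this would sit behind the OCR service and return the extracted fields as JSON to the web front end.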


CONVERTER ◽  
2021 ◽  
pp. 01-10
Author(s):  
Tuanji Gong ◽  
Xuanxia Yao

Recently, Optical Character Recognition (OCR) based on deep learning technology has achieved great advances and has been broadly applied in various industries. However, it still faces many challenging problems in handwritten text recognition and mathematical expression recognition, such as handwritten Chinese recognition, mixtures of printed and handwritten Chinese characters, mathematical expressions (MEs), and chemical equations. In traditional OCR, feature selection played a vital role in recognition accuracy, while hand-crafted features are costly and time-consuming. In this paper, we introduce a deep-learning-based framework to detect and recognize handwritten and printed text or mathematical expressions. The framework consists of three components. The first component is the DCN (Detection & Classification Network), which is based on the SSD model and detects and classifies mathematical expressions and text. The second component consists of text recognition and ME recognition models. The final component merges the multiple outputs of the second stage into a whole text. Experimental results show that our framework achieves a relative 10% improvement on mixtures of printed or handwritten texts and MEs in images. The framework has been deployed for recognizing papers and homework at an online education platform.
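
The final merging component is not specified in the abstract; one plausible sketch (an assumption, not the authors' algorithm) is to sort the recognized regions into reading order, grouping boxes with similar vertical position into lines:

```python
def merge_regions(regions, line_tol=10):
    """Merge recognized region texts into one string in reading order.
    Regions are (x, y, text) tuples with y the top coordinate; regions
    whose y differs from the current line's by less than line_tol are
    treated as the same line."""
    regions = sorted(regions, key=lambda r: (r[1], r[0]))
    lines, current, current_y = [], [], None
    for x, y, text in regions:
        if current_y is None or y - current_y < line_tol:
            current.append((x, text))
            current_y = y if current_y is None else current_y
        else:
            lines.append(current)
            current, current_y = [(x, text)], y
    if current:
        lines.append(current)
    return "\n".join(" ".join(t for _, t in sorted(line)) for line in lines)

# Text and ME boxes from the (hypothetical) recognition stage.
boxes = [(120, 12, "x^2 + 1"), (10, 10, "Solve"), (10, 50, "for x.")]
print(merge_regions(boxes))
```

A fixed line tolerance is a simplification; a production system would typically cluster on box heights and baselines instead.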


Babel ◽  
2020 ◽  
Vol 66 (2) ◽  
pp. 294-310
Author(s):  
Miodrag M. Vukčević

The translation of handwritten historical documents faces many challenges due to variation in writing style, local language, and inevitable language change. Even the transliteration from Cyrillic to Latin characters is standardized by the bijective transliteration standard ISO 9. This presentation introduces a number of tools offered by Transkribus for the automated processing of documents, such as Handwritten Text Recognition (HTR) and Document Understanding, which are needed for the translation of historical documents. Beyond the problem of decoding handwritten documents, written for example in Kurrentschrift using ancient terminology, changed meanings and different spellings must additionally be considered when translating texts from earlier centuries. Resolution strategies applied in a case study demonstrate different methods for ensuring quality translations.
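
Because ISO 9 maps each Cyrillic character to exactly one Latin character (with diacritics), transliteration is a simple character-wise substitution. The sketch below uses only a small subset of the full ISO 9:1995 table:

```python
# A small subset of the ISO 9:1995 one-to-one Cyrillic-to-Latin table.
ISO9 = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
    "у": "u", "ч": "č", "ш": "š",
}

def transliterate(text):
    """Character-by-character ISO 9 transliteration (subset shown above)."""
    return "".join(ISO9.get(ch, ch) for ch in text.lower())

print(transliterate("шапка"))  # šapka
```

Bijectivity is what makes the mapping reversible, so the original Cyrillic spelling can always be recovered from the transliterated form.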


Author(s):  
Mohamed Elleuch ◽  
Monji Kherallah

In recent years, deep learning (DL) based systems have become very popular for constructing hierarchical representations from unlabeled data. Moreover, DL approaches have been shown to exceed previous state-of-the-art machine learning models in various areas, with pattern recognition being one of the more important cases. This paper applies Convolutional Deep Belief Networks (CDBN) to textual image data containing Arabic handwritten script (AHS) and evaluates them on two different databases characterized by the low/high-dimension property. In addition to the benefits provided by deep networks, the system is protected against over-fitting. Experimentally, the authors demonstrate that the extracted features are effective for handwritten character recognition and show very good performance, comparable to the state of the art in handwritten text recognition. Using Dropout, the proposed CDBN architectures achieved promising accuracy rates of 91.55% and 98.86% when applied to the IFN/ENIT and HACDB databases, respectively.
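
The Dropout mechanism used here to guard against over-fitting can be sketched in isolation. This is the standard "inverted dropout" formulation, not the paper's specific implementation:

```python
import random

def dropout(activations, p, rng):
    """Inverted dropout: zero each unit with probability p during training
    and scale survivors by 1/(1-p) so the expected activation is unchanged."""
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded for reproducibility
acts = [0.5, 1.0, 0.2, 0.8]
dropped = dropout(acts, p=0.5, rng=rng)
print(dropped)  # survivors doubled, dropped units zeroed
```

At test time no units are dropped and, thanks to the 1/(1-p) scaling during training, no extra rescaling is needed.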


Handwritten Text Recognition (HTR) can degrade sharply when documents are damaged with smudges, blemishes, and blurs. Recognition of such documents is a challenging task. We therefore propose a system to identify handwritten textual content in documents where state-of-the-art Optical Character Recognition (OCR), even at its full extent, performs with low accuracy. By introducing word substitution using character and distance analysis for spell checking and word completion in such areas, backed by a word corpus, we improved our prediction results, especially in cases where the OCR is prone to predict false positives, predominantly in the smudged areas. Blur detection is also performed on every word before segmentation, and words detected as blurred are likewise substituted with suitable corpus words to avoid false-positive results. This methodology is far more convenient and reliable, since even state-of-the-art HTR technologies do not exceed 71% accuracy on such documents. The accuracy of the predicted text is measured using a text similarity metric, the Fuzzy Token Set Ratio (FTSR).
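
A distance-based word-substitution step of the kind described can be sketched with Levenshtein edit distance against a word corpus. The threshold and corpus below are illustrative assumptions, not the paper's parameters:

```python
def levenshtein(a, b):
    """Edit distance (insert/delete/substitute) between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def substitute(word, corpus, max_dist=2):
    """Replace an OCR output word with the closest corpus word,
    if one lies within max_dist edits; otherwise keep it as-is."""
    best = min(corpus, key=lambda w: levenshtein(word, w))
    return best if levenshtein(word, best) <= max_dist else word

corpus = ["recognition", "handwritten", "document"]
print(substitute("hamdwr1tten", corpus))  # "handwritten"
```

A smudge-corrupted token like `hamdwr1tten` is thus mapped back to a plausible corpus word instead of being passed through as an OCR false positive.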

