Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study

2019 ◽  
Vol 34 (4) ◽  
pp. 825-843 ◽  
Author(s):  
Mark J Hill ◽  
Simon Hengchen

Abstract: This article aims to quantify the impact that optical character recognition (OCR) has on the quantitative analysis of historical documents. Using Eighteenth Century Collections Online as a case study, we first explore and explain the differences between the OCR corpus and its keyed-in counterpart, created by the Text Creation Partnership. We then conduct a series of analyses common to the digital humanities: topic modelling, authorship attribution, collocation analysis, and vector space modelling. The article concludes by offering some preliminary thoughts on how these conclusions can be applied to other datasets, reflecting on the potential for predicting the quality of OCR where no ground truth exists.
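A standard way to quantify the gap between an OCR corpus and its keyed-in counterpart, as this kind of study does, is character error rate (CER): edit distance normalised by ground-truth length. A minimal sketch (the example strings are illustrative, not taken from ECCO):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_error_rate(ocr: str, ground_truth: str) -> float:
    # CER = edit distance normalised by ground-truth length.
    return levenshtein(ocr, ground_truth) / max(len(ground_truth), 1)

# A typical dirty-OCR artefact: the long s misread as "f"
ocr_line = "the Liberty of the Prefs"
tcp_line = "the Liberty of the Press"
print(round(char_error_rate(ocr_line, tcp_line), 3))  # → 0.042
```

Aggregating such per-line scores over a corpus gives the kind of quality estimate the article discusses for cases where a TCP-style ground truth exists.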

1979 ◽  
Vol 73 (10) ◽  
pp. 389-399
Author(s):  
Gregory L. Goodrich ◽  
Richard R. Bennett ◽  
William R. De L'aune ◽  
Harvey Lauer ◽  
Leonard Mowinski

This study was designed to assess the Kurzweil Reading Machine's ability to read three different type styles produced by five different means. The results indicate that the Kurzweil Reading Machines tested had different error rates depending on the means of producing the copy and on the type style used; there was a significant interaction between copy method and type style. The interaction indicates that some type styles are read better when the copy is made by one means rather than another. Error rates ranged from less than one percent to more than twenty percent. In general, the user will find that high-quality printed materials are read with a relatively high level of accuracy; but as the quality of the material decreases, the number of errors made by the machine increases. As the error rate increases, the user will find it increasingly difficult to understand the spoken output.


Symmetry ◽  
2020 ◽  
Vol 12 (5) ◽  
pp. 715
Author(s):  
Dan Sporici ◽  
Elena Cușnir ◽  
Costin-Anton Boiangiu

Optical Character Recognition (OCR) is the process of identifying texts rendered as pixels in images and converting them to a more computer-friendly representation. The presented work aims to show that the accuracy of the Tesseract 4.0 OCR engine can be further enhanced by employing convolution-based preprocessing using specific kernels. While Tesseract 4.0 has demonstrated strong performance when evaluated against favorable input, its capability of properly detecting and identifying characters in more realistic, unfriendly images is open to question. The article proposes an adaptive image preprocessing step guided by a reinforcement learning model, which attempts to minimize the edit distance between the recognized text and the ground truth. It is shown that this approach can boost the character-level accuracy of Tesseract 4.0 from 0.134 to 0.616 (+359% relative change) and the F1 score from 0.163 to 0.729 (+347% relative change) on a dataset that is considered challenging by its authors.
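The core operation the paper builds on is a 2D convolution of the page image with a small kernel before OCR. A minimal sketch with a generic sharpening kernel (the actual kernels in the paper are chosen adaptively by the reinforcement learning agent, not fixed as here):

```python
import numpy as np

def convolve2d(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    # Naive "same"-size convolution with zero padding (sketch, not optimised).
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * kernel)
    return np.clip(out, 0, 255)  # keep values in valid 8-bit range

# A common sharpening kernel; an RL agent would instead select kernels
# that minimise edit distance against the ground truth.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

img = np.full((5, 5), 128.0)
img[2, 2] = 200.0  # a bright "stroke" pixel
print(convolve2d(img, sharpen)[2, 2])  # → 255.0 (clipped)
```

The preprocessed image would then be passed to Tesseract; the agent's reward is the resulting reduction in edit distance.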


This paper presents a camera-based assistive system for reading text on product labels, intended to support visually impaired users. A camera serves as the primary input source. To identify an object, the user moves it in front of the camera; the moving object is detected using a Background Subtraction (BGS) method, and the text region is automatically localized as a Region of Interest (ROI). Text is extracted from the ROI by combining rule-based and learning-based techniques. A novel rule-based text localization algorithm recognizes geometric features such as pixel value, colour intensity, and character size, while features such as gradient magnitude, gradient width, and stroke width are learned using an SVM classifier to build a model that separates text from non-text regions. The system is integrated with Optical Character Recognition (OCR) to extract the text, which is then delivered to the user as voice output. The system is evaluated on the ICDAR 2011 dataset, which consists of 509 natural scene images with ground truth.
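The rule-based half of such a pipeline typically filters candidate regions by simple geometric features before any classifier runs. A minimal sketch; the specific thresholds and the (width, height, stroke-width) feature triple are illustrative assumptions, not the paper's exact rules:

```python
def rule_based_text_filter(regions):
    # Each candidate region: (width, height, stroke_width) in pixels.
    # Heuristic rules akin to the geometric features described:
    # plausible character height, aspect ratio, and stroke-width ratio.
    kept = []
    for w, h, sw in regions:
        aspect = w / h
        if 5 <= h <= 100 and 0.1 <= aspect <= 2.0 and sw / h < 0.5:
            kept.append((w, h, sw))
    return kept

candidates = [(12, 20, 3),   # letter-like: kept
              (300, 10, 2),  # long thin line: rejected (aspect ratio)
              (8, 15, 10)]   # blob: rejected (stroke too thick)
print(rule_based_text_filter(candidates))  # → [(12, 20, 3)]
```

Regions surviving the rules would then be scored by the learned SVM model before being handed to OCR.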


2021 ◽  
Vol 9 (1) ◽  
pp. 19-34
Author(s):  
Mikko Tolonen ◽  
Eetu Mäkelä ◽  
Ali Ijaz ◽  
Leo Lahti

Eighteenth Century Collections Online (ECCO) is the most comprehensive machine-readable dataset of eighteenth-century printed texts. It plays a crucial role in studies of eighteenth-century language and has vast potential for corpus linguistics. At the same time, it is an unbalanced corpus that poses a series of problems. The aim of this paper is to offer a general overview of ECCO for corpus linguistics by analysing, for example, its publication countries and languages. We also analyse the role of the substantial number of reprints and new editions in the data, discuss genres, and consider estimates of Optical Character Recognition (OCR) quality. Our conclusion is that, whereas ECCO provides a valuable source for corpus linguistics, scholars need to practise historical source criticism. We highlight key aspects that need to be taken into consideration when considering its possible uses.


2021 ◽  
Vol 4 (1) ◽  
pp. 57-70
Author(s):  
Marina V. Polyakova ◽  
Alexandr G. Nesteryuk

Optical character recognition systems for images are used to convert books and documents into electronic form, to automate accounting systems in business, to recognize markers in augmented reality technologies, etc. When binarization is applied, the quality of optical character recognition is largely determined by how well foreground pixels are separated from the background. Existing methods of text image binarization are analysed and their insufficient quality is noted. As the research approach, a minimum-distance classifier is used to improve an existing method of binarization of colour text images. To improve binarization quality, it is advisable to divide image pixels into the two classes "Foreground" and "Background" using classification methods, namely a minimum-distance classifier, instead of heuristic threshold selection. To reduce the amount of information processed before applying the classifier, blocks of pixels are selected for subsequent processing by analysing the connected components of the original image. On this basis, an improved method of colour text image binarization using connected-component analysis and a minimum-distance classifier has been elaborated.
Experiments showed that the elaborated method is more robust than existing binarization methods, but worse in terms of the error in determining object boundaries. Among the recognition errors, pixels from the class labelled "Foreground" were more often mistaken for the class labelled "Background". With a single prototype per class, the proposed binarization method is recommended for processing colour images of printed text, where the error in determining character boundaries introduced by binarization is compensated by the thickness of the letters. With multiple prototypes per class, the method is recommended for processing colour images of handwritten text, provided high performance is not required. The improved binarization method has shown its efficiency under slow changes in the colour and illumination of text and background; however, abrupt changes in colour and illumination, as well as a textured background, do not allow the binarization quality required for practical problems.
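The minimum-distance classifier at the heart of this method assigns each pixel to the class whose colour prototype is nearest. A minimal sketch with one prototype per class (the prototype colours and sample pixels are illustrative assumptions):

```python
def classify_pixel(pixel, prototypes):
    # Minimum-distance classifier: assign the pixel to the class whose
    # prototype colour is nearest in RGB space (squared Euclidean distance).
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(prototypes, key=lambda label: sq_dist(pixel, prototypes[label]))

def binarize(pixels, prototypes):
    # 1 for "Foreground" (text ink), 0 for "Background" (paper).
    return [1 if classify_pixel(p, prototypes) == "Foreground" else 0
            for p in pixels]

prototypes = {"Foreground": (20, 20, 30),     # dark ink
              "Background": (230, 225, 210)}  # paper
row = [(25, 22, 35), (228, 224, 205), (120, 118, 110)]
print(binarize(row, prototypes))  # → [1, 0, 1]
```

Handwritten text, with its wider colour variation, would use several prototypes per class, as the abstract recommends, at a corresponding cost in speed.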


Author(s):  
Michael Plotnikov ◽  
Paul W. Shuldiner

The ability of an automated license plate reading (ALPR) system to convert video images of license plates into computer records depends on many factors. Of these, two are readily controlled by the operator: the quality of the video images captured in the field and the internal settings of the ALPR used to transcribe these images. A third factor, the light conditions under which the license plate images are acquired, is less easily managed, especially when camcorders are used in the field under ambient light conditions. A set of experiments was conducted to test the effects of ambient light conditions, video camcorder adjustments, and internal ALPR settings on the percentage of correct reads attained by a specific type of ALPR, one whose optical character recognition process is based on template matching. Images of rear license plates were collected under four ambient light conditions: overcast with no shadows, and full sunlight with the sun in front of the camcorder, behind the camcorder, and orthogonal to the line of sight. Three camcorder exposure settings were tested. Two of the settings made use of the camcorder's internal light meter, and the third relied solely on operator judgment. The percentage of license plates read correctly ranged from 41% to 72%, depending most strongly on ambient light conditions. In all cases, careful adjustment of the ALPR led to significantly improved read rates over those obtained using the manufacturer's recommended default settings. Exposure settings based on the operator's judgment worked best in all instances.
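Template matching, the OCR approach used by the ALPR tested here, scores a segmented glyph against stored character templates and picks the best match. A minimal sketch on toy 0/1 bitmaps (the two-template alphabet and pixel-agreement score are illustrative simplifications):

```python
def match_score(glyph, template):
    # Fraction of pixels that agree between a glyph and a template,
    # both given as equal-sized 0/1 bitmaps (lists of rows).
    total = sum(len(row) for row in glyph)
    agree = sum(g == t for grow, trow in zip(glyph, template)
                for g, t in zip(grow, trow))
    return agree / total

def recognise(glyph, templates):
    # Template matching: pick the character whose template agrees most.
    return max(templates, key=lambda ch: match_score(glyph, templates[ch]))

templates = {
    "I": [[0, 1, 0]] * 5,
    "L": [[1, 0, 0]] * 4 + [[1, 1, 1]],
}
unknown = [[0, 1, 0]] * 4 + [[0, 1, 1]]  # a slightly degraded "I"
print(recognise(unknown, templates))  # → "I"
```

This illustrates why image quality dominates the read rate: degraded pixels directly lower the agreement score, and under poor lighting the best-matching template may no longer be the correct character.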


Author(s):  
Sameer M. Patel ◽  
Sarvesh S. Pai ◽  
Mittal B. Jain ◽  
Vaibhav P. Vasani

Optical character recognition is essentially the mechanical or electronic conversion of printed or handwritten text into machine-understandable text. The challenge of optical character recognition under varied conditions remains as relevant as it was in past years. Even in the present era of automation and innovation, keyboarding remains the most common way of feeding data into computers, and it is probably the most time-consuming and labor-intensive operation in the industry. Automating the recognition of documents, credit cards, electronic invoices, and car license plates could save considerable time in analyzing and processing data. With the continued research and development of machine learning, the quality of text recognition keeps improving. Our paper provides a brief explanation of the different stages involved in the process of optical character recognition, and through the proposed application we aim to automate the extraction of important texts from electronic invoices. The main goal of the project is to develop a real-time OCR web application with a microservice architecture that helps extract the necessary information from an invoice.
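The post-OCR stage of such an invoice pipeline is usually field extraction over the raw recognized text. A minimal sketch using regular expressions; the field names, patterns, and sample invoice text are illustrative assumptions, not a fixed invoice standard or the paper's implementation:

```python
import re

def extract_invoice_fields(ocr_text: str) -> dict:
    # Pull common fields out of raw OCR text with regular expressions.
    patterns = {
        "invoice_no": r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\w+)",
        "date": r"Date\s*[:\-]?\s*(\d{2}/\d{2}/\d{4})",
        # \b stops "Subtotal" from matching as "Total".
        "total": r"\bTotal\s*[:\-]?\s*\$?([\d,]+\.\d{2})",
    }
    fields = {}
    for name, pattern in patterns.items():
        m = re.search(pattern, ocr_text, re.IGNORECASE)
        if m:
            fields[name] = m.group(1)
    return fields

sample = "Invoice No: A1023\nDate: 04/11/2020\nSubtotal: $90.00\nTotal: $99.50"
print(extract_invoice_fields(sample))
# → {'invoice_no': 'A1023', 'date': '04/11/2020', 'total': '99.50'}
```

In a microservice architecture, this extractor would sit as its own service behind the OCR service, receiving recognized text and returning structured fields.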

