Service-based information extraction from herbarium specimens

2018 ◽  
Vol 2 ◽  
pp. e25415
Author(s):  
Fabian Reimeier ◽  
Dominik Röpert ◽  
Anton Güntsch ◽  
Agnes Kirchhoff ◽  
Walter G. Berendsohn

On herbarium sheets, data elements such as plant name, collection site, collector, barcode and accession number are found mostly on labels glued to the sheet, and the data are thus visible on specimen images. With continuously improving technologies for mass digitisation of collections it has become increasingly easy to produce high-quality images of herbarium sheets, and in the last few years herbarium collections worldwide have started to digitise specimens on an industrial scale (Tegelberg et al. 2014). To use the label data contained in these massive numbers of images, they have to be captured and databased. Currently, manual data entry prevails and forms the principal cost and time limitation in the digitisation process. The StanDAP-Herb project has developed a standard process for the (semi-)automatic detection of data on herbarium sheets: a formal, extensible workflow integrating a wide range of automated specimen image analysis services, used to replace time-consuming manual data input as far as possible. We have created web services for OCR (Optical Character Recognition), for identifying regions of interest in specimen images, and for the context-sensitive extraction of information from text recognised by OCR. We implemented the workflow as an extension of the OpenRefine platform (Verborgh and De Wilde 2013).
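The context-sensitive extraction step described above can be illustrated with a minimal sketch: given raw OCR text from a label, field-specific patterns anchored on the surrounding label context pull out individual data elements. The label text, field names and regular expressions below are invented for illustration; they are not the StanDAP-Herb services themselves.

```python
import re

# Hypothetical OCR output from a herbarium label (invented example).
ocr_text = """Herbarium Berolinense
Coll.: E. Mueller  No. 4711
Barcode: B 10 0234567
Date: 12.06.1932"""

# Context-sensitive extraction: each field is located via the wording
# that typically precedes it on the sheet.
patterns = {
    "collector": r"Coll\.\s*:\s*(.+?)(?:\s{2,}|$)",
    "barcode":   r"Barcode:\s*([A-Z][\d\s]+\d)",
    "date":      r"Date:\s*([\d.]+)",
}

def extract(text: str) -> dict:
    """Return a dict of field -> value for every pattern that matches."""
    record = {}
    for field, pattern in patterns.items():
        m = re.search(pattern, text)
        if m:
            record[field] = m.group(1).strip()
    return record

print(extract(ocr_text))
```

In a real pipeline such rules would be one extraction service among several, applied after OCR and region-of-interest detection.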

Radiocarbon ◽  
1983 ◽  
Vol 25 (2) ◽  
pp. 661-666 ◽  
Author(s):  
Steinar Gulliksen

Computer storage and surveys of large sets of data should be an attractive technique for users of 14C dates. Our pilot project demonstrates the effectiveness of a text retrieval system, NOVA STATUS. A small database comprising ca. 100 dates, selected from results of the Trondheim 14C laboratory, has been generated. Data are entered into the computer by feeding typewritten forms through a document reader capable of optical character recognition. A text retrieval system allows data input in a flexible format. Program systems for text retrieval are in common use and easily implemented for a 14C database.
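The core of such a text retrieval system is an inverted index over free-format records. The sketch below shows the idea with a toy index over invented 14C date records; the record contents and field layout are illustrative, not taken from NOVA STATUS.

```python
from collections import defaultdict

# Toy free-format records in the spirit of a 14C date database
# (laboratory codes and contents invented for the sketch).
records = [
    "T-1234 charcoal Trondheim 2450 BP settlement layer",
    "T-2001 peat Namdalen 5120 BP pollen zone",
    "T-3310 wood Trondheim harbour 890 BP",
]

# Build an inverted index: token -> set of record numbers.
index = defaultdict(set)
for i, rec in enumerate(records):
    for token in rec.lower().split():
        index[token].add(i)

def search(*terms):
    """Return all records containing every query term (AND query)."""
    hits = set.intersection(*(index.get(t.lower(), set()) for t in terms))
    return [records[i] for i in sorted(hits)]

print(search("trondheim"))
print(search("trondheim", "charcoal"))
```

Because the index is built from whatever tokens appear in a record, the input format can stay flexible, which is exactly the property the abstract highlights.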


1997 ◽  
Vol 9 (1-3) ◽  
pp. 1-16
Author(s):  
Tim Coles ◽  
Andrew Alexander ◽  
Gareth Shaw

Directories are a universal data source widely used in urban historical research. This paper reports on a series of experiments to explore the applicability of Optical Character Recognition (OCR) technology as a means of mass directory data entry.


2012 ◽  
Vol 6 (1-2) ◽  
pp. 111-119 ◽  
Author(s):  
Elspeth Haston ◽  
Robert Cubey ◽  
David J. Harris

Logistically, the data associated with biological collections can be divided into three main categories for digitisation: i) Label Data: the data appearing on the specimen on a label or annotation; ii) Curatorial Data: the data appearing on containers, boxes, cabinets and folders which hold the collections; iii) Supplementary Data: the data held separately from the collections in indices, archives and literature. Each of these categories has fundamentally different properties within the digitisation framework, with implications for the data capture process. These properties were assessed in relation to alternative data entry workflows and methodologies to create a more efficient and accurate system of data capture. We see a clear benefit in prioritising curatorial data in the data capture process: these data are often only available at the cabinets, they are in a format suitable for rapid data entry, and they result in an accurate cataloguing of the collections. Finally, the capture of a high-resolution digital image enables additional data entry to be separated into multiple sweeps, and optical character recognition (OCR) software can be used to sort images for fuller data entry, with the potential for more automated data entry in the future.
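The three categories and the capture priority argued for above can be modelled directly; the sketch below is a minimal illustration with invented class names, specimen identifiers and values.

```python
from dataclasses import dataclass
from enum import Enum

# Capture priority follows the argument above: curatorial data first,
# label data in later image-based sweeps, supplementary data linked last.
class Category(Enum):
    CURATORIAL = 1     # available at the cabinets, fast to enter
    LABEL = 2          # transcribed from the specimen image
    SUPPLEMENTARY = 3  # indices, archives and literature

@dataclass
class DataElement:
    specimen_id: str
    category: Category
    value: str

# Invented example elements for one specimen.
elements = [
    DataElement("E00123", Category.LABEL, "Coll. R. Brown, 1802"),
    DataElement("E00123", Category.CURATORIAL, "Folder: Proteaceae / Australia"),
    DataElement("E00123", Category.SUPPLEMENTARY, "Cited in regional flora"),
]

# Order the capture queue by category priority.
queue = sorted(elements, key=lambda e: e.category.value)
print([e.category.name for e in queue])  # prints ['CURATORIAL', 'LABEL', 'SUPPLEMENTARY']
```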


Author(s):  
Sally King ◽  
Juliette Pinon ◽  
Robyn Drinkwater

Digitisation of specimens at the Royal Botanic Garden Edinburgh (RBGE) has created nearly half a million imaged specimens. With data entry from the specimen labels on herbarium sheets identified as the rate-limiting step in the digitisation workflow, the majority of specimens are databased with minimal data (filing name and geographical region), leaving a need to add further label data (collector, collecting locality, collection date, etc.) to make the specimens research-ready. We are exploring a number of ways to complete data entry for specimens that have been imaged. These have included Optical Character Recognition (OCR), to identify meaningful specimen groupings and so increase the speed of data entry, and, more recently, citizen science platforms to provide accurate crowd-sourced transcriptions of specimen label data. We sent specimen images of the Australian flowering plants held at the RBGE herbarium to DigiVol (https://volunteer.ala.org.au/institution/index/21309224), the citizen science platform developed alongside the Atlas of Living Australia. In 29 expeditions, 156 citizen scientists completed collection label data entry for RBGE's 41,000 specimens of Australian flowering plants. We found that 95% of the transcriptions were completed by fewer than a third (27%) of the volunteers. Of the four volunteer experience levels in DigiVol, we found that the middle two, Collection Managers and Scientists, transcribed fewer specimens but also made fewer mistakes. We found that removing the filing name from the information provided with the expedition decreased the number of errors in the Museum Details section of the transcription, as the filing name was often entered as the label name regardless of whether it matched the label. The feedback we provided for each expedition was used to highlight common errors, to try to reduce their occurrence, and to inform the volunteers of what their transcriptions had revealed about this part of the collection.
We explore the citizen science transcription workflow, its rate-limiting steps and how we have worked to include the citizen science and OCR data on our online herbarium catalogue.


1997 ◽  
Vol 9 (1-3) ◽  
pp. 58-77
Author(s):  
Vitaly Kliatskine ◽  
Eugene Shchepin ◽  
Gunnar Thorvaldsen ◽  
Konstantin Zingerman ◽  
Valery Lazarev

In principle, printed source material should be made machine-readable with systems for Optical Character Recognition, rather than being typed once more. Off-the-shelf commercial OCR programs tend, however, to be inadequate for lists with a complex layout. The tax assessment lists that assess most nineteenth-century farms in Norway constitute one example among a series of valuable sources which can only be interpreted successfully with specially designed OCR software. This paper considers the problems involved in the recognition of material with a complex table structure, outlining a new algorithmic model based on 'linked hierarchies'. Within the scope of this model, a variety of tables and layouts can be described and recognized. The 'linked hierarchies' model has been implemented in the 'CRIPT' OCR software system, which successfully reads tables with a complex structure from several different historical sources.
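One way to picture a linked-hierarchy layout description is as trees of nested table regions, with cross-links between nodes in different hierarchies (e.g. a data row linking back to the column headers it fills). The sketch below is a loose interpretation with invented names, not the CRIPT data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Region:
    """A node in a layout hierarchy; `links` point into other hierarchies."""
    name: str
    children: List["Region"] = field(default_factory=list)
    links: List["Region"] = field(default_factory=list)

# Column hierarchy of a hypothetical tax assessment list.
farm = Region("farm name")
tax = Region("tax", [Region("old assessment"), Region("new assessment")])
columns = Region("columns", [Region("farm no."), farm, tax])

# Row hierarchy: a data row links to the column nodes its cells belong to.
row = Region("row 1")
row.links = [farm, tax.children[1]]

def leaf_count(r: Region) -> int:
    """Count leaf regions, i.e. the atomic cells to be recognised."""
    return 1 if not r.children else sum(leaf_count(c) for c in r.children)

print(leaf_count(columns))  # four leaf columns in this layout
```

Describing the layout as linked trees rather than a flat grid is what lets one model cover tables whose columns nest and whose rows span irregularly.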


2020 ◽  
Vol 2020 (1) ◽  
pp. 78-81
Author(s):  
Simone Zini ◽  
Simone Bianco ◽  
Raimondo Schettini

Rain removal from pictures taken under bad weather conditions is a challenging task that aims to improve the overall quality and visibility of a scene. The enhanced images usually constitute the input for subsequent Computer Vision tasks such as detection and classification. In this paper, we present a Convolutional Neural Network, based on the Pix2Pix model, for rain streak removal from images, with specific interest in evaluating the results of the processing operation with respect to the Optical Character Recognition (OCR) task. In particular, we present a way to generate a rainy version of the Street View Text Dataset (R-SVTD) for "text detection and recognition" evaluation in bad weather conditions. Experimental results on this dataset show that our model is able to outperform the state of the art in terms of two commonly used image quality metrics, and that it is capable of improving the performance of an OCR model in detecting and recognising text in the wild.
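PSNR (peak signal-to-noise ratio) is one of the commonly used image quality metrics of the kind referred to above; a higher value means the derained image is closer to the clean reference. Below is a pure-Python version over 8-bit images stored as nested lists, with toy pixel data, not the R-SVTD evaluation itself.

```python
import math

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-sized images (dB)."""
    n, se = 0, 0.0
    for row_a, row_b in zip(img_a, img_b):
        for a, b in zip(row_a, row_b):
            se += (a - b) ** 2
            n += 1
    mse = se / n
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

clean = [[100, 100], [100, 100]]
rainy = [[110, 95], [100, 105]]  # simulated rain-streak noise
print(round(psnr(clean, rainy), 2))
```

In an evaluation like the one described, PSNR would be computed between each derained output and its clean ground-truth image and averaged over the dataset.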

