Infrastructure for Long-term Preservation and OCR Analysis of Herbarium Images

Author(s):  
Nicolas Cazenave

Herbaria hold large numbers of specimens: approximately 22 million herbarium specimens exist as botanical reference objects in Germany, 20 million in France, and about 500 million worldwide. High-resolution digital images of these specimens take up substantial bandwidth and disk space. New methods of extracting information from specimen labels have been developed using Optical Character Recognition (OCR), but exploiting this technology for biological specimens is particularly complex because biological material appears in the image alongside the text, the vocabularies are non-standard, and the fonts vary in style and age. Much of the information is handwritten, and handwriting recognition is a less mature technology than OCR for printed text. Today, our system (eTDR, the European Trusted Digital Repository) provides OCR (using the Tesseract software) adapted to the requirements of herbarium specimen images and requires minimal installation in each institution. This is what we propose to make available to botanists through our portal. The goal for a museum is to be able to submit a large number of scanned images easily to a long-term archiving system, automatically obtain OCR texts, and retrieve them by full-text search on an open data portal. Most of the images are provided for reuse under CC-BY licenses; in each case, the reuse rights associated with the data are specified in the accompanying metadata. This pilot was an opportunity to test the long-term storage service eTDR provided by CINES. The services developed by EUDAT (B2SAFE, B2Handle) were used to facilitate the transfer of data to the storage repository and to provide indexing services for access to it. This workflow, which has been tested within the European project ICEDIG, is presented as a poster: see the document (Suppl. material 1).
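The retrieval step described above (full-text search over OCR output) can be sketched with a minimal inverted index. This is an illustrative sketch only; the specimen identifiers and label texts below are invented, and the eTDR portal's actual indexing stack is not described in the abstract.

```python
from collections import defaultdict

def build_index(documents):
    # Map each token to the set of document ids containing it --
    # the core structure behind full-text search over OCR output.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token.strip(".,;:")].add(doc_id)
    return index

def search(index, query):
    # Return ids of documents containing every query term.
    sets = [index.get(t.strip(".,;:"), set()) for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()

# Hypothetical OCR output for two specimen labels
docs = {
    "P00123": "Herbier de France. Ranunculus acris, Fontainebleau, 1897.",
    "P00456": "Ranunculus bulbosus, collected near Lyon, 1903.",
}
index = build_index(docs)
hits = search(index, "ranunculus")
```

A real deployment would add stemming, fuzzy matching for OCR errors, and a dedicated engine, but the index-then-intersect pattern is the same.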

1982 ◽  
Vol 30 (6) ◽  
pp. 504-511 ◽  
Author(s):  
M V Sofroniew ◽  
U Schrell

A procedure is described for the dilution and storage of antisera in glass staining jars into which whole slides are immersed for incubation during light microscopic neuropeptide immunocytochemistry. Diluted antisera, stored at 4 degrees C and continuously reused, were found to be stable for long periods (over 3 years to date) and consistently yielded high-quality staining in both single- and two-color immunoperoxidase staining. We found this procedure to be more convenient than conventional incubation procedures, allowing more rapid processing of large numbers of slides and reducing the loss of slides due to technical errors. The consistency and reproducibility of day-to-day staining were also improved. The immersion of whole slides in the antisera permitted the use of long incubation times (up to 7 days) without the sections drying out, which in many cases substantially enhanced the sensitivity of the staining obtained. A procedure for two-color immunoperoxidase staining is described using diaminobenzidine for a brown color and alpha-naphthol/pyronin for a red/purple color. We found the alpha-naphthol/pyronin reaction superior to the more commonly used 4-chloronaphthol reaction as a second color. The two-color staining was found useful not only for demonstrating nerve cell bodies stained different colors, but also for staining, in one color, nerve terminals that surround and contact nerve cell bodies stained another color.


In this paper we introduce a new Pashtu numerals dataset of scanned handwritten images and make it publicly available for scientific and research use. The Pashtu language is used by more than fifty million people for both oral and written communication, yet no effort has so far been devoted to an Optical Character Recognition (OCR) system for Pashtu. We introduce a new method for handwritten numeral recognition in Pashtu based on deep learning models, using convolutional neural networks (CNNs) for both feature extraction and classification. We assess the performance of the proposed CNN-based model and obtain a recognition accuracy of 91.45%.
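The feature-extraction half of a CNN (convolution, non-linearity, pooling) can be sketched in plain NumPy. This is a generic illustration of the mechanism, not the authors' architecture: the image size, kernel count, and random weights below are all placeholders.

```python
import numpy as np

def conv2d(image, kernels):
    # Valid convolution: image (H, W), kernels (K, kh, kw) -> (K, H-kh+1, W-kw+1)
    K, kh, kw = kernels.shape
    H, W = image.shape
    out = np.zeros((K, H - kh + 1, W - kw + 1))
    for k in range(K):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[k, i, j] = np.sum(image[i:i + kh, j:j + kw] * kernels[k])
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(x, size=2):
    # Non-overlapping max pooling over each feature map
    K, H, W = x.shape
    x = x[:, :H // size * size, :W // size * size]
    return x.reshape(K, H // size, size, W // size, size).max(axis=(2, 4))

# Toy 8x8 "digit" image and 4 random 3x3 kernels (illustrative only)
rng = np.random.default_rng(0)
img = rng.random((8, 8))
kernels = rng.standard_normal((4, 3, 3))
features = max_pool(relu(conv2d(img, kernels)))
```

In the paper's setting, stacks of such learned layers feed a final classification layer that outputs one of the ten Pashtu numeral classes.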


Author(s):  
Ingrid Dillo ◽  
Lisa De Leeuw

Open data and data management policies that call for the long-term storage and accessibility of data are becoming more and more commonplace in the research community, and with them the need for trustworthy data repositories to store and disseminate data is growing. CoreTrustSeal, a community-based, non-profit organisation, offers data repositories a core-level certification based on the DSA-WDS Core Trustworthy Data Repositories Requirements catalogue and procedures. This universal catalogue of requirements reflects the core characteristics of trustworthy data repositories. Core certification involves an uncomplicated process whereby data repositories supply evidence that they are sustainable and trustworthy: a repository first conducts an internal self-assessment, which is then reviewed by community peers; once the self-assessment is found adequate, the CoreTrustSeal board certifies the repository. The Seal is valid for a period of three years. Being a certified repository has several external and internal benefits. For instance, it improves the quality and transparency of internal processes, increases awareness of and compliance with established standards, builds stakeholder confidence, enhances the reputation of the repository, and demonstrates that the repository follows good practices. It also offers a benchmark for comparison and helps to determine the strengths and weaknesses of a repository. In the future we foresee a larger uptake across domains, not least because within the European Open Science Cloud the FAIR principles, and therefore also the certification of trustworthy digital repositories holding data, are becoming increasingly important. In addition, the CoreTrustSeal requirements will most probably become a European technical standard which can be used in procurement (currently under review by the European Commission).


2019 ◽  
Vol 8 (04) ◽  
pp. 24586-24602
Author(s):  
Manpreet Kaur ◽  
Balwinder Singh

Text classification is a crucial step for optical character recognition. The output of a scanner is non-editable: one cannot make changes to a scanned text image even if required. This motivates Optical Character Recognition (OCR), the process of converting scanned images of machine-printed or handwritten text into a computer-readable format. OCR involves several steps: image acquisition, pre-processing, segmentation, feature extraction, and classification. Incorrect classification amounts to garbage in, garbage out. Existing methods focus only on the classification of unmixed characters in Arabic, English, Latin, Farsi, Bangla, and Devanagari scripts. The proposed hybrid technique solves the mixed (machine-printed and handwritten) character classification problem. Classification is carried out on different kinds of everyday forms such as self-declaration forms, admission forms, verification forms, university forms, certificates, banking forms, dairy forms, and Punjab government forms. The proposed technique is capable of classifying handwritten and machine-printed text written in Gurumukhi script within mixed text. It has been tested on 150 different kinds of forms in Gurumukhi and Roman scripts, achieving 93% accuracy on mixed-character forms and 96% on unmixed-character forms, for an overall accuracy of 94.5%.
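The staged OCR process described above composes naturally into a pipeline. The sketch below is a generic skeleton with deliberately simplistic placeholder stages (a fixed binarization threshold, row-based segmentation, an ink-density feature, and a threshold rule standing in for a trained classifier); it is not the authors' Gurumukhi implementation.

```python
def preprocess(image):
    # Placeholder: binarize by a fixed threshold (real systems use adaptive methods)
    return [[1 if px > 128 else 0 for px in row] for row in image]

def segment(binary):
    # Placeholder: treat each row containing any ink as one text region
    return [row for row in binary if any(row)]

def extract_features(region):
    # Placeholder feature: ink density of the region
    return sum(region) / len(region)

def classify(feature):
    # Placeholder rule standing in for a trained classifier:
    # dense regions -> machine-printed, sparse -> handwritten
    return "printed" if feature > 0.5 else "handwritten"

def ocr_pipeline(image):
    # Chain the stages: preprocess -> segment -> features -> classify
    return [classify(extract_features(r)) for r in segment(preprocess(image))]

labels = ocr_pipeline([[200, 200, 200, 30], [0, 0, 0, 0], [180, 10, 10, 10]])
```

The value of the decomposition is that each stage can be improved or swapped independently, which is how hybrid printed/handwritten systems are typically built.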


2018 ◽  
Vol 2 ◽  
pp. e25699
Author(s):  
Matthew Collins ◽  
Gaurav Yeole ◽  
Paul Frandsen ◽  
Rebecca Dikow ◽  
Sylvia Orli ◽  
...  

iDigBio (Matsunaga et al. 2013) currently references over 22 million media files and stores approximately 120 terabytes of those media files co-located with our compute infrastructure. Using these images for scientific research is a logistical and technical challenge. Transferring large numbers of images requires programming skill, bandwidth, and storage space. While simple image transformations such as resizing and generating histograms are approachable on desktops and laptops, the neural networks commonly used for learning from images require server-based graphics processing units (GPUs) to run effectively. Using the GUODA (Global Unified Open Data Access) infrastructure, we have built a model pipeline for applying user-defined processing to any subset of the images stored in iDigBio. This pipeline runs on servers located in the Advanced Computing and Information Systems (ACIS) lab alongside the iDigBio storage system. We use Apache Spark, the Hadoop Distributed File System (HDFS), and Mesos to perform the processing. We have placed a Jupyter notebook server in front of this architecture, providing end users with an easy environment, with deep learning libraries for Python already loaded, in which to write their own models. Users can access the stored data and images, manipulate them according to their requirements, and make their work publicly available on GitHub. As an example of how this pipeline can be used in research, we applied a neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz et al. 2017). The model was trained with Smithsonian resources on their images and transferred to the GUODA infrastructure hosted at ACIS, which also houses iDigBio. We then applied this model to additional images in iDigBio, illustrating how these techniques can be applied to broad image corpora, potentially to notify other data publishers of contamination.
We present the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows this pipeline and infrastructure enable.
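The core pattern in this pipeline is mapping a user-defined classifier over a subset of records in parallel. The actual system uses Apache Spark over HDFS; the pure-Python sketch below shows the same map-over-subset shape with the standard library, and every name in it (the record fields, the substring heuristic standing in for the trained network) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def classify_mercury(record):
    # Stand-in for the trained neural network: flag records whose
    # (hypothetical) preparation metadata mentions a mercury compound.
    flagged = "mercur" in record.get("preparation", "").lower()
    return {**record, "flagged": flagged}

# Hypothetical subset of media records pulled from the store
records = [
    {"uuid": "a1", "preparation": "mercuric chloride solution"},
    {"uuid": "b2", "preparation": "pressed and dried"},
]

# Parallel map over the subset -- in the real pipeline this is an RDD/DataFrame
# transformation distributed by Spark across the cluster.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(classify_mercury, records))
```

Spark's advantage over this toy version is data locality: the function is shipped to executors running next to the HDFS blocks, so the 120 TB of images never leave the data center.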


2019 ◽  
Vol 24 (1) ◽  
pp. 205-220
Author(s):  
Christian Steiner ◽  
Robert Klugseder

The Digital Humanities project 'CANTUS NETWORK. Libri ordinarii of the Salzburg metropolitan province' undertakes research on the liturgy and music of the churches and monasteries of the medieval ecclesiastical province of Salzburg. Key sources are the liturgical 'prompt books', called libri ordinarii, which include a short form of more or less the entire rite of a diocese or a monastery. The workflow of the project is set in an environment called GAMS, a research data repository built for the long-term storage and presentation of data from the humanities. Digital editions of the libri ordinarii of the province were generated with the aim of enabling a comparative analysis of the different traditions. As a first step, the books were transcribed with strict rule-based tags in Microsoft Word and transformed to TEI using the community's XSLT stylesheets and a Java-based script. Subsequently, Semantic Web technologies were deployed to foster graph-based search and analysis of the structured data. Possible future work on the topic is facilitated by the dissemination of content levels as Linked Open Data. Further analysis is conducted with the help of Natural Language Processing methods in order to find textual similarities and differences between the libri ordinarii.
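The transformation step (rule-based tags in Word, converted to TEI) can be illustrated with a small stdlib sketch. The project itself uses the TEI community's XSLT stylesheets and a Java-based script; the tag vocabulary and input format below are invented purely to show the tagged-text-to-TEI mapping.

```python
import xml.etree.ElementTree as ET

def to_tei(tagged_lines):
    # Hypothetical sketch: turn rule-based "TAG: text" lines into a minimal
    # TEI-like <body>/<div>/<item> structure. Tag names are illustrative,
    # not the project's actual schema.
    body = ET.Element("body")
    div = ET.SubElement(body, "div", {"type": "ordinal"})
    for line in tagged_lines:
        tag, _, text = line.partition(": ")
        item = ET.SubElement(div, "item", {"ana": tag.lower()})
        item.text = text
    return ET.tostring(body, encoding="unicode")

tei = to_tei([
    "RUBRIC: In vigilia natalis domini",
    "INCIPIT: Dominus dixit ad me",
])
```

Keeping the transcription rules strict is what makes such a mechanical mapping possible; the resulting TEI can then be lifted into RDF for the graph-based search the project describes.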


Author(s):  
Ruwanmini ◽  
Liyanage ◽  
Karunarathne ◽  
Dias ◽  
Nandasara

Sinhala inscriptions are one of the major sources of information about ancient Sri Lanka, and revealing the information they contain is a huge challenge for archaeologists. This research paper focuses on Sinhala character recognition in ancient Sri Lankan inscriptions. Our intention is to ease this process by developing a web-based application that enables recognition of inscription characters from scanned images and stores them in an inscription database. Using this system, people can track the geographical locations of inscriptions. Epigraphists can easily obtain Sinhala interpretations of Sri Lankan inscriptions via the optical character recognition feature of our system. Our work on this research project benefits researchers in the field of archaeology, epigraphists, and members of the general public who are interested in this subject. The inscription site tracking module presents a map that users can navigate easily, tracking the locations of inscriptions. This paper presents the architecture of this Sinhala epigraphy system.


Author(s):  
Chit San Lwin ◽  
Wu Xiangqian

Optical Character Recognition (OCR) is a technology widely adopted for the automatic conversion of hardcopy text to editable text. The language dependence of the technology makes it far less developed for less widespread languages such as Myanmar. The uniqueness and complexity of the Myanmar writing system, with its touching and complex characters, have also continued to pose serious challenges to OCR investigators. In this paper, we propose a new technique for developing a Myanmar OCR system. Our technique implements skew angle detection and deskewing, noisy border correction, extra page elimination, and line segmentation for scanned images of Myanmar text. The performance of the proposed method is tested on 430 documents comprising different printed and handwritten Myanmar texts with various fonts and sizes, multi-column layouts, tables, stamps or photos, and background effects. Our method gives an accuracy of 100% for line segmentation and 99.92% for skew angle detection and deskewing. The ability of our method to effectively perform global and local skew angle detection, deskewing, and line segmentation in different handwritten and digital text images of the Myanmar character set with high accuracy confirms the robustness, reliability, and suitability of the technique for application to many other related languages.
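Skew angle detection is commonly done with the projection-profile method: for each candidate angle, project the ink pixels onto rows and pick the angle that makes the row histogram sharpest. The sketch below illustrates that general technique in NumPy; it is not the authors' specific algorithm, and the angle range, step, and toy image are illustrative.

```python
import numpy as np

def estimate_skew(binary, angles=np.arange(-5, 5.25, 0.25)):
    # Projection-profile method: shear ink coordinates by each candidate
    # angle, histogram the resulting row indices, and keep the angle that
    # maximizes histogram variance (sharpest separation of text lines).
    ys, xs = np.nonzero(binary)
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        shifted = np.round(ys + xs * np.tan(np.radians(a))).astype(int)
        hist = np.bincount(shifted - shifted.min())
        score = hist.var()
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle

# Toy page: three horizontal text lines skewed by about -2 degrees
img = np.zeros((60, 200), dtype=np.uint8)
for row in (10, 25, 40):
    for x in range(200):
        img[row + int(x * np.tan(np.radians(-2))), x] = 1

angle = estimate_skew(img)  # the correcting angle, close to +2 degrees
```

Deskewing then rotates the page by the detected angle before line segmentation, which is why the two accuracies are reported together in the abstract.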


2019 ◽  
Vol 4 (2) ◽  
pp. 156-172
Author(s):  
Henrique Machado dos Santos

This study discusses the implementation of archival repositories compliant with the Open Archival Information System and the need to audit them in order to assess their trustworthiness. To that end, a bibliographic survey of previously published material was carried out, selecting: books addressing the perspectives of archival science in the digital era and the challenge of trustworthy documentary custody; technical publications such as International Organization for Standardization norms and audit standards; and scientific articles retrieved through the Google Scholar search tool, using thematic searches related to the preservation of digital archival documents, trusted digital repositories, information auditing, and archival auditing. The archival repository is the prism of the discussion, while the comparison between audit standards is the guiding category; the result is a non-systematic review article. The following audit standards are analyzed: Trustworthy Repository Audit & Certification: Criteria and Checklist; the Catalogue of Criteria for Trusted Digital Repositories of the Network of Expertise in long-term STORage; the Digital Repository Audit Method Based on Risk Assessment; and Audit and Certification of Trustworthy Digital Repositories. Finally, the comparison between the standards shows that Audit and Certification of Trustworthy Digital Repositories is the most suitable for auditing digital archival repositories.


2004 ◽  
Vol 72 (9) ◽  
pp. 5478-5482 ◽  
Author(s):  
Kalidas Paul ◽  
Amalendu Ghosh ◽  
Nilanjan Sengupta ◽  
Rukhsana Chowdhury

Spontaneous nontoxigenic mutants of highly pathogenic Vibrio cholerae O1 strains accumulate in large numbers during long-term storage of the cultures in agar stabs. In these mutants, production of the transcriptional regulator ToxR was reduced due to the presence of a mutation in the ribosome-binding site immediately upstream of the toxR open reading frame. Consequently, the ToxR-dependent virulence regulon was turned off, with concomitant reduction in the expression of cholera toxin and toxin-coregulated pilus. An intriguing feature of these mutants is that they have a competitive fitness advantage when grown in competition with the parent strains in stationary-phase cocultures which is independent of RpoS, the only locus known to be primarily associated with acquisition of a growth advantage phenotype in bacteria.

