Flora Prepper: Preparing floras for morphological parsing and integration

Author(s):  
Jocelyn Pender

The increased availability of digital floras and the application of optical character recognition (OCR) to digitized texts have created exciting opportunities for flora data mining. For example, the software package CharaParser has been developed for the semantic annotation of morphological descriptions from taxonomic treatments (Cui 2012). However, after digitization and OCR processing, and before parsing of morphological treatments can begin, content types must be annotated (i.e., whether a passage of text represents names, morphology, discussion, or distribution). In addition to enabling morphological parsing, content-type annotation also facilitates content search and data linkage. For example, by annotating pieces of a floral treatment, assertions of the same type from various floras can be combined into a single document (i.e., a "mash-up" floral treatment). Several products and pipelines have been developed for the semantic annotation, or mark-up, of taxonomic documents (e.g., GoldenGATE, FlorML; Sautter et al. 2012, Hamann et al. 2014). However, these products lack the combination of ease of implementation (e.g., the ability to run as a script in a programmatic workflow) and modern parsing methods, such as text mining and Natural Language Processing (NLP) approaches. Here I present a pilot project, implemented in Python, that applies text mining and NLP approaches to flora mark-up. I will describe the success of the project and summarize lessons learned, especially in relation to previous flora mark-up projects. Annotation of existing flora documents is an essential step towards building next-generation floras (i.e., mash-ups and enhanced floras as platforms) and enables automated trait extraction. Building an easy-to-use access point to modern text mining and NLP techniques for botanical literature will allow for more flexible and responsive flora annotation, and is an important step towards realizing botanical data integration goals.
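
As a sketch of what this kind of content-type annotation can look like in Python, the toy classifier below labels treatment lines as name, morphology, discussion, or distribution. The scikit-learn pipeline and the training lines are illustrative assumptions; the abstract does not specify the pilot's actual model or features.

```python
# Minimal sketch of a flora content-type classifier (illustrative only;
# the abstract does not name Flora Prepper's actual model or features).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: treatment lines labeled by content type.
lines = [
    "Carex aquatilis Wahlenb.",                        # name
    "Leaves 2-5 mm wide, glabrous; culms 20-100 cm.",  # morphology
    "Often confused with C. lenticularis.",            # discussion
    "Wet meadows and shorelines; Alaska to Quebec.",   # distribution
]
labels = ["name", "morphology", "discussion", "distribution"]

classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(max_iter=1000),
)
classifier.fit(lines, labels)

print(classifier.predict(["Petals 4, white; stamens numerous."]))
```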

2021
Author(s):
Gian Maria Zaccaria
Vito Colella
Simona Colucci
Felice Clemente
Fabio Pavone
...

BACKGROUND The unstructured nature of medical data from real-world (RW) patients, and the scarce accessibility of integrated systems for researchers, restrain the use of RW information for clinical and translational research purposes. Natural language processing (NLP) can help transpose unstructured reports into electronic health records (EHRs), thus promoting their standardization and sharing.
OBJECTIVE We aimed to design a tool that captures pathological features directly from hemolymphopathology reports and automatically records them in electronic case report forms (eCRFs).
METHODS We exploited optical character recognition (OCR) and NLP techniques to develop a web application, named ARGO (Automatic Record Generator for Oncology), that recognizes unstructured information from paper-based diagnostic reports of diffuse large B-cell lymphomas (DLBCL), follicular lymphomas (FL), and mantle cell lymphomas (MCL). ARGO was programmed to match data against the standard diagnostic criteria of the National Institutes of Health, automatically assign a diagnosis and, via an application programming interface, populate specific eCRFs on the REDCap platform according to the College of American Pathologists templates. A selection of 239 reports (106 DLBCL, 79 FL, and 54 MCL) from the Pathology Unit at the IRCCS - Istituto Tumori “Giovanni Paolo II” of Bari (Italy) was used to assess ARGO's performance in terms of accuracy, precision, recall, and F1-score.
RESULTS Applying our workflow, we successfully converted 233 paper-based reports into corresponding eCRFs incorporating structured information about the diagnosis, tissue of origin, anatomical site of the sample, major molecular markers, and cell-of-origin subtype. Overall, ARGO showed high performance (nearly 90% accuracy, precision, recall, and F1-score) in capturing the report identification number, biopsy date, specimen type, diagnosis, and additional molecular features.
CONCLUSIONS We developed and validated an easy-to-use tool that converts RW paper-based diagnostic reports of major lymphoma subtypes into structured eCRFs. ARGO is cheap, feasible, and easily transferable into daily practice to generate REDCap-based EHRs for clinical and translational research purposes.
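
A compressed sketch of the OCR-extract-import loop described above is shown below. The regular expressions, field names, file name, URL, and token are hypothetical placeholders; only the REDCap record-import call follows that platform's documented API, and the paper's actual matching against diagnostic criteria is far richer.

```python
# Sketch of an ARGO-like workflow: OCR a scanned report, extract a few
# fields with regexes, and import the record via the REDCap API.
import json
import re

import pytesseract
import requests
from PIL import Image

def extract_fields(report_text: str) -> dict:
    """Pull a few structured fields out of raw OCR text (patterns invented)."""
    patterns = {
        "report_id": r"Report\s*(?:No\.|#)\s*(\S+)",
        "biopsy_date": r"Biopsy date[:\s]+(\d{2}/\d{2}/\d{4})",
        "diagnosis": r"Diagnosis[:\s]+(.+)",
    }
    return {
        field: (m.group(1).strip() if (m := re.search(p, report_text, re.I)) else "")
        for field, p in patterns.items()
    }

text = pytesseract.image_to_string(Image.open("report_scan.png"), lang="ita")
record = {"record_id": "1", **extract_fields(text)}

# REDCap's record-import endpoint expects a flat JSON list of records.
response = requests.post(
    "https://redcap.example.org/api/",  # placeholder instance URL
    data={
        "token": "YOUR_API_TOKEN",      # placeholder project token
        "content": "record",
        "action": "import",
        "format": "json",
        "type": "flat",
        "data": json.dumps([record]),
    },
)
print(response.status_code, response.text)
```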


Author(s):  
Jeff Blackadar

Bibliothèque et Archives Nationales du Québec digitally scanned and converted to text a large collection of newspapers, creating a resource of tremendous potential value to historians. Unfortunately, the text files are difficult to search reliably because of the many errors introduced by the optical character recognition (OCR) conversion process. This digital history project applied natural language processing in an R program to create a new and useful index of this corpus of digitized content despite the OCR-related errors. The project used editions of The Equity, published in Shawville, Quebec, since 1883. The program extracted the names of all person, location, and organization entities that appeared in each edition. Each entity was catalogued in a database and related to the edition of the newspaper it appeared in. The database was published to a public website so that other researchers can use it. The resulting index, or finding aid, lets researchers access The Equity in ways beyond full-text searching. People, locations, and organizations appearing in The Equity are listed on the website, and each entity links to a page listing all of the issues that entity appeared in, as well as the other entities that may be related to it. Rendering the text files of each scanned newspaper into entities and indexing them in a database allows the newspaper's content to be explored by entity name and type rather than as a set of large text files. Website: http://www.jeffblackadar.ca/graham_fellowship/corpus_entities_equity/
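
The indexing idea generalizes beyond the original implementation. As an illustration only (the project itself was written in R), a minimal Python re-sketch with spaCy and SQLite might look like this; the table layout, entity types, and sample text are invented.

```python
# Illustrative re-sketch of the entity-indexing idea (original project in R).
import sqlite3

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

db = sqlite3.connect("equity_index.db")
db.execute("""CREATE TABLE IF NOT EXISTS entity_mentions (
    name TEXT, type TEXT, edition TEXT)""")

def index_edition(edition_date: str, ocr_text: str) -> None:
    """Record every person/location/organization found in one edition."""
    doc = nlp(ocr_text)
    rows = [(ent.text, ent.label_, edition_date)
            for ent in doc.ents
            if ent.label_ in ("PERSON", "GPE", "LOC", "ORG")]
    db.executemany("INSERT INTO entity_mentions VALUES (?, ?, ?)", rows)
    db.commit()

index_edition("1883-06-14", "Mr. John Smith of Shawville visited Ottawa ...")

# The finding aid is then a query away: all editions mentioning an entity.
for row in db.execute(
        "SELECT DISTINCT edition FROM entity_mentions WHERE name = ?",
        ("John Smith",)):
    print(row[0])
```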


Author(s):  
Elvys Linhares Pontes
Luis Adrián Cabrera-Diego
Jose G. Moreno
Emanuela Boros
Ahmed Hamdi
...

Digital libraries play a key role in cultural heritage, providing access to our culture and history by indexing books and historical documents (newspapers and letters). Digital libraries use natural language processing (NLP) tools to process these documents and enrich them with meta-information, such as named entities. Despite recent advances, most NLP models are built for specific languages and contemporary documents, and are not optimized for historical material that may, for instance, contain language variations and optical character recognition (OCR) errors. In this work, we focus on the entity linking (EL) task, which is fundamental to the indexation of documents in digital libraries. We developed a Multilingual Entity Linking architecture for HIstorical preSS Articles composed of multilingual analysis, OCR correction, and filter analysis to alleviate the impact of historical-document noise on the EL task. The source code is publicly available. Experiments were run over two collections of historical documents covering five European languages (English, Finnish, French, German, and Swedish). Results show that our system improved the overall performance for all languages and datasets, achieving an F-score@1 of up to 0.681 and an F-score@5 of up to 0.787.
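
The architecture is described only at a high level here, so the following is no more than a skeleton of the three stages named in the abstract (multilingual analysis, OCR correction, filter analysis). Every name, threshold, and function body below is a placeholder; the authors' real components are in their published source code.

```python
# Skeleton of the three-stage EL pipeline named above; all bodies are stubs.
from dataclasses import dataclass

@dataclass
class Mention:
    surface: str      # entity string as it appears in the press article
    language: str     # detected language of the article
    candidates: list  # knowledge-base candidates as {"id", "score"} dicts

def correct_ocr(surface: str) -> str:
    """Placeholder OCR-correction step (e.g., a character-level edit model)."""
    return surface.replace("ﬁ", "fi")  # toy fix for one common OCR confusion

def link(mention: Mention, top_k: int = 5) -> list:
    """Rank candidates for a (corrected) mention, then filter weak ones."""
    _ = correct_ocr(mention.surface)  # corrected surface would drive lookup
    ranked = sorted(mention.candidates, key=lambda c: c["score"], reverse=True)
    # Filter analysis: drop implausible candidates before returning top k.
    return [c for c in ranked if c["score"] > 0.1][:top_k]

m = Mention("Napoléon", "fr", [{"id": "Q517", "score": 0.9}])
print(link(m))
```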


2021
Vol 48 (2)
Author(s):
Pooja Jain
Dr. Kavita Taneja
Dr. Harmunish Taneja
...

Optical character recognition (OCR) is a very active research area touching many challenging fields, including pattern recognition, natural language processing (NLP), computer vision, biomedical informatics, machine learning (ML), and artificial intelligence (AI). This computational technology extracts text in an editable format (MS Word/Excel, text files, etc.) from PDF files, scanned or handwritten documents, and images (photographs, advertisements, and the like) for further processing, and has been utilized in many real-world applications, including banking, education, insurance, finance, healthcare, and keyword-based search in documents. Many OCR toolsets are available, spanning open-source tools, proprietary products, and online services. This paper provides a comparative study of various OCR toolsets across a variety of parameters.
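
A comparative study of this kind needs a common measurement harness. A minimal sketch is below, assuming Tesseract via pytesseract as the one wired-up engine and a character-level similarity ratio as a crude accuracy proxy; the paper's own parameters and toolset list are broader.

```python
# Minimal harness for comparing OCR engines on one image against a
# ground-truth transcription. Other tools would be added to `engines`.
import difflib

import pytesseract
from PIL import Image

def similarity(ocr_text: str, truth: str) -> float:
    """Crude accuracy proxy: character-level similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, ocr_text, truth).ratio()

engines = {
    "tesseract": lambda img: pytesseract.image_to_string(img),
    # "some_cloud_api": lambda img: ...,  # proprietary/online services
}

image = Image.open("sample_scan.png")           # placeholder test image
ground_truth = open("sample_scan.txt", encoding="utf-8").read()

for name, run in engines.items():
    print(f"{name}: {similarity(run(image), ground_truth):.3f}")
```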


2021
Author(s):
Michael Schwartz

Many companies have tried to automate data collection for handheld digital multimeters (DMMs) using optical character recognition (OCR). Only recently have companies tried to perform this task using artificial intelligence (AI) technology, Cal Lab Solutions being one of them in 2020. But when we developed our first prototype application, we discovered how difficult it is to get a good value for every measurement and test point. A year later, with lessons learned and better software in hand, this paper continues that AI project. In Beta 1, we learned how hard it is for AI to read segmented displays: there are no pre-trained models for this type of display, so we needed to train our own. That required testing thousands of images, so we changed the scope of the project to a continual-learning AI project. This paper covers how we built our continual-learning AI model, to show how any lab with a webcam can start automating those handheld DMMs with software that gets smarter over time.
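
A continual-learning loop of the sort described can be reduced to a small sketch: read frames from a webcam, keep confident predictions, and bank low-confidence frames as future training data. The model stub and the confidence threshold below are assumptions, not details from the paper.

```python
# Sketch of a continual-learning capture loop for a webcam-monitored DMM.
import os
import time

import cv2

CONFIDENCE_FLOOR = 0.85  # assumed cut-off for "good enough" predictions

def predict_reading(frame):
    """Placeholder for the trained segmented-display recognizer."""
    return "4.998", 0.91  # (reading, confidence)

os.makedirs("to_label", exist_ok=True)
cam = cv2.VideoCapture(0)
try:
    while True:
        ok, frame = cam.read()
        if not ok:
            break
        reading, confidence = predict_reading(frame)
        if confidence >= CONFIDENCE_FLOOR:
            print(f"DMM reading: {reading} ({confidence:.0%} confident)")
        else:
            # Low-confidence frames become the next round of training data.
            cv2.imwrite(f"to_label/{int(time.time() * 1000)}.png", frame)
        time.sleep(0.5)
finally:
    cam.release()
```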


Author(s):  
Karthikeyan P. ◽  
Karunakaran Velswamy ◽  
Pon Harshavardhanan ◽  
Rajagopal R. ◽  
JeyaKrishnan V. ◽  
...  

Machine learning is the part of artificial intelligence that makes machines learn without being explicitly programmed. Machine learning applications have helped build the modern world. Machine learning techniques are mainly classified into three kinds: supervised, unsupervised, and semi-supervised. Machine learning is an interdisciplinary field that can be applied in different areas, including science, business, and research. Supervised techniques are applied in agriculture, email spam and malware filtering, online fraud detection, optical character recognition, natural language processing, and face detection. Unsupervised techniques are applied in market segmentation, sentiment analysis, and anomaly detection. Deep learning is being utilized on sound, image, video, time-series, and text data. This chapter covers applications of various machine learning techniques in social media, agriculture, and task scheduling in distributed systems.
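
As a toy contrast between the first two settings (not drawn from the chapter), the scikit-learn snippet below fits a supervised classifier with labels and an unsupervised clustering without them.

```python
# Supervised vs. unsupervised learning on the same dataset (illustrative).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Supervised: labels guide the model (as in spam filtering or OCR).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("supervised accuracy:", clf.score(X_te, y_te))

# Unsupervised: structure is inferred without labels (as in segmentation).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == k).sum()) for k in range(3)])
```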


2019
Vol ahead-of-print (ahead-of-print)
Author(s):
Abdullah Al-Yami
Muizz O. Sanni-Anibire

Purpose – Although there is a boom in the construction industry of the Kingdom of Saudi Arabia (KSA), it has yet to fully adopt building information modeling (BIM), which has received much attention in the US, UK, and Australian construction industries. The purpose of this paper is therefore to present the current state of the art of BIM implementation in Saudi Arabia, as well as its perceived benefits and barriers, through a case study.
Design/methodology/approach – The study presents a broad overview of BIM, the construction industry in KSA, and the research and implementation of BIM in KSA. It then establishes the perceived benefits and barriers of BIM implementation through a case study of a local architecture, engineering, and construction (AEC) firm. A questionnaire survey was used to obtain lessons learned from the BIM team of the pilot project, and the responses were analyzed using the relative importance index (RII) approach.
Findings – The findings include the lack of policy initiatives in KSA to enforce BIM in the construction industry, as well as the lack of sufficient research on BIM in KSA. The case study further revealed that the most important benefit of BIM adoption is “detection of inter-disciplinary conflicts in the drawings to reduce error, maintain design intent, control quality and speed up communication,” whereas the most important barrier is “the need for re-engineering many construction projects for successful transition towards BIM.”
Originality/value – The study provides a background for enhanced research towards the implementation of BIM in Saudi Arabia and demonstrates the potential benefits of, and barriers to, BIM implementation.
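
The RII rankings above rest on a standard survey formula, typically computed as RII = ΣW / (A × N), where W are the respondents' ratings, A is the highest rating on the scale, and N is the number of respondents. A minimal sketch with invented ratings:

```python
# Relative importance index: RII = sum(W) / (A * N). The ratings below are
# invented; the paper's survey data are not reproduced here.
def rii(ratings, scale_max=5):
    return sum(ratings) / (scale_max * len(ratings))

clash_detection_benefit = [5, 5, 4, 5, 4, 5]  # hypothetical ratings
reengineering_barrier = [5, 4, 5, 5, 5, 5]    # hypothetical ratings
print(f"benefit RII: {rii(clash_detection_benefit):.2f}")
print(f"barrier RII: {rii(reengineering_barrier):.2f}")
```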


2019
Vol 72 (2)
pp. 179-197
Author(s):
Omri Suissa
Avshalom Elmalech
Maayan Zhitomirsky-Geffet

Purpose – Digitization of historical documents is a challenging task in many digital humanities projects. A popular approach is to scan the documents into images and then convert the images into text using optical character recognition (OCR) algorithms. However, the outcome of OCR processing of historical documents is usually inaccurate and requires post-processing error correction. The purpose of this paper is to investigate how crowdsourcing can be utilized to correct OCR errors in historical text collections, and which crowdsourcing methodology is most effective in different scenarios and for various research objectives.
Design/methodology/approach – A series of experiments with different micro-task structures and text lengths was conducted with 753 workers on the Amazon Mechanical Turk platform. The workers had to fix OCR errors in a selected historical text. To analyze the results, new accuracy and efficiency measures were devised.
Findings – The analysis suggests that, in terms of accuracy, the optimal text length is medium (paragraph-size) and the optimal structure is two-phase with a scanned image. In terms of efficiency, the best results were obtained using longer texts in a single-stage structure with no image.
Practical implications – The study provides practical recommendations to researchers on how to build the optimal crowdsourcing task for OCR post-correction. The developed methodology can also be utilized to create gold-standard historical texts for automatic OCR post-correction.
Originality/value – This is the first attempt to systematically investigate the influence of various factors on crowdsourcing-based OCR post-correction and to propose an optimal strategy for this process.
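
Scoring such corrections requires a metric. The sketch below shows one simple possibility, the fraction of original character errors a worker's correction removes; it is illustrative only and is not one of the accuracy or efficiency measures the authors devised.

```python
# A crude correction-gain score for OCR post-correction (illustrative;
# not the paper's devised measures).
import difflib

def char_errors(text: str, truth: str) -> int:
    """Approximate character error count via a diff against ground truth."""
    matcher = difflib.SequenceMatcher(None, text, truth)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return max(len(text), len(truth)) - matched

def correction_gain(ocr: str, corrected: str, truth: str) -> float:
    """Fraction of the OCR's character errors removed by the correction."""
    before, after = char_errors(ocr, truth), char_errors(corrected, truth)
    return 1.0 if before == 0 else (before - after) / before

print(correction_gain("Tne quickk brown f0x", "The quick brown fox",
                      "The quick brown fox"))  # 1.0 = all errors fixed
```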


2014
Vol 3 (3)
pp. 236-251
Author(s):
Anne Brosnan

Purpose – The purpose of this paper is to investigate and review how the practices of Lesson Study fare in enhancing the professional capabilities of mathematics teachers when introduced as part of a pilot project reforming the post-primary mathematics curriculum in Ireland.
Design/methodology/approach – In total, 250 mathematics teachers teaching Junior and Senior Cycle mathematics in 24 post-primary schools constitute the population of this study. The participating schools are representative of the range of all post-primary schools in Ireland.
Findings – Lesson Study has an important role to play in the continuing professional development of teachers in the 24 post-primary schools and beyond in Ireland. An investigation of the mathematics teachers' engagement with Lesson Study reveals considerable initial resistance. Reasons for this resistance are examined, and the lessons learned from the steps taken to deal with it are reviewed. Lesson Study is an innovation that teachers need to understand deeply and practice regularly through mutual support if they are to avail of it fruitfully. Accordingly, further approaches need to be explored, not least the important role of school leadership, to adapt Lesson Study more fully and more productively to the professional cultures of teaching in Ireland.
Originality/value – An analytic and evaluative account of the challenges and complexities involved in introducing Lesson Study to post-primary schools in Ireland is presented for the first time.

