A Workflow for Data Extraction from Digitized Herbarium Specimens

Biodiversity Information Science and Standards ◽

10.3897/biss.3.35190 ◽

2019 ◽

Vol 3 ◽

Author(s):

Sohaib Younis ◽

Marco Schmidt ◽

Bernhard Seeger ◽

Thomas Hickler ◽

Claus Weiland

Keyword(s):

Plant Material ◽

Data Extraction ◽

Species Recognition ◽

Training Data ◽

Herbarium Specimens ◽

Plant Organs ◽

Unseen Data ◽

Working Groups ◽

The Many ◽

Material Extraction

Based on our own work on species and trait recognition and complementary studies from other working groups, we present a workflow for data extraction from digitized herbarium specimens using convolutional neural networks. Digitized herbarium sheets contain preserved plant material as well as additional objects: the label containing information on the collection event, annotations such as revision labels or notes on material extraction, identifiers such as barcodes or numbers, envelopes for loose plant material, and often scale bars and color charts used in the digitization process. In order to treat these objects appropriately, segmentation techniques (Triki et al. 2018) will be applied to localize and identify the different kinds of objects for specific treatments. Detecting the presence of plant organs such as leaves, flowers or fruits is already a first step in data extraction, potentially useful for phenological studies. Plant organs will be subject to routines for quantitative (Gaikwad et al. 2018) and qualitative (Younis et al. 2018) trait recognition. Text-based objects can be treated as described by Kirchhoff et al. (2018), using OCR techniques and considering the many collection-specific terms and abbreviations as described in Schröder (2019). Additionally, species recognition (Younis et al. 2018) will be applied in order to help further identification of incompletely identified collection items or to detect possible misidentifications. All steps described above need sufficient training data, including labelling, which may be obtained from collection metadata and trait databases.
In order to deal with new incoming digitized collections, unseen data or categories, we propose implementing a new deep learning approach, so-called Lifelong Learning: past knowledge of the network is dynamically saved in latent space using an autoencoder and generatively replayed while the network is trained on new tasks. This enables it to solve complex image processing tasks and to incrementally learn new classes and knowledge without forgetting former knowledge.


Generating Masks for Image Segmentation in Digitized Herbarium Specimens

Biodiversity Information Science and Standards ◽

10.3897/biss.3.37479 ◽

2019 ◽

Vol 3 ◽

Author(s):

Alexander White ◽

Rebecca Dikow ◽

Makinnon Baugh ◽

Abby Jenkins ◽

Paul Frandsen

Keyword(s):

Image Segmentation ◽

Background Noise ◽

Plant Material ◽

Training Data ◽

Herbarium Specimens ◽

Complex Information ◽

Museum Data ◽

Specimen Material ◽

Herbarium Sheet ◽

Pattern Shape

Digitized herbarium images contain complex information unrelated to the shape and color of the specimens represented within them. This information can contribute a substantial amount of noise if one is to use the image as a proxy for pattern, shape, or color of the specimen. Image segmentation, whereby the specimen material is partitioned from the background (e.g., herbarium sheet, label, color ramp), offers one possible solution, yet training data for image segmentation of herbarium specimens is nonexistent. We present a pipeline for generating training data for image segmentation tasks along with a novel dataset of highly resolved image masks segmenting plant material from background noise. This dataset can be used to train neural networks to segment plant material in herbarium sheets more generally, and our method is applicable to other museum data sources where masking may be useful for quantitative analysis of patterns and shapes.


Microwave drying of plant material for herbarium specimens and genetic analysis

Taxon ◽

10.12705/624.33 ◽

2013 ◽

Vol 62 (4) ◽

pp. 790-797 ◽

Cited By ~ 3

Author(s):

Tonya A. Lander ◽

Bernadeta Dadonaite ◽

Alex K. Monro

Keyword(s):

Genetic Analysis ◽

Plant Material ◽

Microwave Drying ◽

Herbarium Specimens


A review: preprocessing techniques and data augmentation for sentiment analysis

Computational Social Networks ◽

10.1186/s40649-020-00080-x ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Huu-Thanh Duong ◽

Tram-Anh Nguyen-Thi

Keyword(s):

Machine Learning ◽

Sentiment Analysis ◽

Supervised Learning ◽

Data Augmentation ◽

Original Data ◽

Training Data ◽

Unseen Data ◽

Augmentation Techniques ◽

User Intervention

In the literature, machine learning-based studies of sentiment analysis usually rely on supervised learning, which requires pre-labeled datasets that are large enough for the domain in question. Such datasets are tedious, expensive and time-consuming to build, and it remains hard to handle unseen data. This paper applies semi-supervised learning to Vietnamese sentiment analysis, for which datasets are limited. We summarize many preprocessing techniques used to clean and normalize data, together with negation handling and intensification handling, to improve performance. Moreover, data augmentation techniques, which generate new data from the original data to enrich training data without user intervention, are also presented. In experiments, we evaluated various aspects and obtained competitive results, which may motivate further proposals.


Detection and annotation of plant organs from digitised herbarium scans using deep learning

Biodiversity Data Journal ◽

10.3897/bdj.8.e57090 ◽

2020 ◽

Vol 8 ◽

Author(s):

Sohaib Younis ◽

Marco Schmidt ◽

Claus Weiland ◽

Stefan Dressler ◽

Bernhard Seeger ◽

...

Keyword(s):

Deep Learning ◽

Automatic Recognition ◽

Herbarium Specimens ◽

Plant Organs ◽

Detection Model ◽

Plant Organ ◽

Large Numbers ◽

Bounding Boxes ◽

Advanced Computer ◽

Extract Information

As herbarium specimens are increasingly becoming digitised and accessible in online repositories, advanced computer vision techniques are being used to extract information from them. The presence of certain plant organs on herbarium sheets is useful information in various scientific contexts and automatic recognition of these organs will help mobilise such information. In our study, we use deep learning to detect plant organs on digitised herbarium specimens with Faster R-CNN. For our experiment, we manually annotated hundreds of herbarium scans with thousands of bounding boxes for six types of plant organs and used them for training and evaluating the plant organ detection model. The model worked particularly well on leaves and stems, while flowers were also present in large numbers in the sheets, but were not equally well recognised.
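The abstract does not state how predicted bounding boxes were scored against the manual annotations. A common choice for detection models is intersection over union (IoU), so the following is only a minimal sketch under that assumption; the `Box` layout `(x_min, y_min, x_max, y_max)` and the function name `iou` are illustrative and not taken from the paper.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Return the intersection-over-union overlap of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero so that disjoint boxes yield no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs are treated as non-overlapping.
    return inter / union if union > 0 else 0.0
```

In such a setup, a predicted organ box would typically count as correct when its IoU with an annotated box of the same class exceeds a chosen threshold; the threshold itself is a design decision not given in the abstract.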


Effective Training Data Extraction Method to Improve Influenza Outbreak Prediction from Online News Articles (Preprint)

JMIR Medical Informatics ◽

10.2196/23305 ◽

2020 ◽

Author(s):

Beakcheol Jang ◽

Inhwan Kim ◽

Jong Wook Kim

Keyword(s):

Extraction Method ◽

Data Extraction ◽

Online News ◽

Training Data ◽

Influenza Outbreak ◽

Effective Training


An Iterative Multi-Source Mutual Knowledge Transfer Framework for Machine Reading Comprehension

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/525 ◽

2020 ◽

Cited By ~ 1

Author(s):

Xin Liu ◽

Kai Liu ◽

Xiang Li ◽

Jinsong Su ◽

Yubin Ge ◽

...

Keyword(s):

Reading Comprehension ◽

Knowledge Transfer ◽

Training Data ◽

Target Domain ◽

Domain Specific ◽

Mutual Knowledge ◽

Benchmark Datasets ◽

Knowledge Distillation ◽

The Many ◽

Machine Reading

The lack of sufficient training data in many domains poses a major challenge to the construction of domain-specific machine reading comprehension (MRC) models with satisfying performance. In this paper, we propose a novel iterative multi-source mutual knowledge transfer framework for MRC. As an extension of the conventional knowledge transfer with one-to-one correspondence, our framework focuses on the many-to-many mutual transfer, which involves synchronous executions of multiple many-to-one transfers in an iterative manner. Specifically, to update a target-domain MRC model, we first consider other domain-specific MRC models as individual teachers, and employ knowledge distillation to train a multi-domain MRC model, which is differentially required to fit the training data and match the outputs of these individual models according to their domain-level similarities to the target domain. After being initialized by the multi-domain MRC model, the target-domain MRC model is fine-tuned to match both its training data and the output of its previous best model simultaneously via knowledge distillation. Compared with previous approaches, our framework can continuously enhance all domain-specific MRC models by enabling each model to iteratively and differentially absorb the domain-shared knowledge from others. Experimental results and in-depth analyses on several benchmark datasets demonstrate the effectiveness of our framework.
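The abstract describes fitting the training data while matching the outputs of several teacher models, weighted by domain similarity. The sketch below shows a generic temperature-scaled distillation loss with weighted teachers for a single classification example. It is not the paper's exact objective: the function names, the `alpha` mixing factor, the `temperature` value, and the way `teacher_weights` are normalized are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(
    student_logits: Sequence[float],
    teachers_logits: Sequence[Sequence[float]],
    teacher_weights: Sequence[float],  # hypothetical domain-similarity weights
    label: int,
    alpha: float = 0.5,        # assumed mix between hard-label and teacher terms
    temperature: float = 2.0,  # assumed softening temperature
) -> float:
    """Combine cross-entropy on the true label with weighted teacher matching."""
    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[label])

    # Soft term: weighted average of KL(teacher || student) on softened outputs.
    student_soft = softmax(student_logits, temperature)
    weight_sum = sum(teacher_weights)
    soft = sum(
        (w / weight_sum) * kl_divergence(softmax(t, temperature), student_soft)
        for t, w in zip(teachers_logits, teacher_weights)
    )

    # The temperature**2 factor is the usual rescaling of the softened term.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft
```

When the student already reproduces every teacher, the soft term vanishes and only the hard-label term remains, which is a quick sanity check for an implementation of this kind.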


Empirical Evaluation of the Effect of Optimization and Regularization Techniques on the Generalization Performance of Deep Convolutional Neural Network

Applied Sciences ◽

10.3390/app10217817 ◽

2020 ◽

Vol 10 (21) ◽

pp. 7817

Author(s):

Ivana Marin ◽

Ana Kuzmanic Skelin ◽

Tamara Grujic

Keyword(s):

Neural Network ◽

Deep Learning ◽

Convolutional Neural Network ◽

Empirical Evaluation ◽

Training Data ◽

Generalization Performance ◽

Deep Model ◽

Unseen Data ◽

Regularization Techniques ◽

Learning Architectures

The main goal of any classification or regression task is to obtain a model that will generalize well on new, previously unseen data. Due to the recent rise of deep learning and many state-of-the-art results obtained with deep models, deep learning architectures have become one of the most used model architectures nowadays. To generalize well, a deep model needs to learn the training data well without overfitting. The latter implies a correlation of deep model optimization and regularization with generalization performance. In this work, we explore the effect of the used optimization algorithm and regularization techniques on the final generalization performance of the model with convolutional neural network (CNN) architecture widely used in the field of computer vision. We give a detailed overview of optimization and regularization techniques with a comparative analysis of their performance with three CNNs on the CIFAR-10 and Fashion-MNIST image datasets.


CHIRPS: Explaining random forest classification

Artificial Intelligence Review ◽

10.1007/s10462-020-09833-6 ◽

2020 ◽

Vol 53 (8) ◽

pp. 5747-5788

Author(s):

Julian Hatwell ◽

Mohamed Medhat Gaber ◽

R. Muhammad Atif Azad

Keyword(s):

Random Forest ◽

Pattern Mining ◽

Frequent Pattern Mining ◽

Training Data ◽

Frequent Pattern ◽

Data Sets ◽

Random Forest Classification ◽

Human In The Loop ◽

Forest Classification ◽

Unseen Data

Modern machine learning methods typically produce “black box” models that are opaque to interpretation. Yet, their demand has been increasing in the Human-in-the-Loop processes, that is, those processes that require a human agent to verify, approve or reason about the automated decisions before they can be applied. To facilitate this interpretation, we propose Collection of High Importance Random Path Snippets (CHIRPS), a novel algorithm for explaining random forest classification per data instance. CHIRPS extracts a decision path from each tree in the forest that contributes to the majority classification, and then uses frequent pattern mining to identify the most commonly occurring split conditions. Then a simple, conjunctive form rule is constructed where the antecedent terms are derived from the attributes that had the most influence on the classification. This rule is returned alongside estimates of the rule’s precision and coverage on the training data along with counter-factual details. An experimental study involving nine data sets shows that classification rules returned by CHIRPS have a precision at least as high as the state of the art when evaluated on unseen data (0.91–0.99) and offer a much greater coverage (0.04–0.54). Furthermore, CHIRPS uniquely controls against under- and over-fitting solutions by maximising novel objective functions that are better suited to the local (per instance) explanation setting.
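The idea of counting split conditions across decision paths and then scoring a conjunctive rule can be illustrated with a heavily simplified sketch. This is not the CHIRPS algorithm: it replaces frequent pattern mining with plain frequency counting, and the `Condition` tuple layout, function names, and feature names in the usage note are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Assumed encoding of one split condition: (feature, operator, threshold),
# where operator is either "<=" or ">".
Condition = Tuple[str, str, float]


def holds(cond: Condition, row: Dict[str, float]) -> bool:
    """Check whether a single split condition is satisfied by a data row."""
    feature, op, threshold = cond
    return row[feature] <= threshold if op == "<=" else row[feature] > threshold


def most_common_conditions(paths: Sequence[Sequence[Condition]], k: int) -> List[Condition]:
    """Return the k conditions that occur in the largest number of paths."""
    # set(path) counts each condition at most once per tree path.
    counts = Counter(cond for path in paths for cond in set(path))
    return [cond for cond, _ in counts.most_common(k)]


def precision_and_coverage(
    rule: Sequence[Condition],
    rows: Sequence[Dict[str, float]],
    labels: Sequence[str],
    target: str,
) -> Tuple[float, float]:
    """Score a conjunctive rule: share of rows covered, and share of those with the target label."""
    covered = [y for row, y in zip(rows, labels) if all(holds(c, row) for c in rule)]
    coverage = len(covered) / len(rows) if rows else 0.0
    precision = covered.count(target) / len(covered) if covered else 0.0
    return precision, coverage
```

As a usage illustration, given one path per tree that voted for the majority class, `most_common_conditions(paths, 2)` would yield a two-term antecedent such as `petal_len <= 2.5 AND sepal_w > 3.0`, which `precision_and_coverage` can then score on the training rows.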


The Real-Time Mobile Application for Classifying of Endangered Parrot Species Using the CNN Models Based on Transfer Learning

Mobile Information Systems ◽

10.1155/2020/1475164 ◽

2020 ◽

Vol 2020 ◽

pp. 1-13

Author(s):

Daegyu Choe ◽

Eunjeong Choi ◽

Dong Keun Kim

Keyword(s):

Real Time ◽

Transfer Learning ◽

Mobile Device ◽

Mobile Application ◽

Feature Recognition ◽

Species Recognition ◽

Color Similarity ◽

Parrot Species ◽

The Many ◽

Google Search

Among the many deep learning methods, the convolutional neural network (CNN) model has excellent performance in image recognition. Research on identifying and classifying image datasets using CNNs is ongoing. Animal species recognition and classification with CNNs is expected to be helpful for various applications. However, sophisticated feature recognition is essential to classify quasi-species with similar features, such as the quasi-species of parrots that have a high color similarity. The purpose of this study is to develop a vision-based mobile application to classify endangered parrot species using an advanced CNN model based on transfer learning (some parrots have quite similar colors and shapes). We acquired the images in two ways: collecting them directly from the Seoul Grand Park Zoo and crawling them using Google search. Subsequently, we built advanced CNN models with transfer learning and trained them using the data. Next, we converted one of the fully trained models into a file for execution on mobile devices and created the Android package files. The accuracy was measured for each of the eight CNN models. The overall accuracy for the camera of the mobile device was 94.125%. For certain species, the accuracy of recognition was 100%, with the required time of only 455 ms. Our approach helps to recognize the species in real time using the camera of the mobile device. Applications will be helpful for the prevention of smuggling of endangered species in the customs clearance area.


Occurrence, Composition and Formation of Ruppia, Widgeon Grass, balls in Saskatchewan Lakes

The Canadian Field-Naturalist ◽

10.22621/cfn.v119i1.89 ◽

2005 ◽

Vol 119 (1) ◽

pp. 114 ◽

Cited By ~ 1

Author(s):

Randy W. Olson ◽

Josef K. Schmutz ◽

Theodore Hammer

Keyword(s):

Plant Material ◽

Saline Lake ◽

Vascular Plant ◽

Herbarium Specimens ◽

Ruppia Maritima ◽

Washing Machine ◽

Invertebrate Animal ◽

Turtle Grass ◽

Aquatic Vascular Plant ◽

Near Shore

Widgeon Grass (Ruppia maritima) is an aquatic vascular plant (Ruppiaceae) which has been the source for rare balls of plant material found at the shores of lakes on four continents. In North America, the lakes involved were in North Dakota, Oregon, and now northern and southern Saskatchewan. The formation of the balls has not been observed in nature, but similar balls have been produced in other studies with Posidonia or Turtle Grass (Hydrocharitaceae) fibers under the wavelike action in a washing machine. Our samples are from a saline lake in southern Saskatchewan (49°N), and an over 40-year-old sample from an unknown lake north of the boreal transition zone (52°N). Comparisons of the plant material with herbarium specimens confirm that the balls are almost entirely comprised of Ruppia maritima, with minor items including invertebrate animal parts, sand pebbles and feathers. The context in which the material was found is consistent with the proposition that they are formed by Ruppia inflorescences breaking apart, drifting to near shore due to wind and being rolled into balls by wave action.
