A Workflow for Data Extraction from Digitized Herbarium Specimens

Author(s):  
Sohaib Younis ◽  
Marco Schmidt ◽  
Bernhard Seeger ◽  
Thomas Hickler ◽  
Claus Weiland

Based on our own work on species and trait recognition and on complementary studies from other working groups, we present a workflow for data extraction from digitized herbarium specimens using convolutional neural networks. Digitized herbarium sheets contain preserved plant material as well as additional objects: the label containing information on the collection event; annotations such as revision labels or notes on material extraction; identifiers such as barcodes or numbers; envelopes for loose plant material; and often scale bars and color charts used in the digitization process. In order to treat these objects appropriately, segmentation techniques (Triki et al. 2018) will be applied to localize and identify the different kinds of objects for specific treatments. Detecting the presence of plant organs such as leaves, flowers or fruits is already a first step in data extraction, potentially useful for phenological studies. Plant organs will be subject to quantitative (Gaikwad et al. 2018) and qualitative (Younis et al. 2018) trait recognition routines. Text-based objects can be treated as described by Kirchhoff et al. (2018), using OCR techniques and considering the many collection-specific terms and abbreviations as described in Schröder (2019). Additionally, species recognition (Younis et al. 2018) will be applied to help further identify incompletely identified collection items or to detect possible misidentifications. All steps described above need sufficient training data, including labelling that may be obtained from collection metadata and trait databases.
To deal with newly digitized collections, unseen data or new categories, we propose implementing a new deep learning approach, so-called Lifelong Learning. Past knowledge of the network is dynamically saved in latent space using an autoencoder and generatively replayed while the network is trained on new tasks. This enables the network to solve complex image processing tasks and to learn new classes and knowledge incrementally without forgetting former knowledge.
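The routing idea in this workflow (segment the sheet into objects, then send each kind of object to its own routine) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the category names, the handler functions and the string "crops" are all hypothetical placeholders standing in for real segmentation output, trait recognition and OCR models.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation of one segmented region:
# (category, crop), where `crop` stands in for an image crop.
Region = Tuple[str, str]


def extract_traits(crop: str) -> str:
    """Placeholder for trait recognition on plant material."""
    return f"traits({crop})"


def run_ocr(crop: str) -> str:
    """Placeholder for OCR on labels and annotations."""
    return f"text({crop})"


def read_identifier(crop: str) -> str:
    """Placeholder for decoding a barcode or number."""
    return f"id({crop})"


# Routing table (an assumption): which routine handles which category.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "plant": extract_traits,
    "label": run_ocr,
    "annotation": run_ocr,
    "barcode": read_identifier,
}


def process_sheet(regions: List[Region]) -> Dict[str, List[str]]:
    """Dispatch each segmented region to its handler.

    Categories without a handler (here: scale bars, color charts)
    are skipped rather than treated as errors.
    """
    results: Dict[str, List[str]] = {}
    for category, crop in regions:
        handler = HANDLERS.get(category)
        if handler is None:
            continue
        results.setdefault(category, []).append(handler(crop))
    return results


sheet = [
    ("plant", "crop0"),
    ("label", "crop1"),
    ("color_chart", "crop2"),
    ("barcode", "crop3"),
]
print(process_sheet(sheet))
```

Keeping the routing in a table makes it straightforward to add a handler for a new object category without touching the dispatch loop.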

Author(s):  
Alexander White ◽  
Rebecca Dikow ◽  
Makinnon Baugh ◽  
Abby Jenkins ◽  
Paul Frandsen

Digitized herbarium images contain complex information unrelated to the shape and color of the specimens represented within them. This information can contribute a substantial amount of noise if one is to use the image as a proxy for pattern, shape, or color of the specimen. Image segmentation, whereby the specimen material is partitioned from the background (e.g., herbarium sheet, label, color ramp), offers one possible solution, yet training data for image segmentation of herbarium specimens is nonexistent. We present a pipeline for generating training data for image segmentation tasks along with a novel dataset of highly resolved image masks segmenting plant material from background noise. This dataset can be used to train neural networks to segment plant material in herbarium sheets more generally, and our method is applicable to other museum data sources where masking may be useful for quantitative analysis of patterns and shapes.
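As a small illustration of what such a mask is used for, the sketch below applies a binary mask to an image so that only plant pixels survive. It assumes a toy representation (nested lists of RGB tuples and booleans) chosen to keep the example dependency-free; the function name `apply_mask` and the data are hypothetical and not taken from the paper's pipeline.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def apply_mask(
    image: List[List[Pixel]],
    mask: List[List[bool]],
    fill: Pixel = (0, 0, 0),
) -> List[List[Pixel]]:
    """Keep pixels where the mask is True; replace the rest with `fill`."""
    if len(image) != len(mask) or any(
        len(img_row) != len(mask_row) for img_row, mask_row in zip(image, mask)
    ):
        raise ValueError("mask must have the same height and width as the image")
    return [
        [pixel if keep else fill for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Toy 2x2 "sheet": only the top-left pixel is treated as plant material.
image = [
    [(34, 120, 40), (250, 250, 245)],
    [(248, 249, 244), (10, 10, 10)],
]
mask = [
    [True, False],
    [False, False],
]
masked = apply_mask(image, mask)
```

After masking, measurements of color or shape are computed only from the pixels marked as specimen, which is the noise reduction the abstract describes.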


Taxon ◽  
2013 ◽  
Vol 62 (4) ◽  
pp. 790-797 ◽  
Author(s):  
Tonya A. Lander ◽  
Bernadeta Dadonaite ◽  
Alex K. Monro

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Huu-Thanh Duong ◽  
Tram-Anh Nguyen-Thi

Abstract In the literature, machine learning-based studies of sentiment analysis usually rely on supervised learning, which requires pre-labeled datasets that are sufficiently large for the domain in question. Such datasets are tedious, expensive and time-consuming to build, and supervised models struggle with unseen data. This paper applies semi-supervised learning to Vietnamese sentiment analysis, for which datasets are limited. We summarize the preprocessing techniques used to clean and normalize the data, together with negation handling and intensification handling, to improve performance. We also present data augmentation techniques, which generate new data from the original data to enrich the training data without user intervention. In the experiments, we evaluated various aspects and obtained competitive results, which may motivate further work.


2020 ◽  
Vol 8 ◽  
Author(s):  
Sohaib Younis ◽  
Marco Schmidt ◽  
Claus Weiland ◽  
Stefan Dressler ◽  
Bernhard Seeger ◽  
...  

As herbarium specimens are increasingly becoming digitised and accessible in online repositories, advanced computer vision techniques are being used to extract information from them. The presence of certain plant organs on herbarium sheets is useful information in various scientific contexts and automatic recognition of these organs will help mobilise such information. In our study, we use deep learning to detect plant organs on digitised herbarium specimens with Faster R-CNN. For our experiment, we manually annotated hundreds of herbarium scans with thousands of bounding boxes for six types of plant organs and used them for training and evaluating the plant organ detection model. The model worked particularly well on leaves and stems, while flowers were also present in large numbers in the sheets, but were not equally well recognised.


Author(s):  
Xin Liu ◽  
Kai Liu ◽  
Xiang Li ◽  
Jinsong Su ◽  
Yubin Ge ◽  
...  

The lack of sufficient training data in many domains poses a major challenge to the construction of domain-specific machine reading comprehension (MRC) models with satisfactory performance. In this paper, we propose a novel iterative multi-source mutual knowledge transfer framework for MRC. As an extension of the conventional knowledge transfer with one-to-one correspondence, our framework focuses on the many-to-many mutual transfer, which involves synchronous executions of multiple many-to-one transfers in an iterative manner. Specifically, to update a target-domain MRC model, we first consider other domain-specific MRC models as individual teachers, and employ knowledge distillation to train a multi-domain MRC model, which is differentially required to fit the training data and match the outputs of these individual models according to their domain-level similarities to the target domain. After being initialized by the multi-domain MRC model, the target-domain MRC model is fine-tuned to match both its training data and the output of its previous best model simultaneously via knowledge distillation. Compared with previous approaches, our framework can continuously enhance all domain-specific MRC models by enabling each model to iteratively and differentially absorb the domain-shared knowledge from others. Experimental results and in-depth analyses on several benchmark datasets demonstrate the effectiveness of our framework.
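One way to read "fit the training data and match the outputs of these individual models according to their domain-level similarities" is as a loss that mixes a hard-label term with teacher terms weighted by similarity. The sketch below is a generic multi-teacher distillation loss under that reading, not the paper's actual objective: the mixing factor `alpha`, the normalization of the similarity scores and all names are assumptions made for illustration.

```python
import math
from typing import List


def cross_entropy(target: List[float], predicted: List[float]) -> float:
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def distillation_loss(
    student: List[float],
    one_hot_label: List[float],
    teacher_outputs: List[List[float]],
    similarities: List[float],
    alpha: float = 0.5,
) -> float:
    """Mix a hard-label loss with similarity-weighted teacher-matching losses.

    `similarities` holds one non-negative score per teacher (assumed to
    measure closeness to the target domain); scores are normalized to sum to 1.
    """
    total = sum(similarities)
    weights = [s / total for s in similarities]
    hard = cross_entropy(one_hot_label, student)
    soft = sum(
        w * cross_entropy(teacher, student)
        for w, teacher in zip(weights, teacher_outputs)
    )
    return (1 - alpha) * hard + alpha * soft


# Hypothetical probabilities over three answer candidates.
student = [0.7, 0.2, 0.1]
label = [1.0, 0.0, 0.0]
teachers = [[0.8, 0.1, 0.1], [0.4, 0.4, 0.2]]
similarities = [0.9, 0.3]  # first teacher's domain is assumed closer
loss = distillation_loss(student, label, teachers, similarities)
```

With this weighting, a teacher from a more similar domain pulls the student's output distribution harder than a teacher from a distant one.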


2020 ◽  
Vol 10 (21) ◽  
pp. 7817
Author(s):  
Ivana Marin ◽  
Ana Kuzmanic Skelin ◽  
Tamara Grujic

The main goal of any classification or regression task is to obtain a model that will generalize well on new, previously unseen data. Due to the recent rise of deep learning and many state-of-the-art results obtained with deep models, deep learning architectures have become one of the most used model architectures nowadays. To generalize well, a deep model needs to learn the training data well without overfitting. The latter implies a correlation of deep model optimization and regularization with generalization performance. In this work, we explore the effect of the used optimization algorithm and regularization techniques on the final generalization performance of the model with convolutional neural network (CNN) architecture widely used in the field of computer vision. We give a detailed overview of optimization and regularization techniques with a comparative analysis of their performance with three CNNs on the CIFAR-10 and Fashion-MNIST image datasets.


2020 ◽  
Vol 53 (8) ◽  
pp. 5747-5788
Author(s):  
Julian Hatwell ◽  
Mohamed Medhat Gaber ◽  
R. Muhammad Atif Azad

Abstract Modern machine learning methods typically produce “black box” models that are opaque to interpretation. Yet demand for them has been increasing in Human-in-the-Loop processes, that is, those processes that require a human agent to verify, approve or reason about the automated decisions before they can be applied. To facilitate this interpretation, we propose Collection of High Importance Random Path Snippets (CHIRPS), a novel algorithm for explaining random forest classification per data instance. CHIRPS extracts a decision path from each tree in the forest that contributes to the majority classification, and then uses frequent pattern mining to identify the most commonly occurring split conditions. Then a simple, conjunctive form rule is constructed where the antecedent terms are derived from the attributes that had the most influence on the classification. This rule is returned alongside estimates of the rule’s precision and coverage on the training data along with counter-factual details. An experimental study involving nine data sets shows that classification rules returned by CHIRPS have a precision at least as high as the state of the art when evaluated on unseen data (0.91–0.99) and offer a much greater coverage (0.04–0.54). Furthermore, CHIRPS uniquely controls against under- and over-fitting solutions by maximising novel objective functions that are better suited to the local (per instance) explanation setting.
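The precision and coverage reported for a rule can be made concrete with a short sketch. It uses the common definitions (coverage: the share of instances that satisfy every condition in the antecedent; precision: the share of those covered instances that carry the rule's target class). Whether CHIRPS estimates them in exactly this way is not stated in the abstract, so treat the function and the toy data as assumptions.

```python
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, float]
Condition = Callable[[Instance], bool]


def rule_precision_coverage(
    rule: List[Condition],
    data: List[Tuple[Instance, str]],
    target: str,
) -> Tuple[float, float]:
    """Return (precision, coverage) of a conjunctive rule on labeled data."""
    # An instance is covered only if all antecedent conditions hold.
    covered = [label for x, label in data if all(cond(x) for cond in rule)]
    coverage = len(covered) / len(data)
    if not covered:
        return 0.0, coverage
    precision = sum(1 for label in covered if label == target) / len(covered)
    return precision, coverage


# Hypothetical rule: IF a > 1 AND b <= 5 THEN class "pos".
rule = [lambda x: x["a"] > 1, lambda x: x["b"] <= 5]
data = [
    ({"a": 2, "b": 3}, "pos"),  # covered, correct
    ({"a": 3, "b": 4}, "neg"),  # covered, incorrect
    ({"a": 0, "b": 1}, "pos"),  # not covered (a <= 1)
    ({"a": 2, "b": 9}, "pos"),  # not covered (b > 5)
]
precision, coverage = rule_precision_coverage(rule, data, "pos")
```

The two numbers pull against each other: adding conditions to the antecedent tends to raise precision while shrinking coverage.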


2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Daegyu Choe ◽  
Eunjeong Choi ◽  
Dong Keun Kim

Among the many deep learning methods, the convolutional neural network (CNN) model performs particularly well in image recognition. Research on identifying and classifying image datasets using CNNs is ongoing. Animal species recognition and classification with CNNs is expected to be helpful for various applications. However, sophisticated feature recognition is essential to classify quasi-species with similar features, such as the quasi-species of parrots that have a high color similarity. The purpose of this study is to develop a vision-based mobile application to classify endangered parrot species, some of which have quite similar colors and shapes, using an advanced CNN model based on transfer learning. We acquired the images in two ways: collecting them directly from the Seoul Grand Park Zoo and crawling them using Google search. Subsequently, we built advanced CNN models with transfer learning and trained them using the data. Next, we converted one of the fully trained models into a file for execution on mobile devices and created the Android package files. The accuracy was measured for each of the eight CNN models. The overall accuracy for the camera of the mobile device was 94.125%. For certain species, the accuracy of recognition was 100%, with a required time of only 455 ms. Our approach helps to recognize the species in real time using the camera of the mobile device. Such applications will be helpful for preventing the smuggling of endangered species in the customs clearance area.


2005 ◽  
Vol 119 (1) ◽  
pp. 114 ◽  
Author(s):  
Randy W. Olson ◽  
Josef K. Schmutz ◽  
Theodore Hammer

Widgeon Grass (Ruppia maritima) is an aquatic vascular plant (Ruppiaceae) which has been the source for rare balls of plant material found at the shores of lakes on four continents. In North America, the lakes involved were in North Dakota, Oregon, and now northern and southern Saskatchewan. The formation of the balls has not been observed in nature, but similar balls have been produced in other studies with Posidonia or Turtle Grass (Hydrocharitaceae) fibers under the wavelike action in a washing machine. Our samples are from a saline lake in southern Saskatchewan (49°N), and an over 40-year-old sample from an unknown lake north of the boreal transition zone (52°N). Comparisons of the plant material with herbarium specimens confirm that the balls are almost entirely comprised of Ruppia maritima, with minor items including invertebrate animal parts, sand pebbles and feathers. The context in which the material was found is consistent with the proposition that they are formed by Ruppia inflorescences breaking apart, drifting to near shore due to wind and being rolled into balls by wave action.

