A Pipeline for Deep Learning with Specimen Images in iDigBio - Applying and Generalizing an Examination of Mercury Use in Preparing Herbarium Specimens

2018 ◽  
Vol 2 ◽  
pp. e25699
Author(s):  
Matthew Collins ◽  
Gaurav Yeole ◽  
Paul Frandsen ◽  
Rebecca Dikow ◽  
Sylvia Orli ◽  
...  

iDigBio (Matsunaga et al. 2013) currently references over 22 million media files and stores approximately 120 terabytes of those media files co-located with our compute infrastructure. Using these images for scientific research is a logistical and technical challenge. Transferring large numbers of images requires programming skill, bandwidth, and storage space. While simple image transformations such as resizing and generating histograms are approachable on desktops and laptops, the neural networks commonly used for learning from images require server-based graphics processing units (GPUs) to run effectively. Using the GUODA (Global Unified Open Data Access) infrastructure, we have built a model pipeline for applying user-defined processing to any subset of the images stored in iDigBio. This pipeline runs on servers located in the Advanced Computing and Information Systems (ACIS) lab alongside the iDigBio storage system. We use Apache Spark, the Hadoop File System (HDFS), and Mesos to perform the processing. In front of this architecture we have placed a Jupyter notebook server, which provides end users with an environment, preloaded with deep learning libraries for Python, in which to write their own models. Users can access the stored data and images, manipulate them according to their requirements, and make their work publicly available on GitHub. As an example of how this pipeline can be used in research, we applied a neural network developed at the Smithsonian Institution to identify herbarium sheets that were prepared with hazardous mercury-containing solutions (Schuettpelz et al. 2017). The model was trained with Smithsonian resources on their images and transferred to the GUODA infrastructure hosted at ACIS, which also houses iDigBio. We then applied this model to additional images in iDigBio to classify them, illustrating the application of these techniques to broad image corpora and potentially enabling notification of other data publishers about contamination. We present the results of this classification not as a verified research result, but as an example of the collaborative and scalable workflows this pipeline and infrastructure enable.
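The core idea of such a pipeline — applying a user-defined function to a chosen subset of media records — can be sketched in miniature. In the real system this map runs as an Apache Spark job over images in HDFS; below, a plain Python iterator stands in for the distributed collection, and all names (MediaRecord, classify_mercury, run_pipeline) and the stub classifier are illustrative placeholders, not part of the iDigBio or GUODA APIs.

```python
# Minimal sketch: apply a user-defined classifier to a subset of media
# records and yield the flagged ones. A Spark job would express the same
# logic as a map/filter over an RDD of images fetched from HDFS.
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple

@dataclass
class MediaRecord:
    uuid: str
    image_bytes: bytes  # in production, fetched from the co-located store

def classify_mercury(record: MediaRecord) -> float:
    """Stand-in for the mercury-stain classifier: returns a fake
    'contamination score'. A real model would run a neural network
    on the decoded image."""
    return (len(record.image_bytes) % 100) / 100.0

def run_pipeline(records: Iterable[MediaRecord],
                 model: Callable[[MediaRecord], float],
                 threshold: float = 0.5) -> Iterator[Tuple[str, float]]:
    """Yield (uuid, score) for records the model flags above threshold."""
    for rec in records:
        score = model(rec)
        if score >= threshold:
            yield rec.uuid, score

# Usage: flag one of two toy records.
records = [MediaRecord("a", b"x" * 73), MediaRecord("b", b"x" * 10)]
flagged = list(run_pipeline(records, classify_mercury))
```

The same generator-style structure transfers directly to a Spark `map` followed by a `filter`, which is what makes the notebook-fronted design approachable for end users.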

2020 ◽  
Vol 8 ◽  
Author(s):  
Sohaib Younis ◽  
Marco Schmidt ◽  
Claus Weiland ◽  
Stefan Dressler ◽  
Bernhard Seeger ◽  
...  

As herbarium specimens are increasingly digitised and made accessible in online repositories, advanced computer vision techniques are being used to extract information from them. The presence of certain plant organs on herbarium sheets is useful information in various scientific contexts, and automatic recognition of these organs will help mobilise such information. In our study, we use deep learning to detect plant organs on digitised herbarium specimens with Faster R-CNN. For our experiment, we manually annotated hundreds of herbarium scans with thousands of bounding boxes for six types of plant organs and used them for training and evaluating the plant organ detection model. The model worked particularly well on leaves and stems; flowers, although also present in large numbers on the sheets, were not recognised equally well.
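Evaluating a detector like this against hand-annotated bounding boxes conventionally rests on intersection over union (IoU), the standard overlap measure in object-detection benchmarks. The sketch below assumes a simple (x1, y1, x2, y2) box format; the helper name is illustrative, not from the paper.

```python
# Intersection over union (IoU) of two axis-aligned boxes, the standard
# criterion for matching predicted organ boxes to ground-truth annotations.
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes; 0.0 when they do not overlap."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A predicted leaf box partially overlapping a ground-truth annotation:
score = iou((0, 0, 10, 10), (5, 5, 15, 15))  # intersection 25, union 175
```

A detection typically counts as correct when its IoU with an annotation exceeds a threshold such as 0.5, which is how "worked particularly well on leaves and stems" would be quantified per organ class.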


Author(s):  
Michael Yoseph Ricky

To support the decision-making process effectively and efficiently, many companies invest in information technology for data management and storage. This technology emphasizes storing large numbers of daily transactions in a single medium so that the data can be processed easily. The Online Mobile Repository System is a system that uses a practical temporary storage system that can be accessed online; it thus simplifies data access and organization anywhere and at any time. The design of this system used literature review, interviews, field studies, and combined design studies. The research resulted in an organized, secure, and structured data storage system that supports the meeting process.


2020 ◽  
Vol 20 (S4) ◽  
Author(s):  
Li Zhang ◽  
Jiamei Hu ◽  
Qianzhi Xu ◽  
Fang Li ◽  
Guozheng Rao ◽  
...  

Abstract Background Semantic web technology has been applied widely in the biomedical informatics field. Large numbers of biomedical datasets are available online in the Resource Description Framework (RDF) format. Mining semantic relationships among genes, disorders, and drugs is widely used in, for example, precision medicine and drug repositioning. However, most existing studies have focused on a single dataset. It is not easy to find the most current disorder-gene-drug relationships, since they are distributed across heterogeneous datasets. How to mine these semantic relationships from different biomedical datasets is therefore an important issue. Methods First, a variety of biomedical datasets were converted into RDF triple data; then, the multisource biomedical datasets were integrated into a storage system using a data integration algorithm. Second, nine query patterns among genes, disorders, and drugs from different biomedical datasets were designed. Third, a gene-disorder-drug semantic relationship mining algorithm is presented. This algorithm can query the relationships among various entities across different datasets. Results and conclusions We focused on mining the putative and most current disorder-gene-drug relationships for Parkinson’s disease (PD). The results demonstrate that our method has significant advantages in mining and integrating multisource heterogeneous biomedical datasets. Twenty-five new relationships among genes, disorders, and drugs were mined from four different datasets; the query results showed that most of them came from different datasets. The precision of the method increased by 2.51% compared to that of the multisource linked open data fusion method presented at the 4th International Workshop on Semantics-Powered Data Mining and Analytics (SEPDA 2019). Moreover, the number of query results increased by 7.7%, and the number of correct queries increased by 9.5%.
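A gene-disorder-drug query pattern of the kind the abstract describes can be illustrated over toy RDF-style triples. The predicate names and example entities below are invented for illustration; the paper's nine patterns operate over real RDF stores, typically via SPARQL rather than this in-memory join.

```python
# Toy (subject, predicate, object) triples standing in for integrated
# multisource RDF data. Predicates and entities are illustrative only.
triples = [
    ("SNCA", "associatedWith", "Parkinson's disease"),
    ("LRRK2", "associatedWith", "Parkinson's disease"),
    ("levodopa", "treats", "Parkinson's disease"),
    ("SNCA", "associatedWith", "Lewy body dementia"),
]

def query(pattern, data):
    """Match a (s, p, o) pattern against the data; None is a wildcard."""
    return [t for t in data
            if all(p is None or p == v for p, v in zip(pattern, t))]

def genes_and_drugs_for(disorder, data):
    """Join genes and drugs through a shared disorder, yielding
    (gene, disorder, drug) relationships."""
    genes = [s for s, _, _ in query((None, "associatedWith", disorder), data)]
    drugs = [s for s, _, _ in query((None, "treats", disorder), data)]
    return [(g, disorder, d) for g in genes for d in drugs]

relations = genes_and_drugs_for("Parkinson's disease", triples)
```

Because the triples carry no schema beyond their predicates, the same join works unchanged whether the triples originated in one dataset or were merged from several — which is the point of the integration step.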


Author(s):  
Nicolas Cazenave

Herbaria hold large numbers of specimens: approximately 22 million herbarium specimens exist as botanical reference objects in Germany, 20 million in France, and about 500 million worldwide. High-resolution digital images of these specimens take up substantial bandwidth and disk space. New methods of extracting information from specimen labels have been developed using OCR (optical character recognition) techniques, but exploiting this technology for biological specimens is particularly complex due to the presence of biological material in the image alongside the text, the non-standard vocabularies, and the variation and age of the fonts. Much of the information is handwritten, and natural handwriting recognition is a less mature technology than OCR. Today, our system (eTDR, the European Trusted Digital Repository) provides OCR technology (using the Tesseract software) adapted to the requirements of herbarium specimen images and requires minimal installation in each institution. This is what we propose to make available to botanists through our portal. The goal for a museum is to be able to easily submit a large number of scanned images to a long-term archiving system, automatically obtain OCR texts, and retrieve them by full-text search on an open data portal. Most of the images are provided for reuse under CC-BY licences. In each case, the rights of reuse associated with the data are specified in the associated metadata. This pilot was an opportunity to test the long-term storage service eTDR provided by CINES. The services (B2SAFE, B2Handle) developed by EUDAT were used to facilitate the transfer of data to the storage repository and to provide indexing services for access to that repository. This workflow, which has been tested for the European project ICEDIG, is presented as a poster: see the document (Suppl. material 1).
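The "retrieve by full-text search" step can be sketched as an inverted index over OCR output: each token maps to the specimen records containing it. The record identifiers and label texts below are invented, and a production portal would use a real search engine rather than this in-memory structure.

```python
# Build an inverted index over OCR'd label text, then look up specimens
# by a search term. Tokenisation here is naive whitespace splitting;
# real OCR output would also need normalisation of accents and noise.
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

# Invented specimen ids and OCR texts, for illustration only:
ocr_texts = {
    "P00123": "Herbier National Quercus robur France 1898",
    "P00456": "Quercus petraea Allemagne 1901",
}
index = build_index(ocr_texts)
hits = sorted(index["quercus"])
```

Storing the index alongside the archived images is what lets the portal answer label queries without re-running OCR.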


Sensors ◽  
2021 ◽  
Vol 21 (8) ◽  
pp. 2611
Author(s):  
Andrew Shepley ◽  
Greg Falzon ◽  
Christopher Lawson ◽  
Paul Meek ◽  
Paul Kwan

Image data is one of the primary sources of ecological data used in biodiversity conservation and management worldwide. However, classifying and interpreting large numbers of images is time- and resource-intensive, particularly in the context of camera trapping. Deep learning models have been used to achieve this task but are often unsuited to specific applications due to their inability to generalise to new environments and their inconsistent performance. Models need to be developed for specific species cohorts and environments, but the technical skills required to achieve this are a key barrier to the accessibility of this technology to ecologists. There is therefore a strong need to democratise access to deep learning technologies by providing an easy-to-use software application that allows non-technical users to train custom object detectors. U-Infuse addresses this issue by providing ecologists with the ability to train customised models using publicly available images and/or their own images, without specific technical expertise. Auto-annotation and annotation-editing functionalities minimise the burden of manually annotating and pre-processing large numbers of images. U-Infuse is a free and open-source software solution that supports both multi-class and single-class training and object detection, allowing ecologists to access deep learning technologies usually available only to computer scientists, on their own device, customised for their application, and without sharing intellectual property or sensitive data. It provides ecological practitioners with the ability to (i) easily achieve object detection within a user-friendly GUI, generating a species distribution report and other useful statistics, (ii) custom-train deep learning models using publicly available and custom training data, and (iii) achieve supervised auto-annotation of images for further training, with the benefit of editing annotations to ensure quality datasets.
Broad adoption of U-Infuse by ecological practitioners will improve ecological image analysis and processing by allowing significantly more image data to be processed with minimal expenditure of time and resources, particularly for camera trap images. Ease of training and the use of transfer learning mean that domain-specific models can be trained rapidly and updated frequently without the need for computer science expertise or data sharing, protecting intellectual property and privacy.
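The supervised auto-annotation workflow — keep confident machine-generated boxes as training data, queue the rest for human editing — can be sketched as a simple partition. The record layout and the 0.8 threshold are assumptions for illustration, not U-Infuse's actual data format or defaults.

```python
# Partition auto-generated annotations by model confidence: accepted
# records go straight into the training set, the rest are queued for
# human review and editing.
def split_annotations(annotations, threshold=0.8):
    """Return (accepted, needs_review) lists of annotation dicts."""
    accepted = [a for a in annotations if a["confidence"] >= threshold]
    review = [a for a in annotations if a["confidence"] < threshold]
    return accepted, review

# Invented camera-trap detections, for illustration only:
auto = [
    {"image": "cam1_001.jpg", "label": "fox", "confidence": 0.95},
    {"image": "cam1_002.jpg", "label": "cat", "confidence": 0.41},
]
accepted, review = split_annotations(auto)
```

Routing only the low-confidence minority to a human editor is what makes the annotation cost scale with model uncertainty rather than with dataset size.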


IEEE Spectrum ◽  
2021 ◽  
Vol 58 (10) ◽  
pp. 32-33
Author(s):  
Samuel K. Moore ◽  
David Schneider ◽  
Eliza Strickland

2020 ◽  
Vol 245 ◽  
pp. 04017
Author(s):  
Dario Barberis ◽  
Igor Aleksandrov ◽  
Evgeny Alexandrov ◽  
Zbigniew Baranowski ◽  
Gancho Dimitrov ◽  
...  

The ATLAS EventIndex was designed in 2012-2013 to provide a global event catalogue and limited event-level metadata for ATLAS analysis groups and users during LHC Run 2 (2015-2018). It provides a good and reliable service for the initial use cases (mainly event picking) and several additional ones, such as production consistency checks, duplicate event detection, and measurements of the overlaps of trigger chains and derivation datasets. LHC Run 3, starting in 2021, will see increased data-taking and simulation production rates; the current infrastructure would still cope with these, but may be stretched to its limits by the end of Run 3. This paper describes the implementation of a new core storage service that will be able to provide at least the same functionality as the current one at increased data ingestion and search rates, and with increasing volumes of stored data. It is based on a set of HBase tables, with schemas derived from the current Oracle implementation, coupled with Apache Phoenix for data access; in this way we add the possibility of SQL as well as NoSQL data access to the advantages of a BigData-based storage system, allowing us to re-use most of the existing code for metadata integration.
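Event picking against an HBase-style table reduces to row-key design: if rows are keyed by (dataset, run, event), a single get retrieves the event-level metadata, and zero-padded keys keep range scans in order. The key layout, field names, and values below are an illustration, not the actual EventIndex schema.

```python
# Sketch of event picking via composed row keys. An in-memory dict
# stands in for the HBase table; Phoenix would expose the same rows
# through SQL over a matching composite primary key.
def row_key(dataset, run_number, event_number):
    """Compose a sortable string key, zero-padded so that lexicographic
    order matches numeric order within a dataset."""
    return f"{dataset}.{run_number:08d}.{event_number:012d}"

table = {
    row_key("data18_13TeV", 358031, 42): {"guid": "FILE-GUID-1", "offset": 7},
}

def pick_event(dataset, run_number, event_number):
    """Return the stored event-level metadata, or None if absent."""
    return table.get(row_key(dataset, run_number, event_number))

meta = pick_event("data18_13TeV", 358031, 42)
```

Deriving the key from fields that already exist in the Oracle schema is what lets the same lookup code serve both the NoSQL get path and the Phoenix SQL path.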


2021 ◽  
Author(s):  
Wing Keung Cheung ◽  
Robert Bell ◽  
Arjun Nair ◽  
Leon Menezies ◽  
Riyaz Patel ◽  
...  

Abstract A fully automatic two-dimensional U-Net model is proposed to segment the aorta and coronary arteries in computed tomography images. Two models are trained to segment two regions of interest: (1) the aorta and the coronary arteries, or (2) the coronary arteries alone. Our method achieves 91.20% and 88.80% Dice similarity coefficient accuracy on regions of interest 1 and 2, respectively. Compared with a semi-automatic segmentation method, our model performs better when segmenting the coronary arteries alone. The performance of the proposed method is comparable to that of existing published two-dimensional or three-dimensional deep learning models. Furthermore, algorithmic and graphics processing unit memory efficiency is maintained such that the model can be deployed within hospital computer networks, where graphics processing units are typically not available.
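The Dice similarity coefficient used to report these accuracies is DSC = 2|A ∩ B| / (|A| + |B|) over the predicted and ground-truth masks. The sketch below uses flat 0/1 lists to stay dependency-free; the toy masks are invented for illustration.

```python
# Dice similarity coefficient of two binary segmentation masks.
def dice(pred, truth):
    """DSC = 2 * |intersection| / (|pred| + |truth|); 1.0 for two
    empty masks by convention."""
    inter = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 2 * inter / total if total else 1.0

# Predicted vs ground-truth masks on a toy 8-pixel image:
# 3 pixels agree, 4 foreground pixels in each mask -> 2*3/8 = 0.75.
score = dice([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 0, 1, 1])
```

Because DSC weights the overlap against the combined foreground size, it penalises both missed vessel pixels and spurious ones, which is why it is the standard headline metric for segmentation models like this one.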

