Distribution-preserving data augmentation

2021 ◽  
Vol 7 ◽  
pp. e571
Author(s):  
Nurdan Ayse Saran ◽  
Murat Saran ◽  
Fatih Nar

In the last decade, deep learning has been applied to a wide range of problems with tremendous success. This success mainly comes from large data availability, increased computational power, and theoretical improvements in the training phase. As the dataset grows, the real world is better represented, making it possible to develop a model that can generalize. However, creating a labeled dataset is expensive, time-consuming, and in some domains difficult if not impossible. Therefore, researchers have proposed data augmentation methods that increase dataset size and variety by creating variations of the existing data. For image data, variations can be obtained by applying color or spatial transformations, either alone or in combination. Color transformations perform linear or nonlinear operations on the entire image or on patches to create variations of the original image. Current color-based augmentation methods are usually built on image processing operations such as equalizing, solarizing, and posterizing. Nevertheless, these color-based data augmentation methods do not guarantee plausible variations of the image. This paper proposes a novel distribution-preserving data augmentation method that creates plausible image variations by shifting pixel colors to another point in the image color distribution. We achieve this by defining a regularized density-decreasing direction that creates paths from each original pixel color toward the tails of the distribution. The proposed method provides superior performance compared to existing data augmentation methods, as shown in a transfer learning scenario on the UC Merced Land-use, Intel Image Classification, and Oxford-IIIT Pet datasets for classification and segmentation tasks.
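
The abstract describes the density-decreasing shift only at a high level, so the snippet below is a minimal Python sketch of the idea, assuming a kernel density estimate over RGB colors and a finite-difference gradient; the function name, step size, and regularization weight are illustrative, not the authors' formulation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_decreasing_shift(image, step=2.0, lam=0.5, eps=1e-3):
    """Shift each pixel color a small step along the density-decreasing
    direction of the image's own color distribution (illustrative sketch,
    not the paper's exact regularized path construction)."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(np.float64)

    # Estimate the color distribution from a subsample of pixels.
    n = min(2000, len(pixels))
    sample = pixels[np.random.choice(len(pixels), n, replace=False)]
    kde = gaussian_kde(sample.T)

    # Finite-difference gradient of the density at each pixel color.
    grad = np.zeros_like(pixels)
    base = kde(pixels.T)
    for ch in range(c):
        shifted = pixels.copy()
        shifted[:, ch] += eps
        grad[:, ch] = (kde(shifted.T) - base) / eps

    # Move against the gradient (toward the distribution tails),
    # damped by lam so the new colors stay plausible.
    norm = np.linalg.norm(grad, axis=1, keepdims=True) + 1e-12
    new_pixels = pixels - step * lam * grad / norm
    return np.clip(new_pixels, 0, 255).reshape(h, w, c).astype(np.uint8)
```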

2017 ◽  
Author(s):  
Susan D Shenkin ◽  
Cyril Pernet ◽  
Thomas E Nichols ◽  
Jean-Baptiste Poline ◽  
...  

Abstract Brain imaging is now ubiquitous in clinical practice and research. The case for bringing together large amounts of image data from well-characterised healthy subjects and those with a range of common brain diseases across the life course is now compelling. This report follows a meeting of international experts from multiple disciplines, all interested in brain image biobanking. The meeting included neuroimaging experts (clinical and non-clinical), computer scientists, epidemiologists, clinicians, ethicists, and lawyers involved in creating brain image banks. The meeting followed a structured format to discuss current and emerging brain image banks; applications such as atlases; conceptual and statistical problems (e.g. defining ‘normality’); legal, ethical and technological issues (e.g. consents, potential for data linkage, data security, harmonisation, data storage and enabling of research data sharing). We summarise the lessons learned from the experiences of a wide range of individual image banks, and provide practical recommendations to enhance creation, use and reuse of neuroimaging data. Our aim is to maximise the benefit of the image data, provided voluntarily by research participants and funded by many organisations, for human health. Our ultimate vision is of a federated network of brain image biobanks accessible for large studies of brain structure and function.


Author(s):  
Steve Blair ◽  
Jon Cotter

The need for high-performance Data Mining (DM) algorithms is driven by exponentially increasing data availability, such as images, audio, and video, from a variety of domains, including social networks and the Internet of Things (IoT). Deep learning is an emerging field of pattern recognition and Machine Learning (ML) research. It offers computational models built from many nonlinear processing layers of neurons that may be used to learn and interpret data at higher levels of abstraction. Deep learning models, which may be deployed in cloud technology and large computational systems, can inherently capture complex structures in large data sets. Heterogeneity is one of the most prominent characteristics of large data sets, and Heterogeneous Computing (HC) causes issues with system integration and Advanced Analytics. This article presents HC processing techniques, Big Data Analytics (BDA), large-dataset instruments, and some classic ML and DM methodologies. The application of deep learning to Data Analytics is investigated. The benefits of integrating BDA, deep learning, HPC (High Performance Computing), and HC are highlighted. Data Analytics and coping with a wide range of data types are discussed.


2018 ◽  
Vol 34 (2) ◽  
pp. 263-276 ◽  
Author(s):  
Peter Ako Larbi

Abstract. Microsoft Excel is not considered typical software for digital image processing and analysis. However, given its large data handling and graphing capabilities, as well as its widespread usage, it presents a good opportunity for use as a tool for teaching image data processing, or for use in demonstrations requiring little training. It also lends itself well as a potentially useful research tool that can benefit a wide range of users, including those with little or no computer programming knowledge. This article demonstrates a new method which can be adopted for teaching concepts of image processing and analysis, consisting of systematic procedures for implementing typical operations in Excel. Categories of operations demonstrated using this method include image preprocessing, image enhancement, image classification, analysis of change over time, and image data fusion. Examples of outputs resulting from using this new method are discussed in the article. The success of the proposed method hinges on the availability of the required image data, for which a simple graphical user interface (GUI) application was developed in MATLAB. That application, RGBExcel or the later RGB2X, extracts RGB image data from image files of any format and file size, and exports it to Excel for processing. Deployed as standalone applications, both versions can be installed on a 64-bit Windows computer and run without MATLAB. Keywords: Color images, Multispectral imagery, Remote sensing, RGB image data, RGB2X, RGBExcel.
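
RGBExcel and RGB2X are MATLAB applications; the snippet below is only a hypothetical Python analogue of the same extract-to-Excel idea, writing each RGB channel to its own worksheet so the pixel values can be manipulated cell by cell. File names are placeholders, and note that Excel's worksheet limits (16,384 columns) cap the image width this toy version can handle.

```python
import numpy as np
import pandas as pd
from PIL import Image

def image_to_excel(image_path, xlsx_path):
    """Export the R, G, and B channels of an image to three Excel
    sheets, one pixel value per cell (illustrative sketch only)."""
    rgb = np.asarray(Image.open(image_path).convert("RGB"))
    with pd.ExcelWriter(xlsx_path) as writer:
        for i, name in enumerate(["R", "G", "B"]):
            # One worksheet per channel.
            pd.DataFrame(rgb[:, :, i]).to_excel(
                writer, sheet_name=name, header=False, index=False)

image_to_excel("field_photo.jpg", "field_photo_rgb.xlsx")  # placeholder paths
```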


Blood ◽  
2021 ◽  
Vol 138 (20) ◽  
pp. 1917-1927
Author(s):  
Christian Matek ◽  
Sebastian Krappe ◽  
Christian Münzenmayer ◽  
Torsten Haferlach ◽  
Carsten Marr

Abstract Biomedical applications of deep learning algorithms rely on large expert annotated data sets. The classification of bone marrow (BM) cell cytomorphology, an important cornerstone of hematological diagnosis, is still done manually thousands of times every day because of a lack of data sets and trained models. We applied convolutional neural networks (CNNs) to a large data set of 171 374 microscopic cytological images taken from BM smears from 945 patients diagnosed with a variety of hematological diseases. The data set is the largest expert-annotated pool of BM cytology images available in the literature. It allows us to train high-quality classifiers of leukocyte cytomorphology that identify a wide range of diagnostically relevant cell species with high precision and recall. Our CNNs outcompete previous feature-based approaches and provide a proof-of-concept for the classification problem of single BM cells. This study is a step toward automated evaluation of BM cell morphology using state-of-the-art image-classification algorithms. The underlying data set represents an educational resource, as well as a reference for future artificial intelligence–based approaches to BM cytomorphology.
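
The abstract does not specify the network architecture or training setup, so the following is only a generic PyTorch transfer-learning sketch of the kind of single-cell image classifier the study describes; NUM_CLASSES and all other names are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 21  # assumption: number of annotated BM cell classes

# Start from an ImageNet-pretrained backbone and replace the classifier head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """One fine-tuning step on a batch of single-cell images."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```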


2021 ◽  
pp. 002234332098082
Author(s):  
Gabriella Lloyd

Tasks Assigned to Missions in their Mandates (TAMM) provides comprehensive new data on the mandates of UN missions between 1948 and 2015. Until now, datasets have described mandates in terms of their influential characteristics, such as whether they are robust or multidimensional, or placed them into broad categories driven by idiosyncratic theoretical expectations. Despite limitations on data availability, mandates have been tied to numerous outcomes related to peacekeeping effectiveness. TAMM meets the need for flexible, minimally processed, and fine-grained data on mission mandates by recording the full range of tasks in mandates. The dataset comes in mission-resolution and mission-month versions that are designed to complement existing data on peacekeeping and to be easily adaptable to a wide range of research interests. In this article, I introduce TAMM and use the data to conduct a replication and expansion of Hultman, Kathman and Shannon (2014). I find evidence that missions whose mandates direct them to provide security guarantees and raise the costs of fighting reduce battlefield hostilities.


2020 ◽  
pp. 1-10
Author(s):  
Bryce J. Dietrich

Abstract Although previous scholars have used image data to answer important political science questions, less attention has been paid to video-based measures. In this study, I use motion detection to understand the extent to which members of Congress (MCs) literally cross the aisle, although motion detection can be used to study a wide range of political phenomena, such as protests, political speeches, campaign events, or oral arguments. I find that not only are Democrats and Republicans less willing to literally cross the aisle, but that this behavior is also predictive of future party voting, even when previous party voting is included as a control. This is only one of the many ways motion detection can be used by social scientists. In this way, the present study is not the end, but the beginning of an important new line of research in which video data is more actively used in social science research.
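
As a flavor of what "motion detection" means computationally, here is a minimal frame-differencing sketch in Python with OpenCV; it is illustrative of the general technique, not the author's pipeline, and the threshold value is an arbitrary assumption.

```python
import cv2
import numpy as np

def motion_per_frame(video_path, threshold=25):
    """Return, for each frame, the fraction of pixels whose grayscale
    intensity changed by more than `threshold` since the previous frame."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    scores = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(gray, prev)
        scores.append(float(np.mean(diff > threshold)))
        prev = gray
    cap.release()
    return scores
```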


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Eleanor F. Miller ◽  
Andrea Manica

Abstract Background Today an unprecedented amount of genetic sequence data is stored in publicly available repositories. For decades now, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result, there is a large volume of mtDNA data available in these repositories for a wide range of species. Indeed, whilst whole genome sequencing is an exciting prospect for the future, for most non-model organisms, classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species’ demographic past. However, compiling data in this manner is not trivial; there are many complexities associated with data extraction, data quality and data handling. Results Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data, with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions. Conclusions There is now more genetic information available than ever before, and large multi-study data sets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains a technically challenging prospect. The mtDNAcombine package provides a pipeline to streamline the process of downloading, curating, and analysing sequence data, guiding the process of compiling data sets from the online database GenBank.
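
mtDNAcombine itself is an R package; purely to illustrate the first stage of such a pipeline, here is a hypothetical Python analogue using Biopython's Entrez interface to pull mtDNA records for a species from GenBank. The search term, gene name, and species are assumptions chosen for the example.

```python
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # NCBI requires a contact address; placeholder

def fetch_mtdna(species, gene="cytb", retmax=100):
    """Download up to `retmax` GenBank records for a mitochondrial gene
    of the given species (illustrative query, not mtDNAcombine's logic)."""
    term = f"{species}[Organism] AND {gene}[Gene] AND mitochondrion[filter]"
    search = Entrez.read(
        Entrez.esearch(db="nucleotide", term=term, retmax=retmax))
    handle = Entrez.efetch(db="nucleotide", id=search["IdList"],
                           rettype="gb", retmode="text")
    return list(SeqIO.parse(handle, "genbank"))

records = fetch_mtdna("Erithacus rubecula")  # example species: European robin
```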


2020 ◽  
Vol 6 (1) ◽  
Author(s):  
Malte Seemann ◽  
Lennart Bargsten ◽  
Alexander Schlaefer

Abstract Deep learning methods produce promising results when applied to a wide range of medical imaging tasks, including segmentation of the artery lumen in computed tomography angiography (CTA) data. However, to perform sufficiently well, neural networks have to be trained on large amounts of high-quality annotated data. In the realm of medical imaging, annotations are not only quite scarce but also often not entirely reliable. To tackle both challenges, we developed a two-step approach for generating realistic synthetic CTA data for the purpose of data augmentation. In the first step, moderately realistic images are generated in a purely numerical fashion. In the second step, these images are improved by applying neural domain adaptation. We evaluated the impact of synthetic data on lumen segmentation via convolutional neural networks (CNNs) by comparing the resulting performances. Improvements of up to 5% in terms of Dice coefficient and 20% for Hausdorff distance represent a proof of concept that the proposed augmentation procedure can be used to enhance deep learning-based segmentation of the artery lumen in CTA images.
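
For reference, the two evaluation metrics the abstract quotes can be computed on binary segmentation masks as in the sketch below (generic helper code, not the authors' evaluation script).

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(pred, target, eps=1e-7):
    """Dice overlap between two binary masks (1.0 = perfect agreement)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def hausdorff_distance(pred, target):
    """Symmetric Hausdorff distance between the masks' foreground points."""
    p = np.argwhere(pred)
    t = np.argwhere(target)
    return max(directed_hausdorff(p, t)[0], directed_hausdorff(t, p)[0])
```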


2021 ◽  
Vol 11 (15) ◽  
pp. 6721
Author(s):  
Jinyeong Wang ◽  
Sanghwan Lee

As automated surface inspection increases manufacturing productivity in smart factories, the demand for machine vision is rising. Recently, convolutional neural networks (CNNs) have demonstrated outstanding performance and solved many problems in the field of computer vision, and many machine vision systems now apply CNNs to surface defect inspection. In this study, we developed an effective data augmentation method for grayscale images in CNN-based machine vision with mono cameras. Our method can be applied to grayscale industrial images, and we demonstrated outstanding performance on image classification and object detection tasks. The main contributions of this study are as follows: (1) We propose a data augmentation method that can be performed when training CNNs with industrial images taken with mono cameras. (2) We demonstrate that image classification or object detection performance is better when training with industrial image data augmented by the proposed method. Through the proposed method, many machine-vision problems involving mono cameras can be effectively solved using CNNs.
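
The abstract does not spell out the augmentation operations, so the pipeline below is only a generic example of grayscale-safe augmentations for mono-camera images in torchvision, not the authors' method; the specific transforms and their parameters are assumptions.

```python
from torchvision import transforms

grayscale_augment = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    # Brightness/contrast jitter is meaningful for single-channel data;
    # hue/saturation jitter is not, so it is omitted here.
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```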


2021 ◽  
Vol 13 (14) ◽  
pp. 2656
Author(s):  
Furong Shi ◽  
Tong Zhang

Deep-learning technologies, especially convolutional neural networks (CNNs), have achieved great success in building extraction from aerial images. However, shape details are often lost during the down-sampling process, which results in discontinuous segmentation or inaccurate segmentation boundaries. In order to compensate for the loss of shape information, two shape-related auxiliary tasks (i.e., boundary prediction and distance estimation) were jointly learned with the building segmentation task in our proposed network. Meanwhile, two consistency-constraint losses were designed on top of the multi-task network to exploit the duality between the mask prediction and the two shape-related predictions. Specifically, an atrous spatial pyramid pooling (ASPP) module was appended to the top of the encoder of a U-shaped network to obtain multi-scale features. Based on the multi-scale features, one regression loss and two classification losses were used for predicting the distance-transform map, the segmentation mask, and the boundary. Two inter-task consistency-loss functions were constructed to ensure the consistency between distance maps and masks, and between masks and boundary maps. Experimental results on three public aerial image data sets showed that our method achieved superior performance over recent state-of-the-art models.
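
A hedged PyTorch sketch of such a three-head multi-task objective is given below: two classification losses (mask, boundary), one regression loss (distance map), and simple surrogate consistency terms. The paper's exact consistency formulations are not given in the abstract, so the ones here are illustrative stand-ins.

```python
import torch
import torch.nn.functional as F

def multitask_loss(seg_logits, boundary_logits, dist_pred,
                   seg_gt, boundary_gt, dist_gt, w_cons=0.1):
    # Primary task losses: mask and boundary classification, distance regression.
    l_seg = F.binary_cross_entropy_with_logits(seg_logits, seg_gt)
    l_bnd = F.binary_cross_entropy_with_logits(boundary_logits, boundary_gt)
    l_dst = F.mse_loss(dist_pred, dist_gt)

    # Consistency 1 (surrogate): a squashed distance map should agree
    # with the predicted mask probabilities.
    seg_prob = torch.sigmoid(seg_logits)
    l_cons1 = F.mse_loss(torch.sigmoid(dist_pred), seg_prob)

    # Consistency 2 (surrogate): the mask's spatial gradient magnitude
    # should line up with the predicted boundary map.
    gy = torch.abs(seg_prob[..., 1:, :] - seg_prob[..., :-1, :])
    gx = torch.abs(seg_prob[..., :, 1:] - seg_prob[..., :, :-1])
    edge = F.pad(gy, (0, 0, 0, 1)) + F.pad(gx, (0, 1, 0, 0))
    l_cons2 = F.mse_loss(edge, torch.sigmoid(boundary_logits))

    return l_seg + l_bnd + l_dst + w_cons * (l_cons1 + l_cons2)
```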

