A Survey of Active Learning for Quantifying Vegetation Traits from Terrestrial Earth Observation Data

Katja Berger; Juan Pablo Rivera Caicedo; Luca Martino; Matthias Wocher; Tobias Hank; Jochem Verrelst

doi:10.3390/rs13020287

Remote Sensing ◽

10.3390/rs13020287 ◽

2021 ◽

Vol 13 (2) ◽

pp. 287

Author(s):

Katja Berger ◽

Juan Pablo Rivera Caicedo ◽

Luca Martino ◽

Matthias Wocher ◽

Tobias Hank ◽

...

Keyword(s):

Machine Learning ◽

Active Learning ◽

Earth Observation ◽

Training Data ◽

Data Sets ◽

Sampled Data ◽

Vegetation Monitoring ◽

Regression Algorithms ◽

Regression Problems ◽

Sampling Procedures

The current exponential increase of spatiotemporally explicit data streams from satellite-based Earth observation missions offers promising opportunities for global vegetation monitoring. Intelligent sampling through active learning (AL) heuristics provides a pathway for fast inference of essential vegetation variables by means of hybrid retrieval approaches, i.e., machine learning regression algorithms trained by radiative transfer model (RTM) simulations. In this study we summarize AL theory and perform a brief systematic literature survey about AL heuristics used in the context of Earth observation regression problems over terrestrial targets. Across all relevant studies it appeared that: (i) models trained on AL-optimized training data sets achieved higher retrieval accuracy than models trained on large randomly sampled data sets, and (ii) the Euclidean distance-based diversity (EBD) method tends to be the most efficient AL technique in terms of accuracy and computational demand. Additionally, a case study is presented based on experimental data employing both uncertainty and diversity AL criteria. Here, a training database simulated with the PROSAIL-PRO canopy RTM is used to demonstrate the benefit of AL techniques for the estimation of total leaf carotenoid content (Cxc) and leaf water content (Cw). Gaussian process regression (GPR) was incorporated to minimize and optimize the training data set with AL. Training the GPR algorithm on optimally AL-based sampled data sets led to improved variable retrievals compared to training on full data pools, which is further demonstrated on a mapping example. From these findings we can recommend the use of AL-based sub-sampling procedures to select the most informative samples out of large training data pools. This will not only optimize regression accuracy due to exclusion of redundant information, but also speed up processing time and reduce final model size of kernel-based machine learning regression algorithms, such as GPR.
With this study we want to encourage further testing and implementation of AL sampling methods for hybrid retrieval workflows. AL can contribute to the solution of regression problems within the framework of operational vegetation monitoring using satellite imaging spectroscopy data, and may strongly facilitate data processing for cloud-computing platforms.
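
The diversity criterion mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "Euclidean distance-based diversity" means picking, at each step, the pool sample whose nearest training sample is farthest away; the function name `ebd_select` and the toy data are hypothetical and are not taken from the paper, which may define the criterion differently.

```python
import math


def ebd_select(train, pool, n_add):
    """Greedy distance-based diversity selection (illustrative sketch).

    At each step, the pool sample whose nearest training sample is
    farthest away is moved into the training set.
    Returns the chosen pool indices in selection order.
    """
    train = [tuple(x) for x in train]
    remaining = list(range(len(pool)))
    chosen = []
    for _ in range(min(n_add, len(remaining))):
        # Score each candidate by the distance to its nearest training sample.
        best = max(
            remaining,
            key=lambda i: min(math.dist(pool[i], t) for t in train),
        )
        chosen.append(best)
        train.append(tuple(pool[best]))
        remaining.remove(best)
    return chosen


# Toy two-dimensional "spectra"; real inputs would be RTM-simulated reflectances.
train = [(0.0, 0.0)]
pool = [(0.1, 0.0), (5.0, 5.0), (0.0, 0.2), (5.1, 5.0), (-4.0, 0.0)]
selected = ebd_select(train, pool, 2)
print(selected)  # [3, 4]
```

In this toy run the second pick skips index 1 because it sits next to the sample chosen first, which is the redundancy-avoiding behaviour the abstract attributes to AL sub-sampling.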

AI-Ready Training Datasets for Earth Observation: Enabling FAIR data principles for EO training data.

10.5194/egusphere-egu21-12384 ◽

2021 ◽

Author(s):

Alastair McKinstry ◽

Oisin Boydell ◽

Quan Le ◽

Inder Preet ◽

Jennifer Hanafin ◽

...

Keyword(s):

Machine Learning ◽

Best Practices ◽

Forest Biomass ◽

Earth Observation ◽

Training Data ◽

Training Dataset ◽

Data Provenance ◽

Data Sets ◽

Model Training ◽

Ice Detection

The ESA-funded AIREO project [1] sets out to produce AI-ready training dataset specifications and best practices to support the training and development of machine learning models on Earth Observation (EO) data. While the quality and quantity of EO data have increased drastically over the past decades, availability of training data for machine learning applications is considered a major bottleneck. The goal is to move towards implementing FAIR data principles for training data in EO, enhancing especially the findability, interoperability and reusability aspects. To achieve this goal, AIREO sets out to provide a training data specification and to develop best practices for the use of training datasets in EO. An additional goal is to make training data sets self-explanatory (“AI-ready”) in order to expose challenging problems to a wider audience that does not have expert geospatial knowledge. Key elements that are addressed in the AIREO specification are granular and interoperable metadata (based on STAC), innovative Quality Assurance metrics, data provenance and processing history as well as integrated feature engineering recipes that optimize platform independence. Several initial pilot datasets are being developed following the AIREO data specifications. These pilot applications include, for example, forest biomass, sea ice detection and the estimation of atmospheric parameters. An API for the easy exploitation of these datasets will be provided to allow the Training Datasets (TDS) to work against EO catalogs (based on OGC STAC catalogs and best practices from the ML community) and to allow updating and updated model training over time. This presentation will present the first version of the AIREO training dataset specification and will showcase some elements of the best practices that were developed.
The AIREO-compliant pilot datasets, which are openly accessible, will be presented, and community feedback is explicitly encouraged. [1] https://aireo.net/

Headnote Prediction Using Machine Learning

The International Arab Journal of Information Technology ◽

10.34028/iajit/18/5/7 ◽

2021 ◽

Vol 18 (5) ◽

Author(s):

Sarmad Mahar ◽

Sahar Zafar ◽

Kamran Nishat

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Active Learning ◽

Text Classification ◽

Extraction Methods ◽

Text Summarization ◽

Training Data ◽

Second Step ◽

Support Vector ◽

Classification Algorithms

Headnotes are the precise explanation and summary of legal points in an issued judgment. Law journals hire experienced lawyers to write these headnotes. These headnotes help the reader quickly determine the issue discussed in the case. Headnotes comprise two parts. The first part comprises the topic discussed in the judgment, and the second part contains a summary of that judgment. In this thesis, we design, develop and evaluate headnote prediction using machine learning, without human involvement. We divided this task into a two-step process. In the first step, we predict the law points used in the judgment by using text classification algorithms. The second step generates a summary of the judgment using text summarization techniques. To achieve this task, we created a databank by extracting data from different law sources in Pakistan. We labelled training data generated from Pakistani law websites. We tested different feature extraction methods on judiciary data to improve our system. Using these feature extraction methods, we developed a dictionary of terminology for ease of reference and utility. Our approach achieves 65% accuracy by using Linear Support Vector Classification with tri-grams and without a stemmer. Using active learning, our system can continuously improve its accuracy as users of the system provide more labelled examples.
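
As a rough illustration of the tri-gram features mentioned above, the sketch below counts word-level tri-grams without stemming. Whether the authors used word or character tri-grams is not stated, so the word-level choice, the helper name `word_ngrams`, and the example sentence are all assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(text, n=3):
    """Return the word-level n-grams of a text, lowercased and unstemmed."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sentence; a real system would process full judgment texts.
sentence = "The appeal is dismissed and the appeal is allowed in part"
features = Counter(word_ngrams(sentence))
print(features["the appeal is"])  # 2
```

Counts like these would then form the feature vector passed to a linear classifier.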

Artificially Generated Training Data-sets for Supervised Machine Learning Techniques in Magnetic Resonance Imaging: An Example in Myocardial Segmentation

2019 Computing in Cardiology Conference (CinC) ◽

10.22489/cinc.2019.220 ◽

2019 ◽

Author(s):

Christos Xanthis ◽

Kostas Haris ◽

Dimitrios Filos ◽

Anthony Aletras

Keyword(s):

Magnetic Resonance Imaging ◽

Machine Learning ◽

Magnetic Resonance ◽

Training Data ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Data Sets ◽

Resonance Imaging ◽

Learning Techniques ◽

Myocardial Segmentation

Curator: A No-Code Self-Supervised Learning and Active Labeling Tool to Create Labeled Image Datasets from Petabyte-Scale Imagery

10.5194/egusphere-egu21-6853 ◽

2021 ◽

Author(s):

Rudy Venguswamy ◽

Mike Levy ◽

Anirudh Koul ◽

Satyarth Praveen ◽

Tarun Narayanan ◽

...

Keyword(s):

Machine Learning ◽

Active Learning ◽

Open Source ◽

Forest Fires ◽

Seed Set ◽

Training Data ◽

Training Dataset ◽

Reference Image ◽

Query Image ◽

Real World Datasets

Machine learning modeling for Earth events at NASA is often limited by the availability of labeled examples. For example, training classifiers for forest fires or oil spills from satellite imagery requires curating a massive and diverse dataset of example forest fires, a tedious multi-month effort requiring careful review of over 196.9 million square miles of data per day for 20 years. While such images might exist in abundance within 40 petabytes of unlabeled satellite data, finding these positive examples to include in a training dataset for a machine learning model is extremely time-consuming and requires researchers to "hunt" for positive examples, like finding a needle in a haystack. We present a no-code open-source tool, Curator, whose goal is to minimize the amount of manual image labeling needed to achieve a state-of-the-art classifier. The pipeline, purpose-built to take advantage of the massive amount of unlabeled images, consists of (1) self-supervision training to convert unlabeled images into meaningful representations, (2) search-by-example to collect a seed set of images, and (3) human-in-the-loop active learning to iteratively ask for labels on uncertain examples and train on them. In step 1, a model capable of representing unlabeled images meaningfully is trained with a self-supervised algorithm (like SimCLR) on a random subset of the dataset (that conforms to researchers’ specified “training budget”). Since real-world datasets are often imbalanced, leading to suboptimal models, the initial model is used to generate embeddings on the entire dataset. Then, images with equidistant embeddings are sampled. This iterative training and resampling strategy improves both the balance of the training data and the models with every iteration.
In step 2, researchers supply an example image of interest, and the output embeddings generated from this image are used to find other images with embeddings near the reference image’s embedding in Euclidean space (hence images that look similar to the query image). These proposed candidate images contain a higher density of positive examples and are annotated manually as a seed set. In step 3, the seed labels are used to train a classifier to identify more candidate images for human inspection with active learning. In each classification training loop, candidate images for labeling are sampled from the larger unlabeled dataset based on the images that the model is most uncertain about (p ≈ 0.5). Curator is released as an open-source package built on PyTorch-Lightning. The pipeline uses GPU-based transforms from the NVIDIA-Dali package for augmentation, leading to a 5-10x speed-up in self-supervised training, and is run from the command line. By iteratively training a self-supervised model and a classifier in tandem with human manual annotation, this pipeline is able to unearth more positive examples from severely imbalanced datasets which were previously untrainable with self-supervision algorithms. In applications such as detecting wildfires, atmospheric dust, or turning outward with telescopic surveys, increasing the number of positive candidates presented to humans for manual inspection increases the efficacy of classifiers and multiplies the efficiency of researchers’ data curation efforts.
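
Steps 2 and 3 of the pipeline can be sketched in a few lines. This is not Curator's code: the function names, the toy embeddings, and the probabilities below are hypothetical, and the sketch only assumes that "search by example" is a nearest-neighbour lookup in Euclidean space and that "most uncertain" means a predicted probability closest to 0.5.

```python
import math


def nearest_to_reference(embeddings, reference, k):
    """Indices of the k embeddings closest to the reference (Euclidean)."""
    order = sorted(
        range(len(embeddings)),
        key=lambda i: math.dist(embeddings[i], reference),
    )
    return order[:k]


def most_uncertain(probabilities, k):
    """Indices of the k predicted probabilities closest to 0.5."""
    order = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return order[:k]


# Step 2: collect seed candidates near a reference image's embedding.
embeddings = [(0.0, 1.0), (0.9, 0.1), (1.0, 0.0), (0.2, 0.8)]
seed_candidates = nearest_to_reference(embeddings, reference=(1.0, 0.0), k=2)

# Step 3: ask a human to label the images the classifier is least sure about.
probs = [0.02, 0.48, 0.97, 0.60, 0.51]
to_label = most_uncertain(probs, k=2)

print(seed_candidates)  # [2, 1]
print(to_label)         # [4, 1]
```

At petabyte scale an exhaustive sort like this would be replaced by an approximate nearest-neighbour index, but the selection logic stays the same.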

Application of multi-sensor unmanned aerial system for identification of hydrothermal alteration zones

10.5194/egusphere-egu2020-12546 ◽

2020 ◽

Author(s):

Yosoon Choi ◽

Jieun Baek ◽

Jangwon Suh ◽

Sung-Min Kim

Keyword(s):

Machine Learning ◽

Classification Accuracy ◽

Training Data ◽

Sensor Data ◽

Machine Learning Techniques ◽

Integrated Analysis ◽

Unmanned Aerial System ◽

Data Sets ◽

Learning Techniques ◽

Hydrothermal Alteration Zones

In this study, we propose a method that uses a multi-sensor Unmanned Aerial System (UAS) for exploration of hydrothermal alteration zones. We selected a coastal study area (10 m × 20 m) composed mainly of andesite, with wide outcrops and well-developed structural and mineralization elements. Multi-sensor (visible, multispectral, thermal, magnetic) data were acquired in the study area using the UAS and were analyzed using machine learning techniques. To apply the machine learning techniques, we used the stratified random method to sample 1000 training data in the hydrothermal zone and 1000 training data in the non-hydrothermal zone, both identified through the field survey. The 2000 training data created for supervised learning were first divided into 1500 for training and 500 for testing. Then, the 1500 training data were divided into 1200 for training and 300 for validation. The training and validation data for machine learning were generated in five sets to enable cross-validation. Five types of machine learning techniques were applied to the training data sets: k-Nearest Neighbors (k-NN), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Deep Neural Network (DNN). In the integrated analysis of multi-sensor data using these five techniques, RF and SVM showed high classification accuracy of about 90%. Moreover, integrated analysis of multi-sensor data yielded higher classification accuracy for all five machine learning techniques than analysis of magnetic sensing data or single optical sensing data alone.
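
The split sizes above (2000 → 1500/500, then 1500 → 1200/300 in five sets) are consistent with a five-fold cross-validation layout, which the following sketch reproduces. It assumes that the test split and the folds are class-balanced; the abstract only states that sampling was stratified, so that detail, the function name `stratified_split`, and the fixed seed are illustrative assumptions.

```python
import random


def stratified_split(labels, n_test_per_class, n_folds, seed=0):
    """Hold out a class-balanced test set and deal the rest into folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    test = []
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        test.extend(indices[:n_test_per_class])
        # Deal the remaining indices round-robin so folds stay class-balanced.
        for pos, idx in enumerate(indices[n_test_per_class:]):
            folds[pos % n_folds].append(idx)
    return test, folds


# 1000 hydrothermal (1) and 1000 non-hydrothermal (0) samples.
labels = [1] * 1000 + [0] * 1000
test, folds = stratified_split(labels, n_test_per_class=250, n_folds=5)

# Each fold serves once as validation; the other four folds form the training set.
train_val = [
    (sum(len(f) for j, f in enumerate(folds) if j != i), len(folds[i]))
    for i in range(5)
]
print(len(test), train_val[0])  # 500 (1200, 300)
```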

A Modified Incremental Support Vector Machine for Regression

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.135-136.63 ◽

2011 ◽

Vol 135-136 ◽

pp. 63-69 ◽

Cited By ~ 1

Author(s):

Jian Guo Wang ◽

Liang Wu Cheng ◽

Wen Xing Zhang ◽

Bo Qin

Keyword(s):

Support Vector Machine ◽

Predictive Power ◽

Training Data ◽

Support Vector ◽

Mechanical Equipment ◽

Final Decision ◽

Data Sets ◽

Regression Problems ◽

Traditional Approaches ◽

Speed And Accuracy

The support vector machine (SVM) has been shown to exhibit superior predictive power compared to traditional approaches in many studies, such as mechanical equipment monitoring and diagnosis. However, SVM training is very costly in terms of time and memory consumption due to the enormous amounts of training data and the quadratic programming problem. In order to improve SVM training speed and accuracy, we propose a modified incremental support vector machine (MISVM) for regression problems in this paper. The main idea is to use the distance from the margin vectors that violate the Karush-Kuhn-Tucker (KKT) condition to the final decision hyperplane to evaluate the importance of each margin vector: the margin vectors whose distance is below a specified value are preserved, and the others are eliminated. Then the original SVs and the remaining margin vectors are used to train a new SVM. The proposed MISVM can not only eliminate unimportant samples such as noise samples, but also preserve the important samples. The effectiveness of the proposed MISVM is demonstrated with two UCI data sets. These experiments also show that the proposed MISVM is competitive with previously published methods.

Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocu041 ◽

2015 ◽

Vol 22 (3) ◽

pp. 671-681 ◽

Cited By ~ 145

Author(s):

Azadeh Nikfarjam ◽

Abeed Sarker ◽

Karen O’Connor ◽

Rachel Ginn ◽

Graciela Gonzalez

Keyword(s):

Machine Learning ◽

Social Media ◽

Language Processing ◽

High Performance ◽

Conditional Random Fields ◽

Training Data ◽

Data Sets ◽

Social Media Mining ◽

Medical Concepts ◽

Media Mining

Objective: Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods: We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words’ semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results: ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion: It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.

Predicting Fault Slip via Transfer Learning

10.21203/rs.3.rs-700852/v1 ◽

2021 ◽

Author(s):

Kun Wang ◽

Christopher Johnson ◽

Kane Bennett ◽

Paul Johnson

Keyword(s):

Machine Learning ◽

Numerical Simulations ◽

Transfer Learning ◽

Laboratory Experiments ◽

Laboratory Data ◽

Fault Slip ◽

Geophysical Data ◽

Training Data ◽

Data Sets ◽

Earthquake Cycle

Data-driven machine learning for predicting instantaneous and future fault slip in laboratory experiments has recently progressed markedly due to large training data sets. In Earth, however, earthquake interevent times range from tens to hundreds of years, and geophysical data typically exist for only a portion of an earthquake cycle. Sparse data present a serious challenge to training machine learning models. Here we describe a transfer learning approach using numerical simulations to train a convolutional encoder-decoder that predicts fault-slip behavior in laboratory experiments. The model learns a mapping between acoustic emission histories and fault slip from numerical simulations, and generalizes to produce accurate results using laboratory data. Notably, slip predictions markedly improve when using the model trained on simulation data and training the latent space using a portion of a single laboratory earthquake cycle. The transfer learning results elucidate the potential of using models trained on numerical simulations and fine-tuned with small geophysical data sets for applications to faults in Earth.

Integrating active learning and crowdsourcing into large-scale supervised landcover mapping algorithms

10.7287/peerj.preprints.3004v1 ◽

2017 ◽

Cited By ~ 1

Author(s):

Stephanie R Debats ◽

Lyndon D Estes ◽

David R Thompson ◽

Kelly K Caylor

Keyword(s):

Active Learning ◽

Large Scale ◽

Learning Algorithm ◽

Training Data ◽

Sub Saharan Africa ◽

Data Sets ◽

Field Patterns ◽

Sub Saharan ◽

Highly Correlated ◽

Computational Resources

Sub-Saharan Africa and other developing regions of the world are dominated by smallholder farms, which are characterized by small, heterogeneous, and often indistinct field patterns. In previous work, we developed an algorithm for mapping both smallholder and commercial agricultural fields that includes efficient extraction of a vast set of simple, highly correlated, and interdependent features, followed by a random forest classifier. In this paper, we demonstrate how active learning can be incorporated in the algorithm to create smaller, more efficient training data sets, which reduce computational resources, minimize the need for humans to hand-label data, and boost performance. We designed a patch-based uncertainty metric to drive the active learning framework, based on the regular grid of a crowdsourcing platform, and demonstrated how subject matter experts can be replaced with fleets of crowdsourcing workers. Our active learning algorithm achieved performance similar to that of an algorithm trained with randomly selected data, but with 62% fewer data samples.

Gender bias in machine learning for sentiment analysis

Online Information Review ◽

10.1108/oir-05-2017-0153 ◽

2018 ◽

Vol 42 (3) ◽

pp. 343-354 ◽

Cited By ~ 3

Author(s):

Mike Thelwall

Keyword(s):

Machine Learning ◽

Sentiment Analysis ◽

Gender Bias ◽

Training Data ◽

Data Sets ◽

Content Type ◽

Female Authors ◽

Gender Biases ◽

Mixed Gender ◽

Gender Specific

Purpose: The purpose of this paper is to investigate whether machine learning induces gender biases in the sense of results that are more accurate for male authors or for female authors. It also investigates whether training separate male and female variants could improve the accuracy of machine learning for sentiment analysis. Design/methodology/approach: This paper uses ratings-balanced sets of reviews of restaurants and hotels (3 sets) to train algorithms with and without gender selection. Findings: Accuracy is higher on female-authored reviews than on male-authored reviews for all data sets, so applications of sentiment analysis using mixed-gender data sets will overrepresent the opinions of women. Training on same-gender data improves performance less than having additional data from both genders. Practical implications: End users of sentiment analysis should be aware that its small gender biases can affect the conclusions drawn from it and apply correction factors when necessary. Users of systems that incorporate sentiment analysis should be aware that performance will vary by author gender. Developers do not need to create gender-specific algorithms unless they have more training data than their system can cope with. Originality/value: This is the first demonstration of gender bias in machine learning sentiment analysis.
