A Machine Learning and Cross-Validation Approach for the Discrimination of Vegetation Physiognomic Types Using Satellite Based Multispectral and Multitemporal Data

Scientifica ◽

10.1155/2017/9806479 ◽

2017 ◽

Vol 2017 ◽

pp. 1-8 ◽

Cited By ~ 7

Author(s):

Ram C. Sharma ◽

Keitarou Hara ◽

Hidetake Hirayama

Keyword(s):

Machine Learning ◽

Time Series ◽

Cross Validation ◽

Coniferous Forest ◽

Ground Truth ◽

Model Parameters ◽

Broadleaf Forest ◽

Ground Truth Data ◽

Supervised Classifiers ◽

Performance And Evaluation

This paper evaluates the performance of a number of machine learning classifiers for discriminating between vegetation physiognomic classes using satellite-based time series of surface reflectance data. The research addressed the discrimination of six vegetation physiognomic classes: Evergreen Coniferous Forest, Evergreen Broadleaf Forest, Deciduous Coniferous Forest, Deciduous Broadleaf Forest, Shrubs, and Herbs. Feature-rich data were prepared from the satellite time series for the discrimination and cross-validation of the vegetation physiognomic types using a machine learning approach. A set of machine learning experiments comprising a number of supervised classifiers with different model parameters was conducted to assess how the discrimination of vegetation physiognomic classes varies with classifiers, input features, and ground truth data size. The performance of each experiment was evaluated using 10-fold cross-validation. The experiment using the Random Forests classifier provided the highest overall accuracy (0.81) and kappa coefficient (0.78). However, accuracy metrics did not vary much across experiments; they were instead very sensitive to the input features and the size of the ground truth data. The results obtained in the research are expected to be useful for improving vegetation physiognomic mapping in Japan.
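The 10-fold cross-validation mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the shuffling seed, and the majority-class baseline used as a stand-in classifier are all assumptions made for illustration, and the abstract does not say whether the folds were stratified.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validated_accuracy(features, labels, fit, predict, k=10, seed=0):
    """Pool the held-out predictions of all folds into one overall accuracy."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(labels), k, seed):
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        for i in test_idx:
            correct += predict(model, features[i]) == labels[i]
    return correct / len(labels)


# Hypothetical stand-in classifier: always predicts the most common training label.
def fit_majority(train_features, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, feature_row):
    return model
```

Any classifier can be plugged in through the `fit` and `predict` callables, so the same folds can be reused to compare classifiers or feature sets on equal terms.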

Glean

Proceedings of the VLDB Endowment ◽

10.14778/3447689.3447703 ◽

2021 ◽

Vol 14 (6) ◽

pp. 997-1005

Author(s):

Sandeep Tata ◽

Navneet Potti ◽

James B. Wendt ◽

Lauro Beltrão Costa ◽

Marc Najork ◽

...

Keyword(s):

Machine Learning ◽

Data Management ◽

Real World ◽

Empirical Studies ◽

Ground Truth ◽

Training Data ◽

Ground Truth Data ◽

Document Type ◽

Machine Learning Model ◽

Structured Information

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely different ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean and discuss three key data management challenges: 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.

Prediction and Forecasting of Air Quality Index in Chennai using Regression and ARIMA time series models

Journal of Engineering Research ◽

10.36909/jer.10253 ◽

2021 ◽

Vol 9 ◽

Author(s):

Geetha Mani ◽

◽

Joshi Kumar Viswanadhapalli ◽

Albert Alexander Stonie ◽

◽

...

Keyword(s):

Machine Learning ◽

Time Series ◽

Air Quality ◽

Linear Regression ◽

Quality Index ◽

Air Quality Index ◽

Model Parameters ◽

Sensor Output ◽

Model Accuracy ◽

Life On Earth

Air is one of the most fundamental constituents for the sustenance of life on earth. Meteorological and traffic factors, consumption of non-renewable energy sources, and industrial parameters are steadily increasing air pollution. These factors affect the welfare and prosperity of life on earth; therefore, the air quality in our environment needs to be monitored continuously. The Air Quality Index (AQI), which indicates air quality, is influenced by several individual factors such as the accumulation of NO2, CO, O3, PM2.5, SO2, and PM10. This research paper aims to predict and forecast the AQI with Machine Learning (ML) techniques, namely linear regression and time series analysis. First, a Multi Linear Regression (MLR) model, a supervised machine learning method, is developed to predict AQI. NO2, ozone (O3), PM2.5, and SO2 sensor outputs collected from the Central Pollution Control Board (CPCB) for the Chennai region, India, serve as input features, and the optimized AQI calculated from the sensor outputs is set as the target to train the regression model. The obtained model parameters are validated with new and unseen sensor output. Key Performance Indices (KPI) such as the coefficient of determination, root mean square error, and mean absolute error were calculated to validate the model accuracy. The K-fold cross-validation score of the MLR on testing data was around 92%. Second, the Auto-Regressive Integrated Moving Average (ARIMA) time series model is applied to forecast the AQI. The obtained model parameters were validated with unseen, timestamped data. The forecasted AQI values for the next 15 days lie within the 95% confidence interval. The model accuracy on test data was more than 80%.
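The three performance indices named in this abstract (coefficient of determination, root mean square error, and mean absolute error) have standard definitions, sketched below. The function name and the sample values are illustrative assumptions, not data from the paper.

```python
import math


def regression_kpis(y_true, y_pred):
    """Return R^2, RMSE and MAE for paired observed and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares

    return {
        "r2": 1.0 - ss_res / ss_tot,   # assumes y_true is not constant
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Made-up AQI-like values, for illustration only.
kpis = regression_kpis([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

RMSE penalises large errors more heavily than MAE, which is one reason the two are often reported together.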

EXPLORING MACHINE LEARNING CLASSIFICATION ALGORITHMS FOR CROP CLASSIFICATION USING SENTINEL 2 DATA

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xlii-3-w6-573-2019 ◽

2019 ◽

Vol XLII-3/W6 ◽

pp. 573-578 ◽

Cited By ~ 3

Author(s):

◽

S. S. Ray

Keyword(s):

Machine Learning ◽

Satellite Data ◽

Classification Accuracy ◽

Ground Truth ◽

Kappa Coefficient ◽

Ground Truth Data ◽

Classification Techniques ◽

Machine Learning Classification ◽

Crop Classification ◽

Sentinel 2

Abstract. Crop classification and recognition is a very important application of remote sensing. In the last few years, machine learning classification techniques have been emerging for crop classification. Google Earth Engine (GEE) is a platform for exploring multiple satellite datasets with different advanced classification techniques without even downloading the satellite data. The main objective of this study is to explore the ability of different machine learning classification techniques, namely Random Forest (RF), Classification And Regression Trees (CART) and Support Vector Machine (SVM), for crop classification. High-resolution optical data from the Sentinel-2 MSI (10 m) were used for crop classification of major crops on the Indian Agricultural Research Institute (IARI) farm for the Rabi season 2016. Around 100 crop fields (~400 hectares) in IARI were analysed. Smartphone-based ground truth data were collected. The best cloud-free Sentinel-2 MSI image (5 Feb 2016), selected in GEE by automatic filtering on the percentage cloud cover property, was used for classification. Polygons based on the ground truth data were used as training data sets for crop classification using machine learning techniques. After classification, accuracy was assessed through the confusion matrix (producer and user accuracy), kappa coefficient and F value. The study found that, using GEE through the cloud platform, accessing, filtering and pre-processing of satellite data could be done very efficiently. In terms of overall classification accuracy and kappa coefficient, Random Forest (93.3%, 0.9178) and CART (73.4%, 0.6755) classifiers performed better than SVM (74.3%, 0.6867) classifier. For validation, data from the Field Operation Service Unit (FOSU) division of IARI were used and encouraging results were obtained.
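The accuracy measures listed in this abstract can all be read off a confusion matrix. The sketch below is illustrative only: the function name, the row/column convention (rows are reference classes, columns are predicted classes), and the example counts are assumptions and do not come from the study.

```python
def accuracy_report(matrix):
    """Summarise a square confusion matrix.

    Assumed layout: matrix[i][j] counts samples whose reference (ground truth)
    class is i and whose predicted class is j. Every class is assumed to occur
    at least once in both the reference and the predicted labels.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    diagonal = [matrix[i][i] for i in range(n_classes)]

    overall = sum(diagonal) / total
    # Producer's accuracy: share of each reference class that was mapped correctly.
    producer = [diagonal[i] / row_totals[i] for i in range(n_classes)]
    # User's accuracy: share of each predicted class that is actually correct.
    user = [diagonal[i] / col_totals[i] for i in range(n_classes)]
    # Cohen's kappa: agreement beyond what the marginal totals give by chance.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (overall - expected) / (1.0 - expected)

    return {"overall": overall, "producer": producer,
            "user": user, "kappa": kappa}


# Made-up two-class counts, for illustration only.
report = accuracy_report([[45, 5], [10, 40]])
```

Producer's accuracy corresponds to per-class recall and user's accuracy to per-class precision, so an F value can be derived from the same two lists.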

Integrating hierarchical statistical models and machine-learning algorithms for ground-truthing drone images of the vegetation: taxonomy, abundance and population ecological models

10.1101/491381 ◽

2018 ◽

Cited By ~ 1

Author(s):

Christian Damgaard

Keyword(s):

Machine Learning ◽

Statistical Models ◽

Learning Algorithms ◽

Plant Competition ◽

Image Data ◽

Ground Truth ◽

Ecological Models ◽

Machine Learning Algorithms ◽

Ground Truth Data ◽

Ground Truthing

In order to fit population ecological models, e.g. plant competition models, to new drone-aided image data, we need to develop statistical models that may take the new type of measurement uncertainty when applying machine-learning algorithms into account and quantify its importance for statistical inferences and ecological predictions. Here, it is proposed to quantify the uncertainty and bias of image predicted plant taxonomy and abundance in a hierarchical statistical model that is linked to ground-truth data obtained by the pin-point method. It is critical that the error rate in the species identification process is minimized when the image data are fitted to the population ecological models, and several avenues for reaching this objective are discussed. The outlined method to statistically model known sources of uncertainty when applying machine-learning algorithms may be relevant for other applied scientific disciplines.

CARTOGRAPHY OF MOROCCAN ARGAN TREE USING COMBINED OPTICAL AND SAR IMAGERY INTEGRATED WITH DIGITAL ELEVATION MODEL

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xlvi-4-w5-2021-211-2021 ◽

2021 ◽

Vol XLVI-4/W5-2021 ◽

pp. 211-217

Author(s):

E. Elmoussaoui ◽

A. Moumni ◽

A. Lahrouni

Keyword(s):

Machine Learning ◽

Time Series ◽

Digital Elevation Model ◽

Optical Data ◽

Support Vector ◽

Ground Truth Data ◽

Argan Tree ◽

Digital Elevation ◽

Elevation Model ◽

Sentinel 2

Abstract. Forest tree species mapping has become easier due to the global availability of high spatio-temporal resolution images acquired from multiple sensors. Such data can lead to better forest resources management. Machine-learning pixel-based analysis was applied to multi-spectral Sentinel-2 and Synthetic Aperture Radar Sentinel-1 time series integrated with a Digital Elevation Model acquired over the Argan forest of Essaouira province, Morocco. The argan tree constitutes a fundamental resource for the populations of this arid area of Morocco. This research aims to use the potential of the combination of multi-sensor data to detect, map and distinguish the argan tree from other forest species using three Machine Learning algorithms: Support Vector Machine (SVM), Maximum Likelihood (ML) and Artificial Neural Networks (ANN). The exploited datasets included Sentinel-1 (S1) and Sentinel-2 (S2) time series, a Shuttle Radar Topography Mission Digital Elevation Model (DEM) layer and ground truth data. We tested several sets of scenarios, including single S1 derived features, single S2 time series and combined S1 and S2 derived layers with DEM scene acquisition. The best results (overall accuracy OA and Kappa coefficient K) were obtained as follows: from time series of optical data (NDVI), OA = 86.87%, K = 0.84; from time series of SAR data (VV+VH/VV), OA = 45.90%, K = 0.36; from the combination of optical and SAR time series (NDVI+VH+DEM), OA = 93.01%, K = 0.914; and from the fusion of optical time series and DEM layer (NDVI+DEM), OA = 93.25%, K = 0.91. These results indicate that single-sensor (S2) data integrated with the DEM layer yielded the highest overall accuracy.

A Comparative Assessment of Ensemble-Based Machine Learning and Maximum Likelihood Methods for Mapping Seagrass Using Sentinel-2 Imagery in Tauranga Harbor, New Zealand

Remote Sensing ◽

10.3390/rs12030355 ◽

2020 ◽

Vol 12 (3) ◽

pp. 355 ◽

Cited By ~ 10

Author(s):

Nam Thang Ha ◽

Merilyn Manley-Harris ◽

Tien Dat Pham ◽

Ian Hawes

Keyword(s):

Machine Learning ◽

New Zealand ◽

Maximum Likelihood ◽

Ground Truth ◽

Machine Learning Techniques ◽

Ground Truth Data ◽

Seagrass Meadows ◽

Ensemble Machine Learning ◽

Novel Approach ◽

Sentinel 2

Seagrass has been acknowledged as a productive blue carbon ecosystem that is in significant decline across much of the world. A first step toward conservation is the mapping and monitoring of extant seagrass meadows. Several methods are currently in use, but mapping the resource from satellite images using machine learning is not widely applied, despite its successful use in various comparable applications. This research aimed to develop a novel approach for seagrass monitoring using state-of-the-art machine learning with data from Sentinel-2 imagery. We used Tauranga Harbor, New Zealand, as a validation site for which extensive ground truth data are available to compare ensemble machine learning methods involving random forests (RF), rotation forests (RoF), and canonical correlation forests (CCF) with the more traditional maximum likelihood classifier (MLC) technique. Using a group of validation metrics including F1, precision, recall, accuracy, and the McNemar test, our results indicated that machine learning techniques outperformed the MLC, with RoF as the best performer (F1 scores ranging from 0.75 to 0.91 for sparse and dense seagrass meadows, respectively). To our knowledge, our study is the first comparison of various ensemble-based methods for seagrass mapping, and it promises to be an effective approach to enhance the accuracy of seagrass monitoring.
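The McNemar test named among the validation metrics compares two classifiers on the same validation samples by looking only at the samples on which they disagree. The sketch below uses the continuity-corrected chi-square form; whether the authors used this variant is not stated in the abstract, and the function name and example labels are assumptions.

```python
import math


def mcnemar_test(reference, pred_a, pred_b):
    """Continuity-corrected McNemar test for two classifiers on paired samples.

    Returns (statistic, p_value). Only discordant pairs matter:
    b = A correct and B wrong, c = A wrong and B correct.
    """
    b = sum(1 for r, a, p in zip(reference, pred_a, pred_b) if a == r and p != r)
    c = sum(1 for r, a, p in zip(reference, pred_a, pred_b) if a != r and p == r)
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree on correctness
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up labels: classifier A is right on 15 samples where B is wrong,
# B is right on 5 where A is wrong, and both are right on the other 80.
reference = [1] * 100
pred_a = [1] * 15 + [0] * 5 + [1] * 80
pred_b = [0] * 15 + [1] * 5 + [1] * 80
statistic, p_value = mcnemar_test(reference, pred_a, pred_b)
```

Because the samples are paired, this test is better suited to comparing classifiers on one validation set than a test that treats the two accuracy figures as independent.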

On the potential and challenges of using machine-learning for automated quality control of environmental sensor data

10.5194/egusphere-egu2020-20777 ◽

2020 ◽

Author(s):

Lennart Schmidt ◽

Hannes Mollenhauer ◽

Corinna Rebmann ◽

David Schäfer ◽

Antje Claussnitzer ◽

...

Keyword(s):

Machine Learning ◽

Quality Control ◽

Ground Truth ◽

Sensor Data ◽

Small Scale ◽

Ground Truth Data ◽

Starting Point ◽

Environmental Sensor ◽

Spatio Temporal ◽

Automated Quality Control

With more and more data being gathered from environmental sensor networks, the importance of automated quality-control (QC) routines to provide usable data in near-real time is becoming increasingly apparent. Machine-learning (ML) algorithms exhibit a high potential in this respect as they are able to exploit the spatio-temporal relation of multiple sensors to identify anomalies while allowing for non-linear functional relations in the data. In this study, we evaluate the potential of ML for automated QC on two spatio-temporal datasets at different spatial scales: one is a dataset of atmospheric variables at 53 stations across Northern Germany; the second contains time series of soil moisture and temperature at 40 sensors at a small-scale measurement plot. Furthermore, we investigate strategies to tackle three challenges that are commonly present when applying ML for QC: 1) As sensors might drop out, the ML models have to be designed to be robust against missing values in the input data. We address this by comparing different data imputation methods, coupled with a binary representation of whether a value is missing or not. 2) Quality flags that mark erroneous data points to serve as ground truth for model training might not be available. 3) There is no guarantee that the system under study is stationary, which might render the outputs of a trained model useless in the future. To address 2) and 3), we frame the problem both as a supervised and as an unsupervised learning problem. Here, the use of unsupervised ML models can be beneficial as they do not require ground truth data and can thus be retrained more easily should the system be subject to significant changes. In this presentation, we discuss the performance, advantages and drawbacks of the proposed strategies to tackle the aforementioned challenges. Thus, we provide a starting point for researchers in the largely untouched field of ML application for automated quality control of environmental sensor data.
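The idea of pairing imputation with a binary missing-value representation can be sketched as follows. Column-mean imputation is used here only as a simple stand-in; the abstract says several imputation methods were compared but does not name them, and the function name and sample readings are assumptions.

```python
def impute_with_indicators(rows):
    """Fill missing readings with the column mean and append 0/1 missing flags.

    `rows` is a list of equal-length lists in which None marks a missing value.
    Each output row is the imputed readings followed by one flag per column
    (1 = value was missing). A column with no observed values is filled with
    0.0, an arbitrary choice made for this sketch.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)

    result = []
    for row in rows:
        values = [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        flags = [1 if row[j] is None else 0 for j in range(n_cols)]
        result.append(values + flags)
    return result


# Made-up readings from two sensors at three time steps.
readings = [[1.0, None], [3.0, 4.0], [None, 8.0]]
augmented = impute_with_indicators(readings)
```

Keeping the flags as extra input features lets a downstream model learn that an imputed value is less trustworthy than a measured one, rather than treating both the same.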

Study on the Effectiveness of the Investment Strategy Based on a Classifier with Rules Adapted by Machine Learning

ISRN Artificial Intelligence ◽

10.1155/2014/451849 ◽

2014 ◽

Vol 2014 ◽

pp. 1-10 ◽

Cited By ~ 4

Author(s):

A. Wiliński ◽

A. Bera ◽

W. Nowicki ◽

P. Błaszyński

Keyword(s):

Machine Learning ◽

Time Series ◽

Parameter Space ◽

Cross Validation ◽

Investment Decision ◽

Investment Strategy ◽

Time Varying ◽

Time Varying Parameters ◽

Rule Set ◽

The Relationship

This paper examines two transactional strategies based on a classifier which opens positions using some rules and closes them using different rules. A rule set contains time-varying parameters that, when matched, allow an investment decision to be made. The research includes a study of the variability of these parameters and of the relationship between the learning period and the testing period (which uses the learned parameters). The strategies are evaluated based on the time series of cumulative profit achieved in the test periods. The study was conducted on the most popular currency pair, EURUSD (Euro-Dollar), sampled at an interval of 1 hour. An important contribution of the presented research to the theory of algotrading is the specification of a parameter space (quite large, consisting of 11 parameters) that achieves very good results using cross validation.

Integrating Hierarchical Statistical Models and Machine-Learning Algorithms for Ground-Truthing Drone Images of the Vegetation: Taxonomy, Abundance and Population Ecological Models

Remote Sensing ◽

10.3390/rs13061161 ◽

2021 ◽

Vol 13 (6) ◽

pp. 1161

Author(s):

Christian Damgaard

Keyword(s):

Machine Learning ◽

Statistical Models ◽

Learning Algorithms ◽

Plant Competition ◽

Image Data ◽

Ground Truth ◽

Ecological Models ◽

Machine Learning Algorithms ◽

Ground Truth Data ◽

Ground Truthing

In order to fit population ecological models, e.g., plant competition models, to new drone-aided image data, we need to develop statistical models that may take the new type of measurement uncertainty when applying machine-learning algorithms into account and quantify its importance for statistical inferences and ecological predictions. Here, it is proposed to quantify the uncertainty and bias of image predicted plant taxonomy and abundance in a hierarchical statistical model that is linked to ground-truth data obtained by the pin-point method. It is critical that the error rate in the species identification process is minimized when the image data are fitted to the population ecological models, and several avenues for reaching this objective are discussed. The outlined method to statistically model known sources of uncertainty when applying machine-learning algorithms may be relevant for other applied scientific disciplines.

Collective annotation patterns in learning from crowds

Intelligent Data Analysis ◽

10.3233/ida-200009 ◽

2020 ◽

Vol 24 ◽

pp. 63-86

Author(s):

Francisco Mena ◽

Ricardo Ñanculef ◽

Carlos Valle

Keyword(s):

Machine Learning ◽

Large Scale ◽

Ground Truth ◽

Experimental Results ◽

Ground Truth Data ◽

Satisfactory Performance ◽

Machine Learning Applications ◽

Data Points ◽

Confusion Matrices

The lack of annotated data is one of the major barriers facing machine learning applications today. Learning from crowds, i.e. collecting ground-truth data from multiple inexpensive annotators, has become a common method to cope with this issue. It has been recently shown that modeling the varying quality of the annotations obtained in this way is fundamental to obtaining satisfactory performance in tasks where inexpert annotators may represent the majority but not the most trusted group. Unfortunately, existing techniques represent annotation patterns for each annotator individually, making the models difficult to estimate in large-scale scenarios. In this paper, we present two models to address these problems. Both methods are based on the hypothesis that it is possible to learn collective annotation patterns by introducing confusion matrices that involve groups of data point annotations or annotators. The first approach clusters data points with a common annotation pattern, regardless of the annotators from which the labels have been obtained. Implicitly, this method attributes annotation mistakes to the complexity of the data itself and not to the variable behavior of the annotators. The second approach explicitly maps annotators to latent groups that are collectively parametrized to learn a common annotation pattern. Our experimental results show that, compared with other methods for learning from crowds, both methods have advantages in scenarios with a large number of annotators and a small number of annotations per annotator.
