Evaluation of Sampling and Cross-Validation Tuning Strategies for Regional-Scale Machine Learning Classification

Christopher A. Ramezan; Timothy A. Warner; Aaron E. Maxwell

doi:10.3390/rs11020185

Remote Sensing ◽

10.3390/rs11020185 ◽

2019 ◽

Vol 11 (2) ◽

pp. 185 ◽

Cited By ~ 20

Author(s):

Christopher A. Ramezan ◽

Timothy A. Warner ◽

Aaron E. Maxwell

Keyword(s):

Machine Learning ◽

Monte Carlo ◽

Spatial Resolution ◽

Cross Validation ◽

High Spatial Resolution ◽

Regional Scale ◽

Sample Selection ◽

Selection Methods ◽

Machine Learning Classification ◽

Leave One Out

High spatial resolution (1–5 m) remotely sensed datasets are increasingly being used to map land cover over large geographic areas using supervised machine learning algorithms. Although many studies have compared machine learning classification methods, sample selection methods for acquiring training and validation data and cross-validation techniques for tuning classifier parameters are rarely investigated, particularly on large, high spatial resolution datasets. This work therefore examines four sample selection methods (simple random, proportional stratified random, disproportional stratified random, and deliberative sampling) as well as three cross-validation tuning approaches (k-fold, leave-one-out, and Monte Carlo methods). In addition, the effect on accuracy of localizing sample selection to a small geographic subset of the entire area, an approach that is sometimes used to reduce the cost of training data collection, is investigated. These methods are investigated in the context of support vector machine (SVM) classification and geographic object-based image analysis (GEOBIA), using high spatial resolution National Agriculture Imagery Program (NAIP) orthoimagery and LIDAR-derived rasters covering a 2,609 km² regional-scale area in northeastern West Virginia, USA. Stratified statistical-based sampling methods were found to generate the highest classification accuracy. Using a small number of training samples collected from only a subset of the study area provided a similar level of overall accuracy to a sample of equivalent size collected in a dispersed manner across the entire regional-scale dataset. There were minimal differences in accuracy among the different cross-validation tuning methods. The processing times for Monte Carlo and leave-one-out cross-validation were high, especially with large training sets. For this reason, k-fold cross-validation appears to be a good choice. Classifications trained with samples collected deliberately (i.e., not randomly) were less accurate than those trained with statistical-based samples. This may be due to the high positive spatial autocorrelation in the deliberative training set. Thus, if possible, samples for training should be selected randomly; deliberative samples should be avoided.
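
As a rough illustration of why the three tuning approaches named in this abstract differ in processing time, the following Python sketch generates train/test index splits for k-fold, leave-one-out, and Monte Carlo cross-validation and counts the model fits each one implies. The function names, the sample size of 40, the 10 folds, the 100 Monte Carlo repeats, and the 25% hold-out fraction are illustrative assumptions, not values taken from the study.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Shuffle indices 0..n-1 into k folds; each fold is the test set once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]


def leave_one_out_splits(n):
    """Hold out each sample in turn, so there are n splits."""
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]


def monte_carlo_splits(n, n_repeats, test_fraction, seed=0):
    """Draw repeated random train/test partitions (test sets may overlap)."""
    rng = random.Random(seed)
    n_test = max(1, round(n * test_fraction))
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        splits.append((sorted(idx[n_test:]), sorted(idx[:n_test])))
    return splits


# Hypothetical sizes, chosen only for illustration.
n = 40
kf = k_fold_splits(n, k=10)
loo = leave_one_out_splits(n)
mc = monte_carlo_splits(n, n_repeats=100, test_fraction=0.25)

for name, splits in (("k-fold", kf), ("leave-one-out", loo), ("Monte Carlo", mc)):
    print(f"{name}: {len(splits)} model fits per candidate parameter value")
```

Leave-one-out needs one fit per training sample, so its cost grows with the size of the training set, and Monte Carlo cost grows with the chosen number of repeats, whereas k-fold stays at k fits regardless of sample size.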

Using machine learning to examine street green space types at a high spatial resolution: application in Los Angeles County on socioeconomic disparities in exposure

The Science of The Total Environment ◽

10.1016/j.scitotenv.2021.147653 ◽

2021 ◽

pp. 147653

Author(s):

Yi Sun ◽

Xingzhi Wang ◽

Jiayin Zhu ◽

Liangjian Chen ◽

Yuhang Jia ◽

...

Keyword(s):

Machine Learning ◽

Los Angeles ◽

Spatial Resolution ◽

Los Angeles County ◽

Green Space ◽

High Spatial Resolution ◽

Socioeconomic Disparities

Plant drought impact detection using ultra-high spatial resolution hyperspectral images and machine learning

International Journal of Applied Earth Observation and Geoinformation ◽

10.1016/j.jag.2021.102364 ◽

2021 ◽

Vol 102 ◽

pp. 102364

Author(s):

Phuong D. Dao ◽

Yuhong He ◽

Cameron Proctor

Keyword(s):

Machine Learning ◽

Spatial Resolution ◽

High Spatial Resolution ◽

Hyperspectral Images ◽

Drought Impact ◽

Impact Detection

High resolution and Monte Carlo additions to the SASKTRAN radiative transfer model

Atmospheric Measurement Techniques Discussions ◽

10.5194/amtd-8-3357-2015 ◽

2015 ◽

Vol 8 (3) ◽

pp. 3357-3397 ◽

Cited By ~ 1

Author(s):

D. J. Zawada ◽

S. R. Dueck ◽

L. A. Rieger ◽

A. E. Bourassa ◽

N. D. Lloyd ◽

...

Keyword(s):

Monte Carlo ◽

High Resolution ◽

Radiative Transfer ◽

Spatial Resolution ◽

High Spatial Resolution ◽

Reference Model ◽

Monte Carlo Model ◽

Radiative Transfer Model ◽

Transfer Model ◽

Systematic Bias

The OSIRIS instrument on board the Odin spacecraft has been measuring limb-scattered radiance since 2001. The vertical radiance profiles measured as the instrument nods are inverted, with the aid of the SASKTRAN radiative transfer model, to obtain vertical profiles of trace atmospheric constituents. Here we describe two newly developed modes of the SASKTRAN radiative transfer model: a high spatial resolution mode and a Monte Carlo mode. The high spatial resolution mode is a successive orders model capable of modelling the multiply scattered radiance when the atmosphere is not spherically symmetric; the Monte Carlo mode is intended for use as a highly accurate reference model. It is shown that the two models agree to within 0.2% in a wide variety of solar conditions. As an example case for both models, Odin-OSIRIS scans were simulated with the Monte Carlo model and retrieved using the high resolution model. A systematic bias of up to 4% in retrieved ozone number density was identified between scans where the instrument is scanning up and scans where it is scanning down. It was found that calculating the multiply scattered diffuse field at five discrete solar zenith angles is sufficient to eliminate the bias for typical Odin-OSIRIS geometries.

High-spatial-resolution Monte Carlo simulations of small-animal x-ray fluorescence tomography

Medical Imaging 2020: Physics of Medical Imaging ◽

10.1117/12.2549566 ◽

2020 ◽

Author(s):

Kian Shaker ◽

Jakob C. Larsson ◽

Hans M. Hertz

Keyword(s):

Monte Carlo ◽

Monte Carlo Simulations ◽

Spatial Resolution ◽

High Spatial Resolution ◽

Small Animal ◽

Fluorescence Tomography ◽

X Ray

High spatial resolution and non-destructive evaluation of wood density and microfibril angle by NIR hyperspectral imaging

NIR news ◽

10.1177/0960336017703259 ◽

2017 ◽

Vol 28 (5) ◽

pp. 7-12 ◽

Cited By ~ 1

Author(s):

Te Ma ◽

Tetsuya Inagaki ◽

Satoru Tsuchikawa

Keyword(s):

Hyperspectral Imaging ◽

Wood Density ◽

Spatial Resolution ◽

Cross Validation ◽

High Spatial Resolution ◽

Microfibril Angle ◽

Coefficient Of Determination ◽

X Ray ◽

Effective Manner ◽

Non Destructive

Wood density and microfibril angle are strongly correlated with wood stiffness, shrinkage, and anisotropy. Understanding the spatial distribution of these values is critical for solid timber applications. In this study, near infrared (NIR) hyperspectral imaging was used to evaluate wood density and microfibril angle in a non-destructive yet effective manner. Briefly, five wood samples collected from both normal and compression parts of two different Cryptomeria japonica trees were analyzed. Partial least squares regression analysis was performed to determine the relationship between X-ray reference data and NIR spectra, and leave-one-out cross-validation was used to check prediction performance. The validation coefficient of determination (r²) between densities predicted by the NIR technique and values measured by SilviScan (X-ray data) was 0.83, with a root mean squared error of cross-validation (RMSECV) of 105.18 kg/m³. For microfibril angle, r² and RMSECV were 0.77 and 5.36°, respectively. Finally, wood density and microfibril angle were successfully mapped at a high spatial resolution (156 µm) to facilitate the detection of annual growth ring features and the evaluation of aspects of heterogeneous wood quality.
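
The leave-one-out validation described in this abstract can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: a one-predictor least-squares line stands in for the partial least squares regression used in the study, r² is taken to be the squared Pearson correlation between measured and predicted values, and the helper names and the small data set are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_predictions(xs, ys):
    """Predict each sample from a model fitted on all the other samples."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds


def rmsecv(ys, preds):
    """Root mean squared error of the cross-validated predictions."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))


def r_squared(ys, preds):
    """Squared Pearson correlation between measured and predicted values."""
    n = len(ys)
    mean_y, mean_p = sum(ys) / n, sum(preds) / n
    cov = sum((y - mean_y) * (p - mean_p) for y, p in zip(ys, preds))
    var_y = sum((y - mean_y) ** 2 for y in ys)
    var_p = sum((p - mean_p) ** 2 for p in preds)
    return cov * cov / (var_y * var_p)


# Hypothetical data: one spectral feature vs. reference density (kg/m^3).
feature = [0.21, 0.25, 0.30, 0.34, 0.41, 0.47, 0.52, 0.60]
density = [310.0, 342.0, 395.0, 401.0, 470.0, 498.0, 560.0, 610.0]

cv_pred = loo_predictions(feature, density)
print(f"r2 = {r_squared(density, cv_pred):.3f}")
print(f"RMSECV = {rmsecv(density, cv_pred):.2f} kg/m^3")
```

Because every prediction comes from a model that never saw the held-out sample, RMSECV is usually larger than the error measured on the fitting data.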

Comparison Decision Tree and Logistic Regression Machine Learning Classification Algorithms to determine Covid-19

SinkrOn ◽

10.33395/sinkron.v7i1.11243 ◽

2022 ◽

Vol 7 (1) ◽

pp. 59-65

Author(s):

Artika Arista

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Decision Tree ◽

Cross Validation ◽

Performance Testing ◽

Signs And Symptoms ◽

Classification Algorithms ◽

Machine Learning Classification ◽

Wide Range ◽

Testing Performance

Many people today are unsure whether they have COVID-19. Frequent fever, dry cough, and sore throat are all signs and symptoms of COVID-19, which can cause a wide range of symptoms. A person who has signs or symptoms of coronavirus disease 2019 (COVID-19) should see a doctor or go to a clinic as soon as possible. As a result, it is vital to learn and understand the fundamental differences. The experiments were carried out using two machine learning classification algorithms, namely Decision Tree (DT) and Logistic Regression (LR). Both algorithms were written and analyzed using Python in Jupyter Notebook 6.4.5. In the experiments on the COVID-19 symptoms dataset, the DT model obtained the best average cross-validation and testing performance, achieving an accuracy of 98.0% for cross-validation and 98.0% for performance testing. The LR model obtained the second-best results, achieving an accuracy of 96.0% for cross-validation and 97.0% for performance testing. Consequently, the DT outperforms the LR on the COVID-19 symptoms dataset for both cross-validation and testing results.

High Spatial Resolution Topsoil Organic Matter Content Mapping Across Desertified Land in Northern China

Frontiers in Environmental Science ◽

10.3389/fenvs.2021.668912 ◽

2021 ◽

Vol 9 ◽

Author(s):

Yang Junting ◽

Li Xiaosong ◽

Wu Bo ◽

Wu Junjun ◽

Sun Bin ◽

...

Keyword(s):

Machine Learning ◽

Organic Matter ◽

Spatial Resolution ◽

High Spatial Resolution ◽

Organic Matter Content ◽

Multiple Linear Regression Model ◽

Northern China ◽

Google Earth ◽

Support Vector ◽

Combating Desertification

Soil organic matter (SOM) content is an effective indicator of desertification; thus, monitoring its spatiotemporal changes on a large scale is important for combating desertification. However, mapping SOM content in desertified land is challenging owing to the heterogeneous landscape and the relatively low SOM content and vegetation coverage. Here, we modeled the SOM content in topsoil (0–20 cm) of desertified land in northern China by employing a high spatial resolution dataset and machine learning methods, with an emphasis on quarterly green and non-photosynthetic vegetation information, based on the Google Earth Engine (GEE). The results show that: 1) the machine learning models performed better than the traditional multiple linear regression (MLR) model for SOM content estimation, and the Random Forest (RF) model was more accurate than the Support Vector Machine (SVM) model; 2) the quarterly information regarding green and non-photosynthetic vegetation was identified as a key covariate for estimating the SOM content in desertified land, and an obvious improvement could be observed after simultaneously combining the Dead Fuel Index (DFI) and Normalized Difference Vegetation Index (NDVI) of the four quarters (R² increased by 0.06, the root mean square error decreased by 0.05, the ratio of prediction deviation increased by 0.2, and the ratio of performance to interquartile distance increased by 0.5). In particular, the effects of the DFI in the first and second quarters (Q1 and Q2) on estimating low SOM content (<1%) were identified. Finally, a timely (2019) and high spatial resolution (30 m) SOM content map for the desertified land in northern China was produced, which shows obvious advantages over existing SOM products and thus provides key data support for monitoring and combating desertification.
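
The four accuracy measures quoted in this abstract can be computed as in the Python sketch below. It assumes the commonly used definitions, which the abstract itself does not spell out: R² as one minus the ratio of residual to total sum of squares, the ratio of prediction deviation (RPD) as the sample standard deviation of the observations divided by the RMSE, and the ratio of performance to interquartile distance (RPIQ) as the interquartile range of the observations divided by the RMSE. The function names and the observed and predicted values are hypothetical.

```python
import math
import statistics


def rmse(obs, pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r2(obs, pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def rpd(obs, pred):
    """Assumed definition: sample standard deviation of observations / RMSE."""
    return statistics.stdev(obs) / rmse(obs, pred)


def rpiq(obs, pred):
    """Assumed definition: interquartile range of observations / RMSE."""
    q1, _, q3 = statistics.quantiles(obs, n=4)
    return (q3 - q1) / rmse(obs, pred)


# Hypothetical SOM contents (%) and model predictions, for illustration only.
observed = [0.42, 0.55, 0.61, 0.78, 0.90, 1.05, 1.31, 1.60]
predicted = [0.50, 0.49, 0.70, 0.74, 0.98, 0.97, 1.20, 1.48]

print(f"R2   = {r2(observed, predicted):.3f}")
print(f"RMSE = {rmse(observed, predicted):.3f}")
print(f"RPD  = {rpd(observed, predicted):.2f}")
print(f"RPIQ = {rpiq(observed, predicted):.2f}")
```

Quartile conventions differ between software packages, so RPIQ values computed this way may not match other tools exactly; `statistics.quantiles` uses its default exclusive method here.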

INTEGRATION OF REMOTE SENSING DATA AND BASIC GEODATA AT DIFFERENT SCALE LEVELS FOR IMPROVED LAND USE ANALYSES

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprsarchives-xl-3-w3-85-2015 ◽

2015 ◽

Vol XL-3/W3 ◽

pp. 85-89 ◽

Cited By ~ 2

Author(s):

G. Waldhoff ◽

S. Eichfuss ◽

G. Bareth

Keyword(s):

Remote Sensing ◽

Land Use ◽

Spatial Resolution ◽

Geographical Information Systems ◽

High Spatial Resolution ◽

Regional Scale ◽

Remote Sensing Data ◽

Geographical Information ◽

Sensing Data ◽

Very High Spatial Resolution

The classification of remote sensing data is a standard method to retrieve up-to-date land use data at various scales. However, through the incorporation of additional data using geographical information systems (GIS) land use analyses can be enriched significantly. In this regard, the Multi-Data Approach (MDA) for the integration of remote sensing classifications and official basic geodata for a regional scale as well as the achievable results are summarised. On this methodological basis, we investigate the enhancement of land use analyses at a very high spatial resolution by combining WorldView-2 remote sensing data and official cadastral data for Germany (the Automated Real Estate Map, ALK). Our first results show that manifold thematic information and the improved geometric delineation of land use classes can be gained even at a high spatial resolution.

Prediction of MicroRNA-Disease Associations Based on Social Network Analysis Methods

BioMed Research International ◽

10.1155/2015/810514 ◽

2015 ◽

Vol 2015 ◽

pp. 1-9 ◽

Cited By ~ 69

Author(s):

Quan Zou ◽

Jinjin Li ◽

Qingqi Hong ◽

Ziyu Lin ◽

Yun Wu ◽

...

Keyword(s):

Machine Learning ◽

Social Network ◽

Social Network Analysis ◽

Network Analysis ◽

Cross Validation ◽

Supervised Machine Learning ◽

Rna Molecules ◽

Disease Associations ◽

Endogenous Genes ◽

Leave One Out

MicroRNAs constitute an important class of noncoding, single-stranded, ~22 nucleotide long RNA molecules encoded by endogenous genes. They play an important role in regulating gene transcription and normal development. MicroRNAs can be associated with disease; however, only a few microRNA-disease associations have been confirmed by traditional experimental approaches. We introduce two methods to predict microRNA-disease associations. The first method, KATZ, integrates social network analysis with machine learning and is based on networks derived from known microRNA-disease associations, disease-disease associations, and microRNA-microRNA associations. The other method, CATAPULT, is a supervised machine learning method. We applied the two methods to 242 known microRNA-disease associations and evaluated their performance using leave-one-out cross-validation and 3-fold cross-validation. Experiments showed that our methods outperformed the state-of-the-art methods.
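
The abstract does not give the formulation of its KATZ method, so the Python sketch below shows only the generic Katz measure from social network analysis, which scores a pair of nodes by summing the walks between them with a damping factor per step: score(i, j) = sum over k of beta^k * (A^k)[i][j]. The four-node network, the damping factor of 0.5, the walk-length limit of 3, and all names are hypothetical choices for illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def katz_scores(adj, beta, max_len):
    """Truncated Katz measure: sum_{k=1..max_len} beta**k * (adj**k)."""
    n = len(adj)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # holds adj**k, starting at k = 1
    for k in range(1, max_len + 1):
        for i in range(n):
            for j in range(n):
                scores[i][j] += (beta ** k) * power[i][j]
        power = mat_mul(power, adj)
    return scores


# Hypothetical toy network: two microRNAs and two diseases.
nodes = ["miRNA_1", "miRNA_2", "disease_A", "disease_B"]
adjacency = [
    [0, 0, 1, 0],  # miRNA_1 is linked to disease_A
    [0, 0, 1, 1],  # miRNA_2 is linked to disease_A and disease_B
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]

scores = katz_scores(adjacency, beta=0.5, max_len=3)
print(f"{nodes[0]} - {nodes[3]} score: {scores[0][3]}")
```

Here miRNA_1 and disease_B have no direct link, but the walk through disease_A and miRNA_2 gives the pair a nonzero score, which is the kind of indirect evidence a network-based predictor can rank.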

Optimization of treatment strategy by using a machine learning model to predict survival time of patients with malignant glioma after radiotherapy

Journal of Radiation Research ◽

10.1093/jrr/rrz066 ◽

2019 ◽

Vol 60 (6) ◽

pp. 818-824 ◽

Cited By ~ 2

Author(s):

Takuya Mizutani ◽

Taiki Magome ◽

Hiroshi Igaki ◽

Akihiro Haga ◽

Kanabu Nawa ◽

...

Keyword(s):

Machine Learning ◽

Malignant Glioma ◽

Survival Time ◽

Treatment Duration ◽

Prediction Accuracy ◽

Cross Validation ◽

Learning Model ◽

Machine Learning Model ◽

Prescription Dose ◽

Leave One Out

The purpose of this study was to predict, with high accuracy, the survival time of patients with malignant glioma after radiotherapy by considering additional clinical factors, and to optimize the prescription dose and treatment duration for individual patients by using a machine learning model. A total of 35 patients with malignant glioma were included in this study. The candidate features included 12 clinical features and 192 dose–volume histogram (DVH) features. For each of three feature sets (clinical only, DVH only, and both clinical and DVH features), the appropriate input features and parameters of the support vector machine (SVM) were selected using a genetic algorithm based on Akaike's information criterion. The prediction accuracy of the SVM models was evaluated through a leave-one-out cross-validation test with residual error, which was defined as the absolute difference between the actual and predicted survival times after radiotherapy. Moreover, the influences of various values of prescription dose and treatment duration on the predicted survival time were evaluated. The prediction accuracy was significantly improved with the combined use of clinical and DVH features compared with the use of either feature set alone (P < 0.01, Wilcoxon signed rank test). The mean ± standard deviation of the leave-one-out cross-validation residual error using the combined clinical and DVH features, only clinical features, and only DVH features was 104.7 ± 96.5, 144.2 ± 126.1 and 204.5 ± 186.0 days, respectively. The prediction accuracy could be improved with the combination of clinical and DVH features, and our results show the potential to optimize the treatment strategy for individual patients based on a machine learning model.
