Evaluation of Sampling and Cross-Validation Tuning Strategies for Regional-Scale Machine Learning Classification

2019 ◽  
Vol 11 (2) ◽  
pp. 185 ◽  
Author(s):  
Christopher A. Ramezan ◽  
Timothy A. Warner ◽  
Aaron E. Maxwell

High spatial resolution (1–5 m) remotely sensed datasets are increasingly being used to map land cover over large geographic areas using supervised machine learning algorithms. Although many studies have compared machine learning classification methods, two topics are rarely investigated, particularly on large, high spatial resolution datasets: sample selection methods for acquiring training and validation data, and cross-validation techniques for tuning classifier parameters. This work, therefore, examines four sample selection methods (simple random, proportional stratified random, disproportional stratified random, and deliberative sampling) as well as three cross-validation tuning approaches (k-fold, leave-one-out, and Monte Carlo). In addition, the effect on accuracy of localizing sample selection to a small geographic subset of the entire area, an approach that is sometimes used to reduce costs associated with training data collection, is investigated. These methods are investigated in the context of support vector machine (SVM) classification and geographic object-based image analysis (GEOBIA), using high spatial resolution National Agriculture Imagery Program (NAIP) orthoimagery and LIDAR-derived rasters, covering a 2,609 km² regional-scale area in northeastern West Virginia, USA. Stratified-statistical-based sampling methods were found to generate the highest classification accuracy. Using a small number of training samples collected from only a subset of the study area provided a similar level of overall accuracy to a sample of equivalent size collected in a dispersed manner across the entire regional-scale dataset. There were minimal differences in accuracy among the different cross-validation tuning methods. The processing time for Monte Carlo and leave-one-out cross-validation was high, especially with large training sets. For this reason, k-fold cross-validation appears to be a good choice.
Classifications trained with samples collected deliberately (i.e., not randomly) were less accurate than classifiers trained from statistical-based samples. This may be due to the high positive spatial autocorrelation in the deliberative training set. Thus, if possible, samples for training should be selected randomly; deliberative samples should be avoided.
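
The three cross-validation schemes and two of the stratified allocation rules named in the abstract above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names, the class labels and counts, the 20% test fraction, and the use of equal allocation as the disproportional scheme are all assumptions made for the example.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal folds
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def leave_one_out_splits(n):
    """Yield n splits, each holding out a single sample."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


def monte_carlo_splits(n, n_repeats, test_fraction=0.2, seed=0):
    """Yield n_repeats independent random train/test partitions."""
    rng = random.Random(seed)
    n_test = max(1, round(n * test_fraction))
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]


def proportional_allocation(class_counts, n_samples):
    """Samples per class in proportion to class size (rounded, so the
    total can differ slightly from n_samples)."""
    total = sum(class_counts.values())
    return {c: round(n_samples * cnt / total) for c, cnt in class_counts.items()}


def equal_allocation(class_counts, n_samples):
    """One possible disproportional scheme: the same number per class."""
    per_class = n_samples // len(class_counts)
    return {c: per_class for c in class_counts}


# Hypothetical class sizes, for illustration only.
counts = {"forest": 800, "grass": 150, "water": 50}
print(proportional_allocation(counts, 100))
print(equal_allocation(counts, 99))
print(len(list(k_fold_splits(100, 10))), len(list(leave_one_out_splits(100))))
```

With n training samples, each candidate parameter setting requires k model fits under k-fold, n fits under leave-one-out, and as many fits as the chosen number of repeats under Monte Carlo, which is consistent with the higher processing times reported for the latter two on large training sets.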

2015 ◽  
Vol 8 (3) ◽  
pp. 3357-3397 ◽  
Author(s):  
D. J. Zawada ◽  
S. R. Dueck ◽  
L. A. Rieger ◽  
A. E. Bourassa ◽  
N. D. Lloyd ◽  
...  

Abstract. The OSIRIS instrument on board the Odin spacecraft has been measuring limb scattered radiance since 2001. The vertical radiance profiles measured as the instrument nods are inverted, with the aid of the SASKTRAN radiative transfer model, to obtain vertical profiles of trace atmospheric constituents. Here we describe two newly developed modes of the SASKTRAN radiative transfer model: a high spatial resolution mode, and a Monte Carlo mode. The high spatial resolution mode is a successive orders model capable of modelling the multiply scattered radiance when the atmosphere is not spherically symmetric; the Monte Carlo mode is intended for use as a highly accurate reference model. It is shown that the two models agree in a wide variety of solar conditions to within 0.2%. As an example case for both models, Odin-OSIRIS scans were simulated with the Monte Carlo model and retrieved using the high resolution model. A systematic bias of up to 4% in retrieved ozone number density between scans where the instrument is scanning up or scanning down was identified. It was found that calculating the multiply scattered diffuse field at five discrete solar zenith angles is sufficient to eliminate the bias for typical Odin-OSIRIS geometries.


NIR news ◽  
2017 ◽  
Vol 28 (5) ◽  
pp. 7-12 ◽  
Author(s):  
Te Ma ◽  
Tetsuya Inagaki ◽  
Satoru Tsuchikawa

Wood density and microfibril angle are strongly correlated with wood stiffness, shrinkage, and anisotropy. Understanding the spatial distribution of these values is critical for solid timber applications. In this study, near infrared (NIR) hyperspectral imaging was used to evaluate wood density and microfibril angle in a non-destructive, yet effective manner. Briefly, five wood samples collected from both normal and compression parts of two different Cryptomeria japonica trees were analyzed. Partial least squares regression analysis was performed to determine the relationship between X-ray reference data and NIR spectra, and leave-one-out cross-validation was used to check prediction performance. The validation coefficient of determination (r²) between densities predicted by the NIR technique and values measured by SilviScan (X-ray data) was 0.83, with a root mean squared error of cross-validation (RMSECV) of 105.18 kg/m³. For microfibril angle, r² and RMSECV were 0.77 and 5.36°, respectively. Finally, wood density and microfibril angle were successfully mapped at a high spatial resolution (156 µm) to facilitate the detection of annual growth ring features and the evaluation of heterogeneous wood quality.
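
The leave-one-out procedure and the two reported statistics (RMSECV and r²) can be illustrated with a small standard-library sketch. It makes several assumptions: a one-variable least-squares line stands in for the partial least squares model used in the study, r² is taken as the squared Pearson correlation between measured and predicted values, and the data are invented for the example.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_predictions(xs, ys):
    """Predict each sample from a model fitted to all the other samples."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds


def rmsecv(measured, predicted):
    """Root mean squared error of the cross-validated predictions."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)


def r_squared(measured, predicted):
    """Squared Pearson correlation between measured and predicted values."""
    n = len(measured)
    mean_m, mean_p = sum(measured) / n, sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_m * var_p)


# Invented predictor and reference values, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
preds = loo_predictions(xs, ys)
print(rmsecv(ys, preds), r_squared(ys, preds))
```

Because every prediction comes from a model that never saw the held-out sample, RMSECV is expressed in the units of the reference measurement (kg/m³ for density, degrees for microfibril angle) and is usually larger than the error of a model fitted to all samples.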


SinkrOn ◽  
2022 ◽  
Vol 7 (1) ◽  
pp. 59-65
Author(s):  
Artika Arista

Many people today are unsure whether they have COVID-19. Frequent fever, dry cough, and sore throat are all signs and symptoms of COVID-19. Anyone with signs or symptoms of coronavirus disease 2019 (COVID-19) should see a doctor or go to a clinic as soon as possible. Because COVID-19 can cause a wide range of symptoms, it is vital to learn and understand the fundamental differences. The experiments were carried out using two machine learning classification algorithms, Decision Tree (DT) and Logistic Regression (LR). Both algorithms were written and analyzed in Python using Jupyter Notebook 6.4.5. On the COVID-19 symptoms dataset, the DT model obtained the best average cross-validation and testing performance, with an accuracy of 98.0% in cross-validation and 98.0% in testing. The LR model obtained the second-best results, with an accuracy of 96.0% in cross-validation and 97.0% in testing. Consequently, DT outperforms LR on the COVID-19 symptoms dataset in both cross-validation and testing.


2021 ◽  
Vol 9 ◽  
Author(s):  
Yang Junting ◽  
Li Xiaosong ◽  
Wu Bo ◽  
Wu Junjun ◽  
Sun Bin ◽  
...  

Soil organic matter (SOM) content is an effective indicator of desertification; thus, monitoring its spatial-temporal changes on a large scale is important for combating desertification. However, mapping SOM content in desertified land is challenging owing to the heterogeneous landscape and the relatively low SOM content and vegetation coverage. Here, we modeled the SOM content in topsoil (0–20 cm) of desertified land in northern China by employing a high spatial resolution dataset and machine learning methods, with an emphasis on quarterly green and non-photosynthetic vegetation information, based on the Google Earth Engine (GEE). The results show: 1) the machine learning models performed better than the traditional multiple linear regression (MLR) model for SOM content estimation, and the Random Forest (RF) model was more accurate than the Support Vector Machine (SVM) model; 2) the quarterly information on green and non-photosynthetic vegetation was identified as a key set of covariates for estimating the SOM content in desertified land, and an obvious improvement could be observed after simultaneously combining the Dead Fuel Index (DFI) and Normalized Difference Vegetation Index (NDVI) of the four quarters (R² increased by 0.06, the root mean square error decreased by 0.05, the ratio of prediction deviation increased by 0.2, and the ratio of performance to interquartile distance increased by 0.5). In particular, the effects of the DFI in Q1 (the first quarter) and Q2 (the second quarter) on estimating low SOM content (<1%) were identified. Finally, a timely (2019) and high spatial resolution (30 m) SOM content map for the desertified land in northern China was drawn, which shows obvious advantages over existing SOM products, thus providing key data support for monitoring and combating desertification.
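
For readers unfamiliar with the evaluation terms in the abstract above, the sketch below shows how NDVI and the three error-based metrics are commonly defined. It is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the standard deviation is the sample standard deviation, and the quartiles come from Python's default `statistics.quantiles` method; published studies differ on both conventions.

```python
import math
import statistics


def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def rpd(observed, predicted):
    """Ratio of the observed standard deviation to the RMSE."""
    return statistics.stdev(observed) / rmse(observed, predicted)


def rpiq(observed, predicted):
    """Ratio of the observed interquartile range to the RMSE."""
    q1, _, q3 = statistics.quantiles(observed, n=4)
    return (q3 - q1) / rmse(observed, predicted)


# Invented SOM values (%), for illustration only.
observed = [0.4, 0.7, 0.9, 1.2, 1.8, 2.5]
predicted = [0.5, 0.6, 1.0, 1.1, 1.6, 2.2]
print(rmse(observed, predicted), rpd(observed, predicted), rpiq(observed, predicted))
```

Higher RPD and RPIQ values and a lower RMSE all indicate a better model, which is why the abstract reports increases in the first two alongside a decrease in the third.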


Author(s):  
G. Waldhoff ◽  
S. Eichfuss ◽  
G. Bareth

The classification of remote sensing data is a standard method to retrieve up-to-date land use data at various scales. However, land use analyses can be enriched significantly by incorporating additional data using geographical information systems (GIS). In this regard, the Multi-Data Approach (MDA) for integrating remote sensing classifications and official basic geodata at a regional scale, as well as the achievable results, are summarised. On this methodological basis, we investigate the enhancement of land use analyses at a very high spatial resolution by combining WorldView-2 remote sensing data and official cadastral data for Germany (the Automated Real Estate Map, ALK). Our first results show that manifold thematic information and an improved geometric delineation of land use classes can be gained even at a high spatial resolution.


2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Quan Zou ◽  
Jinjin Li ◽  
Qingqi Hong ◽  
Ziyu Lin ◽  
Yun Wu ◽  
...  

MicroRNAs constitute an important class of noncoding, single-stranded, ~22 nucleotide long RNA molecules encoded by endogenous genes. They play an important role in regulating gene transcription and normal development. MicroRNAs can be associated with disease; however, only a few microRNA-disease associations have been confirmed by traditional experimental approaches. We introduce two methods to predict microRNA-disease associations. The first method, KATZ, integrates a social network analysis method with machine learning and is based on networks derived from known microRNA-disease associations, disease-disease associations, and microRNA-microRNA associations. The other method, CATAPULT, is a supervised machine learning method. We applied the two methods to 242 known microRNA-disease associations and evaluated their performance using leave-one-out cross-validation and 3-fold cross-validation. Experiments showed that our methods outperformed state-of-the-art methods.
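
The Katz measure from social network analysis, on which the first method above is based, scores a pair of nodes by counting the walks between them and down-weighting longer walks. The sketch below is a simplified illustration under stated assumptions, not the paper's method: it uses a single small adjacency matrix rather than the combined microRNA-disease network, truncates the sum at a fixed walk length, and the graph, `beta`, and `max_len` values are invented.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def katz_scores(adj, beta=0.1, max_len=3):
    """Truncated Katz score: sum over l = 1..max_len of beta**l * (A**l)[i][j].

    (A**l)[i][j] counts the walks of length l from node i to node j, so
    longer walks contribute with geometrically smaller weight.
    """
    n = len(adj)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # holds A**l, starting with l = 1
    for length in range(1, max_len + 1):
        weight = beta ** length
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = mat_mul(power, adj)
    return scores


# Toy undirected path graph 0 - 1 - 2 - 3, for illustration only.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = katz_scores(adj)
print(scores[0][2], scores[0][3])
```

In this toy graph, nodes 0 and 2 are two steps apart and nodes 0 and 3 are three steps apart, so the pair (0, 2) receives the higher score. Ranking unlinked pairs by such scores is the basic idea behind using the measure for association prediction.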


2019 ◽  
Vol 60 (6) ◽  
pp. 818-824 ◽  
Author(s):  
Takuya Mizutani ◽  
Taiki Magome ◽  
Hiroshi Igaki ◽  
Akihiro Haga ◽  
Kanabu Nawa ◽  
...  

ABSTRACT The purpose of this study was to predict the survival time of patients with malignant glioma after radiotherapy with high accuracy by considering additional clinical factors, and to optimize the prescription dose and treatment duration for individual patients by using a machine learning model. A total of 35 patients with malignant glioma were included in this study. The candidate features included 12 clinical features and 192 dose–volume histogram (DVH) features. For each of three candidate feature sets (clinical, DVH, and both clinical and DVH features), the appropriate input features and parameters of the support vector machine (SVM) were selected using a genetic algorithm based on Akaike’s information criterion. The prediction accuracy of the SVM models was evaluated through a leave-one-out cross-validation test with residual error, which was defined as the absolute difference between the actual and predicted survival times after radiotherapy. Moreover, the influences of various values of prescription dose and treatment duration on the predicted survival time were evaluated. The prediction accuracy was significantly improved with the combined use of clinical and DVH features compared with the use of either feature set alone (P < 0.01, Wilcoxon signed rank test). The mean ± standard deviation of the leave-one-out cross-validation residual error using the combined clinical and DVH features, only clinical features, and only DVH features was 104.7 ± 96.5, 144.2 ± 126.1, and 204.5 ± 186.0 days, respectively. The prediction accuracy could be improved with the combination of clinical and DVH features, and our results show the potential to optimize the treatment strategy for individual patients based on a machine learning model.

