Efficiency of Extreme Gradient Boosting for Imbalanced Land Cover Classification Using an Extended Margin and Disagreement Performance

Fei Sun; Run Wang; Bo Wan; Yanjun Su; Qinghua Guo; Youxin Huang; Xincai Wu

doi:10.3390/ijgi8070315

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi8070315 ◽

2019 ◽

Vol 8 (7) ◽

pp. 315 ◽

Cited By ~ 1

Author(s):

Fei Sun ◽

Run Wang ◽

Bo Wan ◽

Yanjun Su ◽

Qinghua Guo ◽

...

Keyword(s):

Land Cover ◽

Error Component ◽

Training Data ◽

Gradient Boosting ◽

Correct Classification ◽

Imbalanced Learning ◽

Minority Class ◽

Extreme Gradient Boosting ◽

Spectral Separability ◽

The Impact

Imbalanced learning is a methodological challenge in the remote sensing community, especially in complex areas where land covers are spectrally similar. Obtaining high-confidence classification results under class imbalance is highly important in practice. In this paper, extreme gradient boosting (XGB), a novel tree-based ensemble system, is employed to classify land cover types in very-high-resolution (VHR) images with imbalanced training data. We introduce an extended margin criterion and disagreement performance to evaluate the efficiency of XGB in imbalanced learning situations and examine the effect of minority class spectral separability on model performance. The results suggest that the uncertainty of XGB associated with correct classification is stable. The average probability-based margin of correct classification provided by XGB is 0.82, which is about 46.30% higher than that of the random forest (RF) method (0.56). Moreover, the performance uncertainty of XGB is insensitive to spectral separability once the sample imbalance reaches a certain level (minority:majority > 10:100). The impact of sample imbalance on the minority class is also related to its spectral separability, and XGB performs better than RF in terms of user accuracy for a minority class with imperfect separability. The disagreement components of XGB are better and more stable than those of RF with imbalanced samples, especially for complex areas with more land cover types. In addition, an appropriate degree of sample imbalance helps to improve the trade-off between the recognition accuracy of XGB and the sample cost. According to our analysis, the margin-based uncertainty assessment and disagreement performance can help users identify the confidence level and error components when classification performance (overall, producer, and user accuracies) is otherwise similar.
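The abstract does not spell out the extended margin criterion, so the following is only a minimal sketch of a generic probability-based margin, assuming the common definition "probability of the true class minus the largest probability of any other class". The function names and the toy probabilities are hypothetical and are not taken from the paper.

```python
def probability_margin(class_probs, true_class):
    """Margin of one sample: P(true class) minus the largest other-class probability.

    Assumed definition (not confirmed by the abstract). The value lies in [-1, 1]
    and is positive exactly when the true class has the highest probability.
    """
    p_true = class_probs[true_class]
    p_best_other = max(p for k, p in enumerate(class_probs) if k != true_class)
    return p_true - p_best_other


def mean_margin_of_correct(prob_rows, true_classes):
    """Average margin over the correctly classified samples (margin > 0)."""
    margins = [probability_margin(row, t) for row, t in zip(prob_rows, true_classes)]
    correct = [m for m in margins if m > 0]
    if not correct:
        raise ValueError("no correctly classified samples")
    return sum(correct) / len(correct)


# Toy class-probability vectors for three samples (illustrative values only).
probs = [[0.90, 0.05, 0.05], [0.20, 0.70, 0.10], [0.60, 0.30, 0.10]]
labels = [0, 1, 1]
avg_margin = mean_margin_of_correct(probs, labels)
```

A larger average margin means the classifier is more confident in its correct decisions, which is the sense in which the abstract compares 0.82 for XGB with 0.56 for RF.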

An Impartial Semi-Supervised Learning Strategy for Imbalanced Classification on VHR Images

Sensors ◽

10.3390/s20226699 ◽

2020 ◽

Vol 20 (22) ◽

pp. 6699

Author(s):

Fei Sun ◽

Fang Fang ◽

Run Wang ◽

Bo Wan ◽

Qinghua Guo ◽

...

Keyword(s):

Remote Sensing ◽

Supervised Learning ◽

Learning Strategy ◽

Gradient Boosting ◽

Support Vector ◽

Imbalanced Learning ◽

Learning Methods ◽

Minority Class ◽

Imbalanced Classification ◽

Extreme Gradient Boosting

Imbalanced learning is a common problem in land-use and land-cover classification based on remote sensing imagery. It can reduce classification accuracy and even cause the minority class to be omitted. In this paper, an impartial semi-supervised learning strategy based on extreme gradient boosting (ISS-XGB) is proposed to classify very high resolution (VHR) images with imbalanced data. ISS-XGB solves multi-class classification by using several semi-supervised classifiers. It first employs multi-group unlabeled data to eliminate the imbalance of training samples and then utilizes gradient boosting-based regression to simulate the target classes with positive and unlabeled samples. In this study, experiments were conducted on eight study areas with different imbalanced situations. The results showed that ISS-XGB provided a comparable but more stable performance than most commonly used classification approaches (i.e., random forest (RF), XGB, multilayer perceptron (MLP), and support vector machine (SVM)), positive and unlabeled learning (PU-Learning) methods (PU-BP and PU-SVM), and typical synthetic sample-based imbalanced learning methods. Especially under extremely imbalanced situations, ISS-XGB can provide high accuracy for the minority class without losing overall performance (the average overall accuracy reaches 85.92%). The proposed strategy has great potential for solving imbalanced classification problems in remote sensing.

Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival

Scientific Reports ◽

10.1038/s41598-021-86327-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Arturo Moncada-Torres ◽

Marissa C. van Maaren ◽

Mathijs P. Hendriks ◽

Sabine Siesling ◽

Gijs Geleijnse

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Explicit Knowledge ◽

Cox Regression ◽

Metastatic Breast ◽

Gradient Boosting ◽

Support Vector ◽

Netherlands Cancer Registry ◽

Extreme Gradient Boosting ◽

The Impact

Cox Proportional Hazards (CPH) analysis is the standard for survival analysis in oncology. Recently, several machine learning (ML) techniques have been adapted for this task. Although they have been shown to yield results at least as good as classical methods, they are often disregarded because of their lack of transparency and little to no explainability, which are key for their adoption in clinical settings. In this paper, we used data from the Netherlands Cancer Registry of 36,658 non-metastatic breast cancer patients to compare the performance of CPH with ML techniques (Random Survival Forests, Survival Support Vector Machines, and Extreme Gradient Boosting [XGB]) in predicting survival using the c-index. We demonstrated that in our dataset, ML-based models can perform at least as well as the classical CPH regression (c-index ~0.63), and in the case of XGB even better (c-index ~0.73). Furthermore, we used Shapley Additive Explanation (SHAP) values to explain the models’ predictions. We concluded that the difference in performance can be attributed to XGB’s ability to model nonlinearities and complex interactions. We also investigated the impact of specific features on the models’ predictions as well as their corresponding insights. Lastly, we showed that explainable ML can generate explicit knowledge of how models make their predictions, which is crucial in increasing the trust and adoption of innovative ML techniques in oncology and healthcare overall.
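For readers unfamiliar with the c-index used above, here is a minimal sketch of Harrell's concordance index for right-censored data. It is a plain O(n²) illustration, not the authors' evaluation code; the function name and the toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's c-index: fraction of comparable pairs ranked in the right order.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if the subject was censored
    risks  : predicted risk scores (higher score = shorter expected survival)
    A pair (i, j) is comparable when subject i had an observed event strictly
    before time j. Tied risk scores count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: four subjects, the third one censored (illustrative values only).
c = concordance_index(times=[2, 4, 6, 8], events=[1, 1, 0, 1], risks=[0.9, 0.7, 0.8, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported values of roughly 0.63 and 0.73.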

Exploring the Mechanism of Crashes with Autonomous Vehicles Using Machine Learning

Mathematical Problems in Engineering ◽

10.1155/2021/5524356 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Hengrui Chen ◽

Hong Chen ◽

Ruiyu Zhou ◽

Zhizhen Liu ◽

Xiaoke Sun

Keyword(s):

Machine Learning ◽

Autonomous Vehicles ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Support Vector ◽

Crash Severity ◽

Apriori Algorithm ◽

Driving Mode ◽

Extreme Gradient Boosting ◽

The Impact

Safety has become a critical obstacle to the commercialization of autonomous vehicles (AVs). The objective of this study is to explore the mechanism of AV-involved crashes and analyze the impact of each feature on crash severity. We use the Apriori algorithm to explore the relationships among multiple factors and thereby the mechanism of crashes. We use various machine learning models, including support vector machine (SVM), classification and regression tree (CART), and eXtreme Gradient Boosting (XGBoost), to analyze crash severity. In addition, we apply Shapley Additive Explanations (SHAP) to interpret the importance of each factor. The results indicate that XGBoost obtains the best result (recall = 75%; G-mean = 67.82%). Both XGBoost and the Apriori algorithm provided meaningful insights into AV-involved crash characteristics and their relationships. Among all the features, vehicle damage, weather conditions, accident location, and driving mode are the most critical. We found that most rear-end crashes involve conventional vehicles bumping into the rear of AVs. Drivers should be extremely cautious when driving in fog, snow, and insufficient light. Drivers should also be careful when driving near intersections, especially in the autonomous driving mode.
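The two reported metrics can be sketched as follows, assuming a binary severity label and the usual definition of G-mean as the geometric mean of sensitivity (recall) and specificity. The abstract does not state whether the task was binary, and the confusion-matrix counts below are invented for illustration only.

```python
import math


def recall_and_g_mean(tp, fn, tn, fp):
    """Return (recall, G-mean) from binary confusion-matrix counts.

    recall (sensitivity) = TP / (TP + FN)
    specificity          = TN / (TN + FP)
    G-mean               = sqrt(recall * specificity)
    """
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return recall, math.sqrt(recall * specificity)


# Hypothetical counts, not the paper's data.
recall, g_mean = recall_and_g_mean(tp=75, fn=25, tn=64, fp=36)
```

Because G-mean multiplies the two class-wise rates, a classifier that ignores the minority class scores zero, which is why it is often preferred over plain accuracy on imbalanced crash data.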

Exploiting Rules to Enhance Machine Learning in Extracting Information From Multi-Institutional Prostate Pathology Reports

JCO Clinical Cancer Informatics ◽

10.1200/cci.20.00028 ◽

2020 ◽

pp. 865-874

Author(s):

Enrico Santus ◽

Tal Schuster ◽

Amir M. Tahmasebi ◽

Clara Li ◽

Adam Yala ◽

...

Keyword(s):

Machine Learning ◽

Hybrid Systems ◽

High Performance ◽

Feature Model ◽

Training Data ◽

Gradient Boosting ◽

Support Vector ◽

Data Set ◽

Extreme Gradient Boosting ◽

Pathology Reports

PURPOSE Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations. METHODS We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institutions to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB). RESULTS When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains. CONCLUSION We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.

Noise Prediction Using Machine Learning with Measurements Analysis

Applied Sciences ◽

10.3390/app10186619 ◽

2020 ◽

Vol 10 (18) ◽

pp. 6619

Author(s):

Po-Jiun Wen ◽

Chihpin Huang

Keyword(s):

Machine Learning ◽

Noise Exposure ◽

Learning Model ◽

Training Data ◽

Coefficient Of Determination ◽

Gradient Boosting ◽

Noise Prediction ◽

Time Duration ◽

Proposed Model ◽

The Impact

Noise prediction using machine learning has recently received increased attention, particularly for workplaces with noise pollution, where general laborers face elevated noise exposure. This study analyzes the equivalent noise level (Leq) at the National Synchrotron Radiation Research Center (NSRRC) facility and establishes a machine learning model for noise prediction. The study used the gradient boosting model (GBM) as the learning model, integrating past noise measurement records and many other features to make predictions. It analyzed the time duration and frequency of the collected Leq and also investigated the impact of training data selection. The results indicate that the proposed prediction model works well for almost all noise sensors and frequencies. Moreover, the model performed especially well for sensor 8 (125 Hz), which past noise measurements had identified as a serious noise zone. The results also show that the root-mean-square error (RMSE) of the predicted harmful noise was less than 1 dBA and the coefficient of determination (R²) was greater than 0.7. That is, the proposed method showed favorable noise prediction performance in the working field. This positive result shows the ability of the proposed approach to predict noise and thus to notify laborers so that they can avoid long-term exposure. In addition, the proposed model accurately predicts future noise pollution, which is essential for laborers in high-noise environments. Such predictions would help keep employees healthy by allowing them to avoid working in locations with harmful noise levels.
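The two evaluation metrics named above have standard definitions, sketched here in plain Python. The sample values are invented and are not measurements from the study.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error, in the same unit as the data (e.g. dBA)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical noise levels and predictions (illustrative values only).
measured = [60.0, 62.0, 64.0]
forecast = [61.0, 62.0, 63.0]
error = rmse(measured, forecast)
fit = r_squared(measured, forecast)
```

RMSE answers "how far off is a typical prediction", while R² answers "how much of the variation in the measurements does the model explain", so the two are usually reported together.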

A Study on Machine Vision Techniques for the Inspection of Health Personnels’ Protective Suits for the Treatment of Patients in Extreme Isolation

Electronics ◽

10.3390/electronics8070743 ◽

2019 ◽

Vol 8 (7) ◽

pp. 743 ◽

Cited By ~ 1

Author(s):

Alice Stazio ◽

Juan G. Victores ◽

David Estevez ◽

Carlos Balaguer

Keyword(s):

Logistic Regression ◽

Machine Vision ◽

Training Data ◽

Gradient Boosting ◽

Support Vector ◽

Classification Algorithms ◽

Adaptive Boosting ◽

Blood Stains ◽

Extreme Gradient Boosting ◽

Vector Machines

Examining Personal Protective Equipment (PPE) to ensure the complete integrity of health personnel in contact with infected patients is one of the most necessary tasks when treating patients affected by infectious diseases, such as Ebola. This work focuses on the study of machine vision techniques for detecting possible defects on the PPE that could arise after contact with such patients. A preliminary study on the use of image classification algorithms to identify blood stains on PPE after the treatment of an infected patient is presented. To produce training data for these algorithms, a synthetic dataset was generated from a simulated model of a PPE suit with blood stains. The study then proceeded with images of the PPE with a physical emulation of blood stains, taken by a real prototype. The dataset shows a strong imbalance between positive and negative samples; therefore, all the selected classification algorithms are ones able to handle this kind of data. Classifiers range from Logistic Regression and Support Vector Machines to bagging and boosting techniques such as Random Forest, Adaptive Boosting, Gradient Boosting, and eXtreme Gradient Boosting. All these algorithms were evaluated on accuracy, precision, recall, and F1 score; execution times were also considered. The results are promising for all the classifiers; in particular, Logistic Regression proved to be the most suitable classification algorithm in terms of F1 score and execution time, considering both datasets.
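To illustrate why precision, recall, and F1 are reported alongside accuracy on an imbalanced dataset, here is a minimal sketch using the standard definitions. The counts are invented and do not come from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from binary confusion counts.

    Ratios with a zero denominator are reported as 0.0 (a common convention).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: 1000 samples, only 50 of them positive (stained).
always_negative = classification_metrics(tp=0, fp=0, fn=50, tn=950)
real_classifier = classification_metrics(tp=40, fp=20, fn=10, tn=930)
```

In this toy case a model that never predicts a stain still reaches 95% accuracy but has an F1 of 0, whereas the second model has a slightly higher accuracy and an F1 of about 0.73, which is the kind of difference accuracy alone hides.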

Rack Temperature Prediction Model Using Machine Learning after Stopping Computer Room Air Conditioner in Server Room

Energies ◽

10.3390/en13174300 ◽

2020 ◽

Vol 13 (17) ◽

pp. 4300

Author(s):

Kosuke Sasakura ◽

Takeshi Aoki ◽

Masayoshi Komatsu ◽

Takeshi Watanabe

Keyword(s):

Machine Learning ◽

High Heat ◽

Training Data ◽

Coefficient Of Determination ◽

Gradient Boosting ◽

Air Conditioner ◽

Tree Model ◽

Explanatory Variables ◽

Temperature Environment ◽

The Impact

Data centers (DCs) have become increasingly important in recent years, and highly efficient and reliable operation and management of DCs is now required. The heat density generated by racks and information and communication technology (ICT) equipment is predicted to rise in the future, so it is crucial to maintain an appropriate temperature environment in the server room, where high heat is generated, in order to ensure continuous service. It is especially important to predict changes in rack intake temperature in the server room when the computer room air conditioner (CRAC) is shut down, which can cause a rapid rise in temperature. However, it is quite difficult to predict the rack temperature accurately, which in turn makes it difficult to determine the impact on service in advance. In this research, we propose a model that predicts the rack intake temperature after the CRAC is shut down. Specifically, we use machine learning to construct a gradient boosting decision tree model with data from the CRAC, ICT equipment, and rack intake temperature. Experimental results demonstrate that the proposed method has a very high prediction accuracy: the coefficient of determination was 0.90 and the root mean square error (RMSE) was 0.54. Our model makes it possible to evaluate the impact on service and determine whether action to maintain the temperature environment is required. We also clarify the effect of the explanatory variables and the training data on the model accuracy.

XGBoost-Based Day-Ahead Load Forecasting Algorithm Considering Behind-the-Meter Solar PV Generation

Energies ◽

10.3390/en15010128 ◽

2021 ◽

Vol 15 (1) ◽

pp. 128

Author(s):

Dong-Jin Bae ◽

Bo-Sung Kwon ◽

Kyung-Bin Song

Keyword(s):

Load Forecasting ◽

Rapid Expansion ◽

Gradient Boosting ◽

Base Temperature ◽

Electric Load ◽

Solar Pv ◽

Model Case ◽

Extreme Gradient Boosting ◽

Pv Generation ◽

The Impact

With the rapid expansion of renewable energy, the penetration rate of behind-the-meter (BTM) solar photovoltaic (PV) generators is increasing in South Korea. Because BTM solar PV generation is not metered in real time, it distorts the electric load and increases load forecasting errors. To overcome the problems caused by BTM solar PV generation, an extreme gradient boosting (XGBoost) load forecasting algorithm is proposed. The capacity of the BTM solar PV generators is estimated from an investigation of the deviation of load using a grid search. The influence of external factors is accounted for by using the fluctuation of the load used by lighting appliances and by filtering data based on base temperature. As a result, the capacity of the BTM solar PV generators is accurately estimated. The distortion of the electric load is eliminated by the reconstituted load method, which adds the estimated BTM solar PV generation to the electric load, and load forecasting is then conducted using the XGBoost model. Case studies are performed to demonstrate the prediction accuracy of the proposed method. The accuracy of the proposed algorithm was improved by 21% and 29% in 2019 and 2020, respectively, compared with the MAPE of the LSTM model that does not reflect the impact of BTM solar PV.
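The reconstituted load idea and the MAPE comparison can be sketched as below. This assumes that "improved by 21%" means a relative reduction in MAPE against the baseline, which the abstract implies but does not state outright. All names and numbers are hypothetical.

```python
def reconstituted_load(metered_load, estimated_btm_pv):
    """Add the estimated behind-the-meter PV output back onto the metered load."""
    return [load + pv for load, pv in zip(metered_load, estimated_btm_pv)]


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    ratios = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


def relative_improvement(baseline_error, proposed_error):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline_error - proposed_error) / baseline_error


# Illustrative hourly values in MW (not data from the study).
metered = [900.0, 850.0, 880.0]
btm_pv = [0.0, 120.0, 60.0]
true_demand = reconstituted_load(metered, btm_pv)
example_mape = mape([100.0, 200.0], [110.0, 180.0])
example_gain = relative_improvement(baseline_error=5.0, proposed_error=4.0)
```

Forecasting the reconstituted series rather than the metered one removes the midday dip caused by unmetered PV, which is the distortion the abstract describes.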

Remote Diagnosis and Triaging Model for Skin Cancer Using EfficientNet and Extreme Gradient Boosting

Complexity ◽

10.1155/2021/5591614 ◽

2021 ◽

Vol 2021 ◽

pp. 1-13

Author(s):

Irfan Ullah Khan ◽

Nida Aslam ◽

Talha Anwar ◽

Sumayh S. Aljameel ◽

Mohib Ullah ◽

...

Keyword(s):

Skin Cancer ◽

Skin Lesion ◽

Clinical Data ◽

Cancer Diagnosis ◽

Gradient Boosting ◽

Automated Diagnosis ◽

Data Set ◽

Diagnosis System ◽

Extreme Gradient Boosting ◽

The Impact

Due to the successful application of machine learning techniques in several fields, the use of automated diagnosis systems in healthcare has been increasing at a high rate. The aim of the study is to propose an automated skin cancer diagnosis and triaging model, to explore the impact of integrating clinical features into the diagnosis, and to improve on the outcomes reported in the literature. We used an ensemble-learning framework consisting of the EfficientNetB3 deep learning model for skin lesion analysis and Extreme Gradient Boosting (XGB) for clinical data. The study used the PAD-UFES-20 data set, which consists of six unbalanced categories of skin cancer. To overcome the data imbalance, we used data augmentation. Experiments were conducted using skin lesion images alone and using the combination of skin lesion images and clinical data. We found that integrating clinical data with skin lesion images enhances automated diagnosis accuracy. Moreover, the proposed model outperformed the results achieved by the previous study on the PAD-UFES-20 data set, with an accuracy of 0.78, precision of 0.89, recall of 0.86, and F1 of 0.88. In conclusion, the study provides an improved automated diagnosis system to aid healthcare professionals and patients in skin cancer diagnosis and remote triaging.

An Interpretable Extreme Gradient Boosting Model to Predict Ash Fusion Temperatures

Minerals ◽

10.3390/min10060487 ◽

2020 ◽

Vol 10 (6) ◽

pp. 487

Author(s):

Maciej Rzychoń ◽

Alina Żogała ◽

Leokadia Róg

Keyword(s):

Coefficient Of Determination ◽

Gradient Boosting ◽

Important Indicator ◽

Upper Silesian Coal Basin ◽

Proposed Model ◽

Extreme Gradient Boosting ◽

The Impact ◽

Partial Dependence ◽

Individual Input

The hemispherical temperature (HT) is the most important indicator representing ash fusion temperatures (AFTs) in the Polish industry for assessing the suitability of coal for combustion as well as gasification purposes. For safe operation and energy saving, it is important to know, or to be able to predict, the value of this parameter. In this study, a non-linear model predicting the HT value from the ash oxide content of 360 coal samples from the Upper Silesian Coal Basin was developed. The proposed model was established using a machine learning method, the extreme gradient boosting (XGBoost) regressor. An important feature of models based on the XGBoost algorithm is the ability to determine the impact of individual input parameters on the predicted value using the feature importance (FI) technique. This method allowed the determination of the ash oxides having the greatest impact on the predicted HT. Then, the partial dependence plot (PDP) technique was used to visualize the effect of individual oxides on the predicted value. The results indicate that the proposed model can estimate the value of HT with high accuracy. The coefficient of determination (R²) of the prediction reached a satisfactory value of 0.88.
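The partial dependence technique mentioned above can be sketched without any particular library: fix one input to each value on a grid, leave the other inputs as observed, and average the model's predictions. The toy model and data below are hypothetical stand-ins, not the paper's XGBoost regressor or its oxide measurements.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction as one feature is swept over `grid`.

    predict       : callable taking one feature row and returning a number
    rows          : list of feature rows (lists) from the data set
    feature_index : index of the feature being examined
    grid          : values to substitute for that feature
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)             # copy so the data set is untouched
            modified[feature_index] = value  # force the feature to the grid value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


# Stand-in model: any fitted regressor's per-row prediction function could go here.
def toy_model(row):
    return 2.0 * row[0] + row[1]


samples = [[0.0, 1.0], [0.0, 3.0]]
pd_curve = partial_dependence(toy_model, samples, feature_index=0, grid=[0.0, 1.0])
```

Plotting `pd_curve` against the grid gives the PDP for that feature. Because the other inputs are averaged out, the curve shows the marginal effect of a single input on the predicted value.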
