Fine-Resolution Mapping of Soil Total Nitrogen across China Based on Weighted Model Averaging

Yue Zhou; Jie Xue; Songchao Chen; Yin Zhou; Zongzheng Liang; Nan Wang; Zhou Shi

doi:10.3390/rs12010085

Remote Sensing ◽

10.3390/rs12010085 ◽

2019 ◽

Vol 12 (1) ◽

pp. 85 ◽

Cited By ~ 6

Author(s):

Yue Zhou ◽

Jie Xue ◽

Songchao Chen ◽

Yin Zhou ◽

Zongzheng Liang ◽

...

Keyword(s):

Total Nitrogen ◽

Soil Depth ◽

Arable Land ◽

Model Averaging ◽

Nutrient Status ◽

Gradient Boosting ◽

Validation Data ◽

Data Set ◽

Extreme Gradient Boosting ◽

Weighted Model

Accurate estimates of the spatial distribution of total nitrogen (TN) in soil are fundamental for soil quality assessment, decision making in land management, and global nitrogen cycle modeling. In China, current maps are limited to individual regions or are of coarse resolution. In this study, we compiled a new 90-m resolution map of soil TN in China by the weighted summation of random forest and extreme gradient boosting. After harmonizing soil data from 4022 soil profiles into a fixed soil depth (0–20 cm) by equal area spline, 18 environmental covariates were employed to characterize the spatial pattern of soil TN in topsoil across China. The accuracy assessments from independent validation data showed that the weighted model averaging gave the best predictions with an acceptable R² (0.41). The prediction map showed that high-value areas of soil TN were mainly distributed in the eastern Tibetan Plateau, central Qilian Mountains and the north of the Greater Khingan Range. Climate factors had a considerable influence on the variation of the soil TN, and land-use types played a pivotal part in each climate zone. This high-resolution and high-quality soil TN data set in China can be very useful for future inventories of soil nitrogen, assessments of soil nutrient status, and management of arable land.
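
The abstract does not say how the weights for the two models were derived. As a minimal sketch, assuming weights inversely proportional to each model's validation mean squared error (an assumed scheme, not necessarily the authors'), the weighted summation of two prediction vectors could look like this. All function names and toy values are hypothetical.

```python
import numpy as np


def inverse_mse_weights(y_val, preds_by_model):
    """Return one weight per model, summing to 1 (assumed scheme: inverse validation MSE)."""
    y_val = np.asarray(y_val, dtype=float)
    mse = np.array(
        [np.mean((y_val - np.asarray(p, dtype=float)) ** 2) for p in preds_by_model]
    )
    inv = 1.0 / mse
    return inv / inv.sum()


def weighted_average(preds_by_model, weights):
    """Combine per-model predictions by weighted summation."""
    stacked = np.vstack([np.asarray(p, dtype=float) for p in preds_by_model])
    return np.asarray(weights, dtype=float) @ stacked


# Toy validation data standing in for soil TN observations (illustrative values only).
y_val = [1.0, 2.0, 3.0, 4.0]
rf_pred = [1.5, 2.5, 3.5, 4.5]   # hypothetical random forest output, MSE = 0.25
xgb_pred = [0.0, 1.0, 2.0, 3.0]  # hypothetical XGBoost output, MSE = 1.00

w = inverse_mse_weights(y_val, [rf_pred, xgb_pred])
combined = weighted_average([rf_pred, xgb_pred], w)
```

In this toy case the model with the lower validation error receives the larger weight, so the combined prediction sits closer to it.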

Exploiting Rules to Enhance Machine Learning in Extracting Information From Multi-Institutional Prostate Pathology Reports

JCO Clinical Cancer Informatics ◽

10.1200/cci.20.00028 ◽

2020 ◽

pp. 865-874

Author(s):

Enrico Santus ◽

Tal Schuster ◽

Amir M. Tahmasebi ◽

Clara Li ◽

Adam Yala ◽

...

Keyword(s):

Machine Learning ◽

Hybrid Systems ◽

High Performance ◽

Feature Model ◽

Training Data ◽

Gradient Boosting ◽

Support Vector ◽

Data Set ◽

Extreme Gradient Boosting ◽

Pathology Reports

PURPOSE Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations. METHODS We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institutions to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB). RESULTS When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains. CONCLUSION We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.
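
The abstract names the two hybrid systems without defining them. The sketch below is one plausible reading, stated as an assumption: "Rule as Feature" appends the rule's output to the classifier's input features, and "Classifier Confidence" keeps the classifier's prediction only when it is confident and otherwise falls back to the rule. The rule, the threshold, and all names are hypothetical.

```python
def rule_predict(segment):
    """Hypothetical hand-crafted rule: flag a segment that mentions 'adenocarcinoma'."""
    return 1 if "adenocarcinoma" in segment.lower() else 0


def rule_as_feature(features, segment):
    """Append the rule's output to an existing feature vector (assumed reading)."""
    return list(features) + [rule_predict(segment)]


def classifier_confidence(prob_positive, segment, threshold=0.8):
    """Use the classifier when it is confident; otherwise fall back to the rule.

    `threshold` is an illustrative value, not one reported in the abstract.
    """
    confidence = max(prob_positive, 1.0 - prob_positive)
    if confidence >= threshold:
        return 1 if prob_positive >= 0.5 else 0
    return rule_predict(segment)


segment = "Prostatic adenocarcinoma, Gleason 3+4"
augmented = rule_as_feature([0.2, 0.7], segment)
uncertain = classifier_confidence(0.55, segment)  # low confidence, rule decides
confident = classifier_confidence(0.05, segment)  # high confidence, classifier decides
```

Both variants let scarce annotations be supplemented by knowledge already encoded in the rules, which is the motivation the abstract gives.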

Nutrient status of rangeland in upper Mustang

Banko Janakari ◽

10.3126/banko.v24i1.13489 ◽

2015 ◽

Vol 24 (1) ◽

pp. 41-46

Author(s):

M. Maharjan ◽

K. D. Awasthi ◽

K. R. Pande ◽

N. Thapa

Keyword(s):

Total Nitrogen ◽

Soil Depth ◽

Soil Nutrient ◽

Nutrient Status ◽

Available Phosphorus ◽

Top Soil ◽

Palatable Species ◽

North And South ◽

Nitrogen Phosphorus ◽

Livestock Rearing

The study aimed to assess the nutrient status of rangeland in upper Mustang. The assessment is necessary to understand the soil quality and productivity of the rangeland. Livestock rearing is one of the main occupations in upper Mustang, but owing to a lack of palatable species for livestock, people are now leaving the occupation, which directly affects their livelihoods. This research was therefore carried out to find out whether soil nutrient status is the reason for the lack of palatable species in the rangeland. Soil was sampled on north and south aspects at altitudes of 3850 m, 3650 m and 3450 m. Samples were taken from the soil profile down to 60 cm depth at 20 cm intervals. Available phosphorus and available potassium were higher on the north aspect, whereas total nitrogen was higher on the south aspect. Both total nitrogen and available phosphorus were high at 3650 m. Available potassium decreased gradually with increasing altitude. Total nitrogen, available potassium and available phosphorus decreased gradually with increasing soil depth. Nutrient status was high in the top soil (0-20 cm). The soil nutrient (nitrogen, phosphorus, potassium) status was found to be good in the study area. Further research on the biophysical and ecological aspects of rangeland in upper Mustang is necessary to manage it properly.

Mapping of soil properties at high resolution in Switzerland using boosted geoadditive models

10.5194/soil-2017-13 ◽

2017 ◽

Cited By ~ 1

Author(s):

Madlene Nussbaum ◽

Lorenz Walthert ◽

Marielle Fraefel ◽

Lucie Greiner ◽

Andreas Papritz

Keyword(s):

High Resolution ◽

Soil Properties ◽

Agricultural Land ◽

Model Building ◽

Soil Depth ◽

Environmental Data ◽

Gradient Boosting ◽

Validation Data ◽

Large Sets ◽

Skill Scores

Abstract. High-resolution maps of soil properties are a prerequisite for assessing soil threats and soil functions and to foster sustainable use of soil resources. For many regions in the world precise maps of soil properties are missing, but often sparsely sampled and discontinuous (legacy) soil data are available. Soil property data (response) can then be related by digital soil mapping (DSM) to spatially exhaustive environmental data that describe soil forming factors (covariates) to create spatially continuous maps. With air- and spaceborne remote sensing data and multi-scale terrain analysis large sets of covariates have become common. Building parsimonious models, amenable to pedological interpretation, is then a challenging task. We propose a new boosted geoadditive modelling framework (geoGAM) for DSM. A geoGAM models smooth nonlinear relations between responses and single covariates and combines these model terms additively. Residual spatial autocorrelation is captured by a smooth function of spatial coordinates and nonstationary effects are included by interactions between covariates and smooth spatial functions. The core of fully automated model building for geoGAM is componentwise gradient boosting. We illustrate the application of the geoGAM framework by using soil data from the Canton of Zurich, Switzerland. We modelled effective cation exchange capacity (ECEC) in forest topsoils as continuous response. For agricultural land we predicted the presence of waterlogged horizons in given soil depth layers as binary and drainage classes as ordinal responses. For the latter we used proportional odds geoGAM taking the ordering of the response properly into account. Fitted geoGAM contained only a few covariates (7 to 17) selected from large sets (333 covariates for forests, 498 for agricultural land). Model sparsity allowed covariate interpretation by partial effects plots. Prediction intervals were computed by model-based bootstrapping for ECEC.
Predictive performance of the fitted geoGAM, tested with independent validation data and specific skill scores (SS) for continuous, binary and ordinal responses, compared well with other studies that modelled similar soil properties. SS of 0.23 up to 0.53 (with SS = 1 for perfect predictions and SS = 0 for zero explained variance) were achieved depending on response and type of score. geoGAM combines efficient model building from large sets of covariates with ease of effect interpretation and therefore likely raises the acceptance of DSM products by end-users.
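
The abstract describes the skill score only by its end points (SS = 1 for perfect predictions, SS = 0 for zero explained variance). A minimal sketch for the continuous-response case, assuming the usual mean-squared-error form SS = 1 - Σ(y - ŷ)² / Σ(y - ȳ)², is given below. The scores used for binary and ordinal responses are not covered, and the function name is hypothetical.

```python
import numpy as np


def mse_skill_score(y_obs, y_pred):
    """MSE-based skill score: 1 for perfect predictions, 0 when no variance is explained."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_obs - y_pred) ** 2)          # squared prediction errors
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)    # squared deviations from the mean
    return 1.0 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]
perfect = mse_skill_score(y, y)                      # predictions equal observations
mean_only = mse_skill_score(y, [2.5, 2.5, 2.5, 2.5])  # always predict the observed mean
```

Under this definition a model that does worse than predicting the mean gets a negative score.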

Discovery of Depression-Associated Factors From a Nationwide Population-Based Survey: Epidemiological Study Using Machine Learning and Network Analysis (Preprint)

10.2196/preprints.27344 ◽

2021 ◽

Author(s):

Sang Min Nam ◽

Thomas A Peterson ◽

Kyoung Yul Seo ◽

Hyun Wook Han ◽

Jee In Kang

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Network Analysis ◽

Survey Data ◽

Associated Factors ◽

Statistical Tests ◽

Epidemiological Studies ◽

Gradient Boosting ◽

Data Set ◽

Extreme Gradient Boosting

BACKGROUND In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large. OBJECTIVE Our study had two aims. First, we aimed to identify essential depression-associated factors using the extreme gradient boosting (XGBoost) machine learning algorithm from big survey data (the Korea National Health and Nutrition Examination Survey, 2012-2016). Second, we aimed to achieve a comprehensive understanding of multifactorial features in depression using network analysis. METHODS An XGBoost model was trained and tested to classify “current depression” and “no lifetime depression” for a data set of 120 variables for 12,596 cases. The optimal XGBoost hyperparameters were set by an automated machine learning tool (TPOT), and a high-performance sparse model was obtained by feature selection using the feature importance value of XGBoost. We performed statistical tests on the model and nonmodel factors using survey-weighted multiple logistic regression and drew a correlation network among factors. We also adopted statistical tests for the confounder or interaction effect of selected risk factors when it was suspected on the network. RESULTS The XGBoost-derived depression model consisted of 18 factors with an area under the weighted receiver operating characteristic curve of 0.86. Two nonmodel factors could be found using the model factors, and the factors were classified into direct (P<.05) and indirect (P≥.05), according to the statistical significance of the association with depression. Perceived stress and asthma were the most remarkable risk factors, and urine specific gravity was a novel protective factor. The depression-factor network showed clusters of socioeconomic status and quality of life factors and suggested that educational level and sex might be predisposing factors.
Indirect factors (eg, diabetes, hypercholesterolemia, and smoking) were involved in confounding or interaction effects of direct factors. Triglyceride level was a confounder of hypercholesterolemia and diabetes, smoking had a significant risk in females, and weight gain was associated with depression involving diabetes. CONCLUSIONS XGBoost and network analysis were useful to discover depression-related factors and their relationships and can be applied to epidemiological studies using big survey data.

Remote Diagnosis and Triaging Model for Skin Cancer Using EfficientNet and Extreme Gradient Boosting

Complexity ◽

10.1155/2021/5591614 ◽

2021 ◽

Vol 2021 ◽

pp. 1-13

Author(s):

Irfan Ullah Khan ◽

Nida Aslam ◽

Talha Anwar ◽

Sumayh S. Aljameel ◽

Mohib Ullah ◽

...

Keyword(s):

Skin Cancer ◽

Skin Lesion ◽

Clinical Data ◽

Cancer Diagnosis ◽

Gradient Boosting ◽

Automated Diagnosis ◽

Data Set ◽

Diagnosis System ◽

Extreme Gradient Boosting ◽

The Impact

Due to the successful application of machine learning techniques in several fields, the use of automated diagnosis systems in healthcare has been increasing at a high rate. The aim of the study is to propose an automated skin cancer diagnosis and triaging model, to explore the impact of integrating clinical features into the diagnosis, and to improve on the outcomes reported in the literature. We used an ensemble-learning framework consisting of the EfficientNetB3 deep learning model for skin lesion analysis and Extreme Gradient Boosting (XGB) for clinical data. The study used the PAD-UFES-20 data set, which consists of six unbalanced categories of skin cancer. To overcome the data imbalance, we used data augmentation. Experiments were conducted using skin lesion data alone and using the combination of skin lesion and clinical data. We found that integrating clinical data with skin lesions enhances automated diagnosis accuracy. Moreover, the proposed model outperformed the results achieved by the previous study for the PAD-UFES-20 data set, with an accuracy of 0.78, precision of 0.89, recall of 0.86, and F1 of 0.88. In conclusion, the study provides an improved automated diagnosis system to aid healthcare professionals and patients in skin cancer diagnosis and remote triaging.

Application of Bayesian Hyperparameter Optimized Random Forest and XGBoost Model for Landslide Susceptibility Mapping

Frontiers in Earth Science ◽

10.3389/feart.2021.712240 ◽

2021 ◽

Vol 9 ◽

Author(s):

Shibao Wang ◽

Jianqi Zhuang ◽

Jia Zheng ◽

Hongyu Fan ◽

Jiaxu Kong ◽

...

Keyword(s):

Random Forest ◽

Decision Tree ◽

Landslide Susceptibility ◽

Susceptibility Mapping ◽

Landslide Susceptibility Mapping ◽

Gradient Boosting ◽

The Loess Plateau ◽

Tree Model ◽

Validation Data ◽

Extreme Gradient Boosting

Landslides are widely distributed worldwide and often result in tremendous casualties and economic losses, especially in the Loess Plateau of China. Taking Wuqi County in the hinterland of the Loess Plateau as the research area, this study uses Bayesian hyperparameter optimization to tune random forest and extreme gradient boosting decision tree models for landslide susceptibility mapping and compares the two optimized models. In addition, 14 landslide influencing factors were selected, and 734 landslides were identified from field investigation and literature reports. The landslides were randomly divided into training data (70%) and validation data (30%). The hyperparameters of the random forest and extreme gradient boosting decision tree models were optimized using a Bayesian algorithm, and the optimal hyperparameters were then used for landslide susceptibility mapping. Both models were evaluated and compared using the receiver operating characteristic curve and confusion matrix. The results show that the validation-data AUC values of the Bayesian optimized random forest and extreme gradient boosting decision tree models are 0.88 and 0.86, respectively, improvements of 4% and 3%, indicating that the prediction performance of both models has improved. The random forest model, however, has a higher predictive ability than the extreme gradient boosting decision tree model. Thus, hyperparameter optimization is of great significance in improving the prediction accuracy of the model, and the optimized model can generate a high-quality landslide susceptibility map.
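
The models above are compared by the area under the ROC curve (AUC). As a minimal sketch of what that number measures, the AUC can be computed as the fraction of (landslide, non-landslide) pairs in which the landslide cell receives the higher susceptibility score, with ties counted as half. The function name and the toy data are hypothetical, and the study's own evaluation code is not shown in the abstract.

```python
import numpy as np


def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]    # scores of landslide cells
    neg = scores[~labels]   # scores of non-landslide cells
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)


# Toy validation set: 1 = landslide, 0 = no landslide (illustrative values only).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of samples, which is fine for a validation set of a few hundred landslides but would be replaced by a rank-based formula for large rasters.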

KDClassifier: A urinary proteomic spectra analysis tool based on machine learning for the classification of kidney diseases

Aging Pathobiology and Therapeutics ◽

10.31491/apt.2021.09.064 ◽

2021 ◽

Vol 3 (3) ◽

pp. 63-72

Author(s):

Wanjun Zhao ◽

Keyword(s):

Kidney Disease ◽

Kidney Diseases ◽

Confusion Matrix ◽

Gradient Boosting ◽

Support Vector ◽

Diagnostic Model ◽

Analysis Tool ◽

Data Set ◽

Extreme Gradient Boosting

Background: We aimed to establish a novel diagnostic model for kidney diseases by combining artificial intelligence with complete mass spectrum information from urinary proteomics. Methods: We enrolled 134 patients (IgA nephropathy, membranous nephropathy, and diabetic kidney disease) and 68 healthy participants as controls, with a total of 610,102 mass spectra from their urinary proteomic profiles. The training data set (80%) was used to create a diagnostic model using XGBoost, random forest (RF), a support vector machine (SVM), and artificial neural networks (ANNs). The diagnostic accuracy was evaluated using a confusion matrix with a test data set (20%). We also constructed receiver operating characteristic, Lorenz, and gain curves to evaluate the diagnostic model. Results: Compared with the RF, SVM, and ANNs, the modified XGBoost model, called Kidney Disease Classifier (KDClassifier), showed the best performance. The accuracy of the XGBoost diagnostic model was 96.03%. The area under the curve of the extreme gradient boosting (XGBoost) model was 0.952 (95% confidence interval, 0.9307–0.9733). The Kolmogorov-Smirnov (KS) value of the Lorenz curve was 0.8514. The Lorenz and gain curves showed the strong robustness of the developed model. Conclusions: The KDClassifier achieved high accuracy and robustness and thus provides a potential tool for the classification of kidney diseases.
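
The abstract reports a KS value without defining it. A minimal sketch, assuming the common definition as the largest gap between the cumulative true-positive and false-positive rates when cases are ranked by predicted score, is shown below. It handles the binary case only, whereas the study distinguishes several disease classes, and it does not treat tied scores specially. Names and data are hypothetical.

```python
import numpy as np


def ks_statistic(labels, scores):
    """Maximum gap between cumulative positive and negative rates, ranked by score."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    ranked = labels[order]
    tpr = np.cumsum(ranked) / ranked.sum()        # share of positives captured so far
    fpr = np.cumsum(~ranked) / (~ranked).sum()    # share of negatives captured so far
    return float(np.max(np.abs(tpr - fpr)))


# Toy test set: 1 = disease, 0 = healthy control (illustrative values only).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
ks = ks_statistic(labels, scores)
```

A KS value near 1 means the score separates the two groups almost completely, and a value near 0 means the two cumulative curves nearly coincide.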

A Machine Learning Study to Improve Surgical Case Duration Prediction

10.21203/rs.3.rs-40927/v1 ◽

2020 ◽

Author(s):

Ching-Chieh Huang ◽

Jesyin Lai ◽

Der-Yang Cho ◽

Jiaxin Yu

Keyword(s):

Machine Learning ◽

Predictive Accuracy ◽

Healthcare Management ◽

Gradient Boosting ◽

External Evaluation ◽

Data Set ◽

Surgical Case ◽

Case Duration ◽

Extreme Gradient Boosting ◽

Duration Prediction

Abstract Since the emergence of COVID-19, many hospitals have encountered challenges in performing efficient scheduling and good resource management to ensure the quality of healthcare provided to patients is not compromised. Operating room (OR) scheduling is one of the issues that has gained our attention because it is related to workflow efficiency and critical care of hospitals. Automatic scheduling and high predictive accuracy of surgical case duration have a critical role in improving OR utilization. To estimate surgical case duration, many hospitals rely on historic averages based on a specific surgeon or a specific procedure type obtained from electronic medical record (EMR) scheduling systems. However, the low predictive accuracy with EMR data leads to negative impacts on patients and hospitals, such as rescheduling of surgeries and cancellation. In this study, we aim to improve the prediction of surgical case duration with advanced machine learning (ML) algorithms. We obtained a large data set containing 170,748 surgical cases (from Jan 2017 to Dec 2019) from a hospital. The data covered a broad variety of details on patients, surgeries, specialties and surgical teams. In addition, a more recent data set with 8,672 cases (from Mar to Apr 2020) was available to be used for external evaluation. We computed historic averages from the EMR data for surgeon- or procedure-specific cases, and they were used as baseline models for comparison. Subsequently, we developed our models using linear regression, random forest and extreme gradient boosting (XGB) algorithms. All models were evaluated with R-square (R2), mean absolute error (MAE), and percentage overage (actual duration longer than prediction), underage (shorter than prediction) and within (within prediction). The XGB model was superior to the other models, achieving a higher R2 (85 %) and percentage within (48 %) as well as a lower MAE (30.2 min). 
The total prediction errors computed for all models showed that the XGB model had the lowest inaccurate percentage (23.7 %). Overall, this study applied ML techniques in the field of OR scheduling to reduce the medical and financial burden for healthcare management. The results revealed the importance of surgery and surgeon factors in surgical case duration prediction. This study also demonstrated the importance of performing an external evaluation to better validate the performance of ML models.
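
The evaluation metrics named above can be sketched as follows. The abstract defines overage as actual duration longer than predicted and underage as shorter, but it does not state how wide the "within" band is, so the tolerance `tol_min` below is a purely hypothetical parameter, as are the function name and toy durations.

```python
import numpy as np


def duration_metrics(actual_min, predicted_min, tol_min=15.0):
    """MAE plus percentage of cases over, under, and within an assumed tolerance band."""
    actual = np.asarray(actual_min, dtype=float)
    predicted = np.asarray(predicted_min, dtype=float)
    err = actual - predicted  # positive: surgery ran longer than predicted
    return {
        "mae": float(np.mean(np.abs(err))),
        "overage_pct": float(np.mean(err > tol_min) * 100.0),
        "underage_pct": float(np.mean(err < -tol_min) * 100.0),
        "within_pct": float(np.mean(np.abs(err) <= tol_min) * 100.0),
    }


# Toy durations in minutes (illustrative values only).
actual = [60, 120, 90, 200]
predicted = [50, 150, 90, 170]
metrics = duration_metrics(actual, predicted)
```

The three percentages always sum to 100, so reporting "within" alongside MAE shows both how many cases are usable for scheduling and how large the typical miss is.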

Evaluation of Total Nitrogen in Water via Airborne Hyperspectral Data: Potential of Fractional Order Discretization Algorithm and Discrete Wavelet Transform Analysis

Remote Sensing ◽

10.3390/rs13224643 ◽

2021 ◽

Vol 13 (22) ◽

pp. 4643

Author(s):

Jinhua Liu ◽

Jianli Ding ◽

Xiangyu Ge ◽

Jingzhe Wang

Keyword(s):

Water Quality ◽

Wavelet Transform ◽

Discrete Wavelet Transform ◽

Fractional Order ◽

Total Nitrogen ◽

Hyperspectral Data ◽

Spectral Information ◽

Gradient Boosting ◽

Discrete Wavelet ◽

Extreme Gradient Boosting

Controlling and managing surface source pollution depends on the rapid monitoring of total nitrogen in water. However, the complex factors affecting water quality (plant shading and suspended matter in water) make direct estimation extremely challenging. Considering the spectral response mechanisms of emergent plants, we coupled discrete wavelet transform (DWT) and fractional order discretization (FOD) techniques with three machine learning models (random forest (RF), bagging algorithm (bagging), and eXtreme Gradient Boosting (XGBoost)) to mine this potential spectral information. A total of 567 models were developed, and airborne hyperspectral data processed with various DWT scales and FOD techniques were compared. The effective information in the hyperspectral reflectance data was better emphasized after DWT processing. After DWT processing the original spectrum (OR), its sensitivity to TN in water was maximally improved by 0.22, and the correlation between FOD and TN in water was optimally increased by 0.57. The transformed spectral information enhanced the TN model accuracy, especially for FOD after DWT. For RF, 82% of the model R² values improved by 0.02~0.72 compared to the model using FOD spectra; 78.8% of the bagging values improved by 0.01~0.53 and 65.0% of the XGBoost values improved by 0.01~0.64. The XGBoost model with DWT coupled with grey relation analysis (GRA) yielded the best estimation accuracy, with the highest precision of R² = 0.91 for L6. In conclusion, appropriately scaled DWT analysis can substantially improve the accuracy of extracting TN from UAV hyperspectral images. These outcomes may facilitate the further development of accurate water quality monitoring in sophisticated global waters from drone or satellite hyperspectral data.

Dementia risks identified by vocal features via telephone conversations: A novel machine learning prediction model

PLoS ONE ◽

10.1371/journal.pone.0253988 ◽

2021 ◽

Vol 16 (7) ◽

pp. e0253988

Author(s):

Akihiro Shimoda ◽

Yue Li ◽

Hana Hayashi ◽

Naoki Kondo

Keyword(s):

Machine Learning ◽

Prediction Model ◽

Predictive Performance ◽

Gradient Boosting ◽

Validation Data ◽

Audio File ◽

Extreme Gradient Boosting ◽

Audio Data ◽

Data Files ◽

Audio Files

Because early diagnosis of Alzheimer’s disease (AD) is difficult owing to cost and differentiated capability, it is necessary to identify low-cost, accessible, and reliable tools for identifying AD risk in the preclinical stage. We hypothesized that cognitive ability, as expressed in the vocal features in daily conversation, is associated with AD progression. Thus, we have developed a novel machine learning prediction model to identify AD risk by using the rich voice data collected from daily conversations, and evaluated its predictive performance in comparison with a classification method based on the Japanese version of the Telephone Interview for Cognitive Status (TICS-J). We used 1,465 audio data files from 99 healthy controls (HC) and 151 audio data files recorded from 24 AD patients derived from a dementia prevention program conducted by Hachioji City, Tokyo, between March and May 2020. After extracting vocal features from each audio file, we developed machine-learning models based on extreme gradient boosting (XGBoost), random forest (RF), and logistic regression (LR), using each audio file as one observation. We evaluated the predictive performance of the developed models by describing the receiver operating characteristic (ROC) curve and calculating the areas under the curve (AUCs), sensitivity, and specificity. Further, we conducted classifications by considering each participant as one observation, computing the average of their audio files’ predictive value, and making comparisons with the predictive performance of the TICS-J based questionnaire. Of 1,616 audio files in total, 1,308 (81.0%) were randomly allocated to the training data and 308 (19.1%) to the validation data. For audio file-based prediction, the AUCs for XGBoost, RF, and LR were 0.863 (95% confidence interval [CI]: 0.794–0.931), 0.882 (95% CI: 0.840–0.924), and 0.893 (95% CI: 0.832–0.954), respectively.
For participant-based prediction, the AUCs for XGBoost, RF, LR, and TICS-J were 1.000 (95% CI: 1.000–1.000), 1.000 (95% CI: 1.000–1.000), 0.972 (95% CI: 0.918–1.000) and 0.917 (95% CI: 0.918–1.000), respectively. The difference in predictive accuracy between XGBoost and TICS-J approached significance (p = 0.065). Our novel prediction model using the vocal features of daily conversations demonstrated the potential to be useful for AD risk assessment.
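
The participant-based prediction described above averages each participant's audio-file scores into one value. A minimal sketch of that aggregation step follows; the function name, participant IDs, and scores are hypothetical.

```python
from collections import defaultdict


def participant_scores(participant_ids, file_scores):
    """Average the per-file predicted values for each participant."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for pid, score in zip(participant_ids, file_scores):
        totals[pid] += score
        counts[pid] += 1
    return {pid: totals[pid] / counts[pid] for pid in totals}


# Toy data: three audio files from two participants (illustrative values only).
ids = ["A", "A", "B"]
scores = [0.2, 0.4, 0.9]
per_participant = participant_scores(ids, scores)
```

Averaging smooths out file-level noise, which is consistent with the participant-based AUCs in the abstract being higher than the file-based ones.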
