Computational prediction and interpretation of cell-specific replication origin sites from multiple eukaryotes by exploiting stacking framework

Briefings in Bioinformatics ◽

10.1093/bib/bbaa275 ◽

2020 ◽

Cited By ~ 1

Author(s):

Leyi Wei ◽

Wenjia He ◽

Adeel Malik ◽

Ran Su ◽

Lizhen Cui ◽

...

Keyword(s):

DNA Replication ◽

Prediction Models ◽

Homo Sapiens ◽

Computational Prediction ◽

Specific Model ◽

Gradient Boosting ◽

Replication Process ◽

Extreme Gradient Boosting ◽

Feature Encoding ◽

Process Detection

Abstract. Origins of replication sites (ORIs), which refer to the initiation locations of genomic DNA replication, play essential roles in the DNA replication process. Detecting the genome-scale distribution of ORIs is a key step toward an in-depth understanding of their regulation mechanisms. In this study, we present a novel machine learning-based approach called Stack-ORI, encompassing 10 cell-specific prediction models for identifying ORIs in four eukaryotic species (Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis thaliana). For each cell-specific model, we employed 12 feature encoding schemes that cover nucleic acid composition, position-specific and physicochemical property information. The optimal feature set was identified for each encoding individually, and the respective baseline models were developed using the eXtreme Gradient Boosting (XGBoost) classifier. Subsequently, the predicted scores of the 12 baseline models were integrated as a novel feature vector to train XGBoost and develop the final model. Extensive experimental results show that Stack-ORI achieves significantly better performance than its baseline models on both training and independent datasets. Stack-ORI also consistently outperforms the existing predictor in all cell-specific models, on both the training and the independent test sets. Moreover, our approach provides interpretations that help explain model success by leveraging the SHapley Additive exPlanation algorithm, highlighting the feature encoding schemes that matter most for predicting cell-specific ORIs.

Evaluating Modeling and Validation Strategies for Tooth Loss

Journal of Dental Research ◽

10.1177/0022034519864889 ◽

2019 ◽

Vol 98 (10) ◽

pp. 1088-1095 ◽

Cited By ~ 2

Author(s):

J. Krois ◽

C. Graetz ◽

B. Holtfreter ◽

P. Brinkmann ◽

T. Kocher ◽

...

Keyword(s):

Tooth Loss ◽

Predictive Power ◽

Prediction Models ◽

Recursive Partitioning ◽

External Validation ◽

Model Development ◽

Gradient Boosting ◽

Extreme Gradient Boosting ◽

Complex Models ◽

Development And Validation

Prediction models learn patterns from available data (training) and are then validated on new data (testing). Prediction modeling is increasingly common in dental research. We aimed to evaluate how different model development and validation steps affect the predictive performance of tooth loss prediction models for patients with periodontitis. Two independent cohorts (627 patients, 11,651 teeth) were followed over a mean ± SD of 18.2 ± 5.6 y (Kiel cohort) and 6.6 ± 2.9 y (Greifswald cohort). Tooth loss and 10 patient- and tooth-level predictors were recorded. The impact of different model development and validation steps was evaluated: 1) model complexity (logistic regression, recursive partitioning, random forest, extreme gradient boosting), 2) sample size (full data set or 10%, 25%, or 75% of cases dropped at random), 3) prediction periods (maximum 10, 15, or 20 y or uncensored), and 4) validation schemes (internal or external by centers/time). Tooth loss was generally a rare event (880 teeth were lost). All models showed limited sensitivity but high specificity. Patients’ age and tooth loss at baseline as well as probing pocket depths showed high variable importance. More complex models (random forest, extreme gradient boosting) had no consistent advantages over simpler ones (logistic regression, recursive partitioning). Internal validation (in sample) overestimated the predictive power (area under the curve up to 0.90), while external validation (out of sample) found lower areas under the curve (range 0.62 to 0.82). Reducing the sample size decreased the predictive power, particularly for more complex models. Censoring the prediction period had only limited impact. When the model was trained in one period and tested in another, model outcomes were similar to the base case, indicating temporal validation as a valid option. No model showed higher accuracy than the no-information rate. In conclusion, none of the developed models would be useful in a clinical setting, despite high accuracy. During modeling, rigorous development and external validation should be applied and reported accordingly.
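
The remark that high accuracy can coexist with no clinical usefulness follows from the rarity of the event. As a minimal sketch, using only the pooled counts quoted in the abstract (880 lost teeth among 11,651), the no-information rate is the accuracy of a trivial rule that always predicts the majority class. The function name is illustrative and not taken from the paper, and the per-cohort splits the authors actually evaluated are not reproduced here.

```python
def no_information_rate(n_events: int, n_total: int) -> float:
    """Accuracy obtained by always predicting the more frequent class."""
    if n_total <= 0 or not 0 <= n_events <= n_total:
        raise ValueError("need 0 <= n_events <= n_total and n_total > 0")
    event_rate = n_events / n_total
    # The majority class is whichever of "event" / "no event" is more common.
    return max(event_rate, 1.0 - event_rate)


# Pooled counts from the abstract: 880 teeth lost out of 11,651 followed.
nir = no_information_rate(880, 11_651)
print(f"no-information rate: {nir:.4f}")
```

On these pooled counts, a rule that predicts "tooth retained" for every tooth is already right for more than nine teeth in ten, so accuracy alone says little about whether a model finds the teeth that are lost.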

Using eXtreme Gradient BOOSTing to Predict Changes in Tropical Cyclone Intensity over the Western North Pacific

Atmosphere ◽

10.3390/atmos10060341 ◽

2019 ◽

Vol 10 (6) ◽

pp. 341 ◽

Cited By ~ 4

Author(s):

Qingwen Jin ◽

Xiangtao Fan ◽

Jian Liu ◽

Zhuxin Xue ◽

Hongdeng Jian

Keyword(s):

North Pacific ◽

Western North Pacific ◽

Prediction Models ◽

Weather Prediction ◽

Back Propagation ◽

Absolute Error ◽

Gradient Boosting ◽

Lead Times ◽

Intensity Prediction ◽

Extreme Gradient Boosting

Coastal cities in China are frequently hit by tropical cyclones (TCs), which result in tremendous loss of life and property. Even though the capability of numerical weather prediction models to forecast and track TCs has considerably improved in recent years, forecasting the intensity of a TC is still very difficult; thus, it is necessary to improve the accuracy of TC intensity prediction. To this end, we established a series of predictors using the Best Track TC dataset to predict the intensity of TCs in the Western North Pacific with an eXtreme Gradient BOOSTing (XGBOOST) model. The climatology and persistence factors, environmental factors, brainstorm features, intensity categories, and TC months are considered inputs for the models, while the output is the TC intensity. The performance of the XGBOOST model was tested for very strong TCs such as Hato (2017), Rammasun (2014), Mujigae (2015), and Hagupit (2014). The results show that the chosen combination of inputs provided the optimal predictors for TC intensification with lead times of 6, 12, 18, and 24 h. Furthermore, the mean absolute error (MAE) of the XGBOOST model was much smaller than the MAE of a back propagation neural network (BPNN) used to predict TC intensity. The MAEs of the XGBOOST forecasts with 6, 12, 18, and 24 h lead times for the test samples were 1.61, 2.44, 3.10, and 3.70 m/s, respectively. The results indicate that the XGBOOST model developed in this study can be used to improve TC intensity forecast accuracy and can be considered a better alternative to conventional operational forecast models for TC intensity prediction.

Clinical and Laboratory Predictors of In-hospital Mortality in Patients With Coronavirus Disease-2019: A Cohort Study in Wuhan, China

Clinical Infectious Diseases ◽

10.1093/cid/ciaa538 ◽

2020 ◽

Vol 71 (16) ◽

pp. 2079-2088 ◽

Cited By ~ 52

Author(s):

Kun Wang ◽

Peiyuan Zuo ◽

Yuwei Liu ◽

Meng Zhang ◽

Xiaofang Zhao ◽

...

Keyword(s):

Hospital Mortality ◽

Prediction Models ◽

Area Under The Curve ◽

Mortality Prediction ◽

Gradient Boosting ◽

Laboratory Model ◽

Training Cohort ◽

Clinical Model ◽

Extreme Gradient Boosting ◽

Mortality Prediction Models

Abstract. Background: This study aimed to develop mortality-prediction models for patients with coronavirus disease-2019 (COVID-19). Methods: The training cohort included consecutive COVID-19 patients at the First People’s Hospital of Jiangxia District in Wuhan, China, from 7 January 2020 to 11 February 2020. We selected baseline data through the stepwise Akaike information criterion and an ensemble XGBoost (extreme gradient boosting) model to build mortality-prediction models. We then validated these models in randomly collected COVID-19 patients at Union Hospital, Wuhan, from 1 January 2020 to 20 February 2020. Results: A total of 296 COVID-19 patients were enrolled in the training cohort; 19 died during hospitalization and 277 were discharged from the hospital. The clinical model developed using age, history of hypertension, and coronary heart disease showed area under the curve (AUC), 0.88 (95% confidence interval [CI], .80–.95); threshold, −2.6551; sensitivity, 92.31%; specificity, 77.44%; and negative predictive value (NPV), 99.34%. The laboratory model developed using age, high-sensitivity C-reactive protein, peripheral capillary oxygen saturation, neutrophil and lymphocyte count, d-dimer, aspartate aminotransferase, and glomerular filtration rate had a significantly stronger discriminatory power than the clinical model (P = .0157), with AUC, 0.98 (95% CI, .92–.99); threshold, −2.998; sensitivity, 100.00%; specificity, 92.82%; and NPV, 100.00%. In the subsequent validation cohort (N = 44), the AUC (95% CI) was 0.83 (.68–.93) and 0.88 (.75–.96) for the clinical model and laboratory model, respectively. Conclusions: We developed 2 predictive models for the in-hospital mortality of patients with COVID-19 in Wuhan that were validated in patients from another center.

Property Rental Price Prediction Using the Extreme Gradient Boosting Algorithm

IJIIS: International Journal of Informatics and Information Systems ◽

10.47738/ijiis.v3i2.65 ◽

2020 ◽

Vol 3 (2) ◽

pp. 54-59

Author(s):

Marco Febriadi Kokasih ◽

Adi Suryaputra Paramita

Keyword(s):

Prediction Model ◽

Prediction Models ◽

Gradient Boosting ◽

Fair Price ◽

Price Prediction ◽

Online Marketplace ◽

Extreme Gradient Boosting ◽

Property Owners ◽

Boosting Algorithm

Online marketplaces for property rental, such as Airbnb, are growing. Many property owners have begun renting out their properties to meet this demand. Determining a price that is fair to both property owners and tourists is a challenge. Therefore, this study aims to build software that creates a prediction model for property rental prices. The variables used in this study are listing features, neighbourhood, reviews, date and host information. The prediction model is built from the dataset supplied by the user, processed with the Extreme Gradient Boosting algorithm, and then stored in the system. The study is expected to yield prediction models of property rental prices that property owners and tourists can consult when deciding whether to rent a property. In conclusion, the Extreme Gradient Boosting algorithm is able to predict property rental prices with an average RMSE of 10.86, or 13.30%.

Machine Learning-Based Three-Month Outcome Prediction in Acute Ischemic Stroke: A Single Cerebrovascular-Specialty Hospital Study in South Korea

Diagnostics ◽

10.3390/diagnostics11101909 ◽

2021 ◽

Vol 11 (10) ◽

pp. 1909

Author(s):

Dougho Park ◽

Eunhwan Jeong ◽

Haejong Kim ◽

Hae Wook Pyun ◽

Haemin Kim ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Ischemic Stroke ◽

Acute Ischemic Stroke ◽

Functional Outcome ◽

Outcome Prediction ◽

Prediction Models ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting

Background: Functional outcomes after acute ischemic stroke are of great concern to patients and their families, as well as to the physicians and surgeons who make the clinical decisions. We developed machine learning (ML)-based functional outcome prediction models in acute ischemic stroke. Methods: This retrospective study used a prospective cohort database. A total of 1066 patients with acute ischemic stroke between January 2019 and March 2021 were included. Variables available at the time of admission, such as demographic factors, stroke-related factors, laboratory findings, and comorbidities, were used. Five ML algorithms were applied to predict a favorable functional outcome (modified Rankin Scale 0 or 1) at 3 months after stroke onset. Results: Regularized logistic regression showed the best performance, with an area under the receiver operating characteristic curve (AUC) of 0.86. Support vector machines achieved the second-highest AUC of 0.85 and the highest F1-score of 0.86, and all ML models applied achieved an AUC > 0.8. The National Institutes of Health Stroke Scale at admission and age were consistently the top two important variables for generalized logistic regression, random forest, and extreme gradient boosting models. Conclusions: ML-based functional outcome prediction models for acute ischemic stroke were validated and proven to be readily applicable and useful.

An explainable XGBoost–based approach towards assessing the risk of cardiovascular disease in patients with Type 2 Diabetes Mellitus

10.36227/techrxiv.12942299.v1 ◽

2020 ◽

Author(s):

Maria Athanasiou ◽

Konstantina Sfrintzeri ◽

Konstantia Zarkogianni ◽

Anastasia Thanopoulou ◽

Konstantina S. Nikita

Keyword(s):

Diabetes Mellitus ◽

Cardiovascular Disease ◽

Risk Prediction ◽

Prediction Models ◽

Therapeutic Interventions ◽

Gradient Boosting ◽

CVD Risk ◽

Medical Visits ◽

Extreme Gradient Boosting

Cardiovascular Disease (CVD) is an important cause of disability and death among individuals with Diabetes Mellitus (DM). International clinical guidelines for the management of Type 2 DM (T2DM) are founded on primary and secondary prevention and favor the evaluation of CVD-related risk factors towards appropriate treatment initiation. CVD risk prediction models can provide valuable tools for optimizing the frequency of medical visits and performing timely preventive and therapeutic interventions against CVD events. The integration of explainability modalities in these models can enhance human understanding of the reasoning process, maximize transparency and strengthen trust towards the models’ adoption in clinical practice. The aim of the present study is to develop and evaluate an explainable personalized risk prediction model for fatal or non-fatal CVD incidence in T2DM individuals. An explainable approach based on eXtreme Gradient Boosting (XGBoost) and the Tree SHAP (SHapley Additive exPlanations) method is deployed for the calculation of the 5-year CVD risk and the generation of individual explanations of the model’s decisions. Data from the 5-year follow-up of 560 patients with T2DM are used for development and evaluation purposes. The obtained results (AUC = 71.13%) indicate the potential of the proposed approach to handle the unbalanced nature of the dataset used, while providing clinically meaningful insights about the ensemble model’s decision process.

Prediction of Masked Hypertension and Masked Uncontrolled Hypertension Using Machine Learning

Frontiers in Cardiovascular Medicine ◽

10.3389/fcvm.2021.778306 ◽

2021 ◽

Vol 8 ◽

Author(s):

Ming-Hui Hung ◽

Ling-Chieh Shih ◽

Yu-Ching Wang ◽

Hsin-Bang Leu ◽

Po-Hsun Huang ◽

...

Keyword(s):

Machine Learning ◽

Clinical Characteristics ◽

Prediction Models ◽

External Validation ◽

Uncontrolled Hypertension ◽

Gradient Boosting ◽

Masked Hypertension ◽

Internal Validation ◽

Hypertensive Patients ◽

Extreme Gradient Boosting

Objective: This study aimed to develop machine learning-based prediction models to predict masked hypertension and masked uncontrolled hypertension using the clinical characteristics of patients at a single outpatient visit. Methods: Data were derived from two cohorts in Taiwan. The first cohort included 970 hypertensive patients recruited from six medical centers between 2004 and 2005, which were split into a training set (n = 679), a validation set (n = 146), and a test set (n = 145) for model development and internal validation. The second cohort included 416 hypertensive patients recruited from a single medical center between 2012 and 2020, which was used for external validation. We used 33 clinical characteristics as candidate variables to develop models based on logistic regression (LR), random forest (RF), eXtreme Gradient Boosting (XGBoost), and artificial neural network (ANN). Results: The four models featured high sensitivity and high negative predictive value (NPV) in internal validation (sensitivity = 0.914–1.000; NPV = 0.853–1.000) and external validation (sensitivity = 0.950–1.000; NPV = 0.875–1.000). The RF, XGBoost, and ANN models showed much higher area under the receiver operating characteristic curve (AUC) (0.799–0.851 in internal validation, 0.672–0.837 in external validation) than the LR model. Among the models, the RF model, composed of 6 predictor variables, had the best overall performance in both internal and external validation (AUC = 0.851 and 0.837; sensitivity = 1.000 and 1.000; specificity = 0.609 and 0.580; NPV = 1.000 and 1.000; accuracy = 0.766 and 0.721, respectively). Conclusion: An effective machine learning-based predictive model that requires data from a single clinic visit may help to identify masked hypertension and masked uncontrolled hypertension.

Predicting Hard Rock Pillar Stability Using GBDT, XGBoost, and LightGBM Algorithms

Mathematics ◽

10.3390/math8050765 ◽

2020 ◽

Vol 8 (5) ◽

pp. 765 ◽

Cited By ~ 6

Author(s):

Weizhang Liang ◽

Suizhi Luo ◽

Guoyan Zhao ◽

Hao Wu

Keyword(s):

Large Scale ◽

Prediction Models ◽

Hard Rock ◽

Gradient Boosting ◽

Pillar Stability ◽

Rock Pillar ◽

Light Gradient ◽

Gradient Boosting Machine ◽

Extreme Gradient Boosting ◽

Hard Rock Mines

Predicting pillar stability is a vital task in hard rock mines, as pillar instability can cause large-scale collapse hazards. However, it is challenging because pillar stability is affected by many factors. With the accumulation of pillar stability cases, machine learning (ML) has shown great potential to predict pillar stability. This study aims to predict hard rock pillar stability using gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM) algorithms. First, 236 cases with five indicators were collected from seven hard rock mines. Afterwards, the hyperparameters of each model were tuned using a five-fold cross validation (CV) approach. Based on the optimal hyperparameter configuration, prediction models were constructed using the training set (70% of the data). Finally, the test set (30% of the data) was used to evaluate the performance of each model. The precision, recall, and F1 indexes were used to analyze the prediction results for each level, and the accuracy and the macro averages of these indexes were used to assess the overall prediction performance. Based on a sensitivity analysis of the indicators, the relative importance of each indicator was obtained. In addition, the safety factor approach and other ML algorithms were adopted as comparisons. The results showed that the GBDT, XGBoost, and LightGBM algorithms achieved a better comprehensive performance, with prediction accuracies of 0.8310, 0.8310, and 0.8169, respectively. The average pillar stress and the ratio of pillar width to pillar height had the most important influence on the prediction results. The proposed methodology can provide a reliable reference for pillar design and stability risk management.
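
The evaluation scheme described above (per-level precision, recall and F1, then accuracy and macro averages) can be illustrated with a small standard-library sketch. The level names and the six toy predictions below are hypothetical and are not data from the study; the function names are likewise illustrative.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every observed label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall and F1 over the classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels, for illustration only.
y_true = ["stable", "stable", "unstable", "failed", "failed", "unstable"]
y_pred = ["stable", "unstable", "unstable", "failed", "stable", "unstable"]

scores = per_class_scores(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_average(scores)
acc = accuracy(y_true, y_pred)
print(scores, (macro_p, macro_r, macro_f1), acc)
```

Macro averaging weights every level equally regardless of how many cases it has, which is presumably why it is reported alongside accuracy when the levels are unevenly represented.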

sefOri: selecting the best-engineered sequence features to predict DNA replication origins

Bioinformatics ◽

10.1093/bioinformatics/btz506 ◽

2019 ◽

Vol 36 (1) ◽

pp. 49-55 ◽

Cited By ~ 4

Author(s):

Chenwei Lou ◽

Jian Zhao ◽

Ruoyao Shi ◽

Qian Wang ◽

Wenyang Zhou ◽

...

Keyword(s):

DNA Replication ◽

Prediction Models ◽

Selection Procedure ◽

Classification Model ◽

Supplementary Information ◽

Replication Origins ◽

Replication Process ◽

Yeast Saccharomyces Cerevisiae ◽

DNA Replication Origin ◽

DNA Replication Origins

Abstract. Motivation: Cell division starts with replication of the double-stranded DNA, and the DNA replication process needs to be precisely regulated both spatially and temporally. DNA replication starts from the DNA replication origins. A few successful prediction models have been built on the assumption that DNA replication origin regions have sequence-level features, such as physicochemical properties, that differ significantly from those of other DNA regions. Results: This study proposed a feature selection procedure to further refine the classification model of DNA replication origins. The experimental data demonstrated that an improvement of up to 26% in prediction accuracy may be achieved on the yeast Saccharomyces cerevisiae. Moreover, the prediction accuracies of the DNA replication origins were improved for all four yeast genomes investigated in this study. Availability and implementation: The software sefOri version 1.0 was available at http://www.healthinformaticslab.org/supp/resources.php. An online server was also provided for the convenience of users, and its web link may be found on the above-mentioned web page. Supplementary information: Supplementary data are available at Bioinformatics online.

Sovereign Debt and Currency Crises Prediction Models Using Machine Learning Techniques

Symmetry ◽

10.3390/sym13040652 ◽

2021 ◽

Vol 13 (4) ◽

pp. 652

Author(s):

David Alaminos ◽

José Ignacio Peláez ◽

M. Belén Salas ◽

Manuel A. Fernández-Gámez

Keyword(s):

Deep Learning ◽

Decision Trees ◽

Financial Crises ◽

Sovereign Debt ◽

Currency Crises ◽

Prediction Models ◽

Debt Crisis ◽

Gradient Boosting ◽

Computational Techniques ◽

Extreme Gradient Boosting

Sovereign debt and currencies play an increasingly influential role in the development of any country, given the need to obtain financing and establish international relations. A recurring theme in the literature on financial crises has been the prediction of sovereign debt and currency crises, due to their extreme importance in international economic activity. Nevertheless, existing models are limited in accuracy, and the literature calls for more investigation of the subject and lacks geographic diversity in the samples used. This article presents new models for the prediction of sovereign debt and currency crises, using various computational techniques that increase their precision. These models are also tested on a wide global sample covering the main geographical world zones, namely Africa and the Middle East, Latin America, Asia, and Europe, as well as on the global sample as a whole. Our models demonstrate the superiority of computational techniques over statistical ones in terms of precision. The best methods for predicting sovereign debt crises are fuzzy decision trees, AdaBoost, extreme gradient boosting, and deep learning neural decision trees; for forecasting currency crises, they are deep learning neural decision trees, extreme gradient boosting, random forests, and deep belief networks. Our research has a large and potentially significant impact on the adequacy of countries' macroeconomic policy against the risks arising from financial crises, and provides instruments that make it possible to improve the balance of countries' finances.
