ShapeGTB: The role of local DNA shape in prioritization of functional variants in human promoters with machine learning

Author(s):  
Maja Malkowska ◽  
Julian Zubek ◽  
Dariusz Plewczynski ◽  
Lucjan S Wyrwicz

Motivation: The identification of functional sequence variations in regulatory DNA regions is one of the major challenges of modern genetics. Here, we report results of a combined multifactor analysis of properties characterizing functional sequence variants located in promoter regions of genes. Results: We demonstrate that the GC-content of local sequence fragments and local DNA shape features play a significant role in the prioritization of functional variants and outscore features related to histone modifications, transcription factor binding sites, or evolutionary conservation descriptors. Those observations allowed us to build ShapeGTB, a specialized machine learning classifier that identifies functional SNPs within promoter regions. We compared our method with more general tools that predict the pathogenicity of all non-coding variants. ShapeGTB outperformed them by a wide margin (AUC ROC 0.97 vs. 0.57-0.59). On the external validation set based on the ClinVar database it displayed only slightly worse performance (AUC ROC 0.92 vs. 0.74-0.81). Such results suggest unique characteristics of mutations located within promoter regions and are a promising signal for the development of more accurate variant prioritization tools in the future. Availability and implementation: The datasets and source code are publicly available at: https://github.com/zubekj/ShapeGTB.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5742 ◽  
Author(s):  
Maja Malkowska ◽  
Julian Zubek ◽  
Dariusz Plewczynski ◽  
Lucjan S. Wyrwicz

Motivation: The identification of functional sequence variations in regulatory DNA regions is one of the major challenges of modern genetics. Here, we report results of a combined multifactor analysis of properties characterizing functional sequence variants located in promoter regions of genes. Results: We demonstrate that the GC-content of local sequence fragments and local DNA shape features play a significant role in the prioritization of functional variants and outscore features related to histone modifications, transcription factor binding sites, or evolutionary conservation descriptors. Those observations allowed us to build ShapeGTB, a specialized machine learning classifier that identifies functional single nucleotide polymorphisms within promoter regions. We compared our method with more general tools that predict the pathogenicity of all non-coding variants. ShapeGTB outperformed them by a wide margin (average precision 0.93 vs. 0.47–0.55). On the external validation set based on the ClinVar database it displayed worse performance but was still competitive with other methods (average precision 0.47 vs. 0.23–0.42). Such results suggest unique characteristics of mutations located within promoter regions and are a promising signal for the development of more accurate variant prioritization tools in the future.
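
The ShapeGTB abstracts report two different ranking metrics, AUC ROC and average precision. The following is a minimal sketch of how the two can be computed from labels and classifier scores; it is not the authors' evaluation code. The function names and the toy data are hypothetical, and the average-precision variant shown here ignores tied scores.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy, invented data: 2 functional variants among 10 candidates.
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

print(roc_auc(toy_labels, toy_scores))
print(average_precision(toy_labels, toy_scores))
```

On imbalanced data a single highly ranked negative lowers average precision more than it lowers AUC ROC, which is one reason the two metrics can give different impressions of the same classifier.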


Author(s):  
Tyler F. Rooks ◽  
Andrea S. Dargie ◽  
Valeta Carol Chancey

Abstract A shortcoming of using environmental sensors for the surveillance of potentially concussive events is substantial uncertainty regarding whether the event was caused by head acceleration (“head impacts”) or sensor motion (with no head acceleration). The goal of the present study is to develop a machine learning model to classify environmental sensor data obtained in the field and evaluate the performance of the model against the performance of the proprietary classification algorithm used by the environmental sensor. Data were collected from Soldiers attending sparring sessions conducted under a U.S. Army Combatives School course. Data from one sparring session were used to train a decision tree classification algorithm to identify good and bad signals. Data from the remaining sparring sessions were kept as an external validation set. The performance of the proprietary algorithm used by the sensor was also compared to the trained algorithm performance. The trained decision tree was able to correctly classify 95% of events for internal cross-validation and 88% of events for the external validation set. Comparatively, the proprietary algorithm was only able to correctly classify 61% of the events. In general, the trained algorithm was better able to predict when a signal was good or bad compared to the proprietary algorithm. The present study shows it is possible to train a decision tree algorithm using environmental sensor data collected in the field.


Cells ◽  
2019 ◽  
Vol 8 (10) ◽  
pp. 1286 ◽  
Author(s):  
Onat Kadioglu ◽  
Thomas Efferth

P-glycoprotein (P-gp) is an important determinant of multidrug resistance (MDR) because its overexpression is associated with increased efflux of various established chemotherapy drugs in many clinically resistant and refractory tumors. This leads to insufficient therapeutic targeting of tumor populations, representing a major drawback of cancer chemotherapy. Therefore, P-gp is a target for pharmacological inhibitors to overcome MDR. In the present study, we utilized machine learning strategies to establish a model for P-gp modulators to predict whether a given compound would behave as a substrate or an inhibitor of P-gp. Random forest feature selection algorithm-based leave-one-out random sampling was used. Testing the model with an external validation set revealed high performance scores. A P-gp modulator list of compounds from the ChEMBL database was used to test the performance, and predictions from both substrate and inhibitor classes were selected for the last step of validation with molecular docking. Predicted substrates revealed docking poses similar to that of doxorubicin, and predicted inhibitors revealed docking poses similar to that of the known P-gp inhibitor elacridar, implying the validity of the predictions. We conclude that the machine-learning approach introduced in this investigation may serve as a tool for the rapid detection of P-gp substrates and inhibitors in large chemical libraries.


Stroke ◽  
2021 ◽  
Vol 52 (Suppl_1) ◽  
Author(s):  
Lingling Ding ◽  
Zixiao Li ◽  
Yongjun Wang

Objective: We aimed to develop and validate a machine learning-based prediction model that could assess the risk of stroke-associated pneumonia (SAP) for individual patients with acute ischemic stroke (AIS). Methods: A machine-learning model incorporating A2DS2 scores and clinical features (AN-ADCS2) was developed to predict the risk of SAP in patients with AIS. Two independent datasets were used for model derivation and external validation. The area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) were estimated. Further analysis evaluated thresholds from the training set that identified patients as low-risk, intermediate-risk and high-risk, and performance at these thresholds was compared in the external validation set. Results: The AN-ADCS2 model achieved favorable performance with a high AUC of 0.892 (95% confidence interval [CI] 0.885-0.898) in the test set and similar performance in the external validation set (AUC 0.813 [95% CI 0.812-0.814]). The AN-ADCS2 threshold identifying low-risk was 0.03, with an NPV of 97.6% (97.2-97.9%) and a sensitivity of 93.5% (92.5-94.5%). The AN-ADCS2 threshold identifying high-risk was 0.65, with a PPV of 94.7% (93.9-95.6%) and a specificity of 99.5% (99.5-99.6%). The AN-ADCS2 model performed better than the A2DS2 score (AUC 0.739, 95% CI [0.720-0.754]). Having a high risk of SAP classified by the AN-ADCS2 was associated with unfavorable outcomes of mortality and in-hospital stroke recurrence. Conclusions: Using machine learning, the AN-ADCS2 model provides an individualized risk prediction of SAP, which can be used as an indicator of clinical prognosis for patients with AIS.
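
The threshold-based evaluation described in this abstract can be illustrated with a short sketch. It assumes a model that outputs a risk between 0 and 1 and uses the two cut-offs quoted above (0.03 and 0.65). The function names, the choice of which side of each cut-off is inclusive, and the toy data are assumptions made for illustration, not details taken from the study.

```python
def binary_metrics(labels, risks, threshold):
    """Sensitivity, specificity, PPV and NPV when risk >= threshold is 'positive'."""
    tp = sum(1 for y, r in zip(labels, risks) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(labels, risks) if r >= threshold and y == 0)
    fn = sum(1 for y, r in zip(labels, risks) if r < threshold and y == 1)
    tn = sum(1 for y, r in zip(labels, risks) if r < threshold and y == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def risk_band(risk, low_cutoff=0.03, high_cutoff=0.65):
    """Assign a risk band; boundary handling here is an assumption."""
    if risk < low_cutoff:
        return "low"
    if risk >= high_cutoff:
        return "high"
    return "intermediate"


# Toy, invented data: 1 = developed pneumonia, 0 = did not.
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
toy_risks = [0.01, 0.02, 0.10, 0.20, 0.40, 0.70, 0.80, 0.90]

print(binary_metrics(toy_labels, toy_risks, threshold=0.65))
print([risk_band(r) for r in toy_risks])
```

A low cut-off is typically judged by NPV and sensitivity (few missed cases among patients called low-risk), and a high cut-off by PPV and specificity, which matches the pairs of metrics the abstract reports for each threshold.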


2021 ◽  
Vol 9 ◽  
Author(s):  
Yang Wu ◽  
Haofei Hu ◽  
Jinlin Cai ◽  
Runtian Chen ◽  
Xin Zuo ◽  
...  

Purpose: We aimed to establish and validate a risk assessment system that combines demographic and clinical variables to predict the 3-year risk of incident diabetes in Chinese adults. Methods: A 3-year cohort study was performed on 15,928 Chinese adults without diabetes at baseline. All participants were randomly divided into a training set (n = 7,940) and a validation set (n = 7,988). XGBoost, a machine learning technique, was used to select the most important variables from the candidate variables. We then established a stepwise model based on the predictors chosen by the XGBoost model. The area under the receiver operating characteristic curve (AUC), decision curve analysis and calibration analysis were used to assess the discrimination, clinical use and calibration of the model, respectively. External validation was performed on a cohort of 11,113 Japanese participants. Results: In the training and validation sets, 148 and 145 incident diabetes cases occurred, respectively. XGBoost selected the 10 most important variables from 15 candidate variables. Fasting plasma glucose (FPG), body mass index (BMI) and age were the 3 most important variables. We further established a stepwise model and a prediction nomogram. The AUCs of the stepwise model were 0.933 and 0.910 in the training and validation sets, respectively. The Hosmer-Lemeshow test showed no significant difference between the predicted and observed diabetes risk (p = 0.068 for the training set, p = 0.165 for the validation set). Decision curve analysis supported the clinical usefulness of the stepwise model across a wide range of threshold probabilities. There were almost no interactions between these predictors (most P-values for interaction >0.05). Furthermore, the AUC for the external validation set was 0.830, and the Hosmer-Lemeshow test for the external validation set showed no statistically significant difference between the predicted diabetes risk and observed diabetes risk (P = 0.824). Conclusion: We established and validated a risk assessment system for characterizing the 3-year risk of incident diabetes.
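
The calibration check mentioned in this abstract, the Hosmer-Lemeshow test, compares observed and expected event counts within groups of similar predicted risk. Below is a minimal sketch of the test statistic only. The function name, the equal-size grouping rule and the toy data are assumptions; the study's own grouping is not described in the abstract. Converting the statistic to a p-value requires a chi-square distribution (conventionally with the number of groups minus 2 degrees of freedom), which is left out to keep the sketch dependency-free.

```python
def hosmer_lemeshow_statistic(labels, probs, groups=10):
    """Sum over risk groups of (observed - expected)^2 / (expected * (1 - expected / n))."""
    pairs = sorted(zip(probs, labels))  # order subjects by predicted risk
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        observed = sum(y for _, y in chunk)   # events seen in the group
        expected = sum(p for p, _ in chunk)   # events predicted for the group
        denom = expected * (1 - expected / size)
        if denom > 0:  # skip degenerate groups with predicted risk of 0 or 1
            stat += (observed - expected) ** 2 / denom
    return stat


# Toy, invented data: two risk groups of 10 subjects each.
toy_probs = [0.1] * 10 + [0.5] * 10
well_calibrated = [1] + [0] * 9 + [1] * 5 + [0] * 5   # 1/10 and 5/10 events
miscalibrated = [1, 1] + [0] * 8 + [1] * 5 + [0] * 5  # 2/10 and 5/10 events

print(hosmer_lemeshow_statistic(well_calibrated, toy_probs, groups=2))
print(hosmer_lemeshow_statistic(miscalibrated, toy_probs, groups=2))
```

A small statistic (and hence a large p-value) indicates no detected difference between predicted and observed risk; it does not by itself prove that the model is well calibrated.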


2020 ◽  
Author(s):  
Yongyue Wei ◽  
Jieyu He ◽  
Jiao Chen ◽  
Ying Zhu ◽  
Jiajin Chen ◽  
...  

Abstract Background: Novel coronavirus disease (COVID-19) is an emerging, rapidly evolving situation. At present, the prognosis of severe and critically ill patients has become an important focus of attention. We strove to develop a prognostic prediction model for severe and critically ill COVID-19 patients. Methods: To assess the factors associated with the prognosis of those patients, we retrospectively investigated the clinical and laboratory characteristics of 112 confirmed cases of COVID-19 admitted between 21 January and 6 March 2020 to Huangshi Central Hospital, Huangshi Hospital of Traditional Chinese Medicine, and Daye People’s Hospital. We applied a machine learning method (survival random forest) to select predictors for 28-day survival and took into account the dynamic trajectory of laboratory indicators. Results: Fifteen candidate prognostic features, including 11 baseline measures (platelet count (PLT), urea, creatine kinase (CK), fibrinogen, creatine kinase isoenzyme activity, aspartate aminotransferase (AST), activated partial thromboplastin time (APTT), albumin, standard deviation of erythrocyte distribution width (RBC-SD), neutrophils (%) and red blood cell count (RBC)) and 4 trajectory clusters (changes during hospitalization in the white blood cell count (WBC), PLT large cell ratio (P-LCR), PLT distribution width (PDW) and AST), combined with covariates achieved 100% (95%CI: 99%-100%) AUC and reached 87% (95%CI: 84%-91%) AUC in an external validation set. Conclusions: Taking advantage of the random forest technique and dynamic laboratory measures, we developed a forest model to predict the survival outcome of COVID-19 patients, which achieved 87% AUC in the external validation set. Our online tool will help to facilitate the early recognition of patients with high risk.


JAMIA Open ◽  
2021 ◽  
Vol 4 (3) ◽  
Author(s):  
Omolola I Ogunyemi ◽  
Meghal Gandhi ◽  
Martin Lee ◽  
Senait Teklehaimanot ◽  
Lauren Patty Daskivich ◽  
...  

Abstract Objective: Clinical guidelines recommend annual eye examinations to detect diabetic retinopathy (DR) in patients with diabetes. However, timely DR detection remains a problem in medically underserved and under-resourced settings in the United States. Machine learning that identifies patients with latent/undiagnosed DR could help to address this problem. Materials and Methods: Using electronic health record data from 40 631 unique diabetic patients seen at Los Angeles County Department of Health Services healthcare facilities between January 1, 2015 and December 31, 2017, we compared ten machine learning environments, including five classifier models, for assessing the presence or absence of DR. We also used data from a distinct set of 9300 diabetic patients seen between January 1, 2018 and December 31, 2018 as an external validation set. Results: Following feature subset selection, the classifier with the best AUC on the external validation set was a deep neural network using majority class undersampling, with an AUC of 0.8, a sensitivity of 72.17%, and a specificity of 74.2%. Discussion: A deep neural network produced the best AUCs and sensitivity results on the test set and external validation set. Models are intended to be used to screen guideline-noncompliant diabetic patients in an urban safety-net setting. Conclusion: Machine learning on diabetic patients’ routinely collected clinical data could help clinicians in safety-net settings to identify and target unscreened diabetic patients who potentially have undiagnosed DR.
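
Majority class undersampling, named in this abstract, balances a training set by discarding randomly chosen examples of the more frequent class. The sketch below shows one common form of the idea. The function name, the 1:1 target ratio, the fixed seed and the toy data are assumptions; the abstract does not state which ratio or procedure the study used.

```python
import random


def undersample_majority(rows, labels, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)

    minority_size = min(len(members) for members in by_class.values())

    balanced = []
    for y, members in by_class.items():
        if len(members) > minority_size:
            members = rng.sample(members, minority_size)
        balanced.extend((row, y) for row in members)

    rng.shuffle(balanced)  # avoid blocks of a single class in the output
    return [row for row, _ in balanced], [y for _, y in balanced]


# Toy, invented data: 10 positive and 90 negative patient records.
toy_rows = list(range(100))
toy_labels = [1] * 10 + [0] * 90

balanced_rows, balanced_labels = undersample_majority(toy_rows, toy_labels)
print(balanced_labels.count(1), balanced_labels.count(0))
```

Resampling of this kind is normally applied to the training data only, so that test and external validation sets keep the class balance seen in practice.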


2021 ◽  
Vol 14 (3) ◽  
pp. 1-21
Author(s):  
Roy Abitbol ◽  
Ilan Shimshoni ◽  
Jonathan Ben-Dov

The task of assembling fragments in a puzzle-like manner into a composite picture plays a significant role in the field of archaeology as it supports researchers in their attempt to reconstruct historic artifacts. In this article, we propose a method for matching and assembling pairs of ancient papyrus fragments containing mostly unknown scriptures. Papyrus paper is manufactured from papyrus plants and therefore portrays typical thread patterns resulting from the plant’s stems. The proposed algorithm is founded on the hypothesis that these thread patterns contain unique local attributes such that nearby fragments show similar patterns reflecting the continuations of the threads. We posit that these patterns can be exploited using image processing and machine learning techniques to identify matching fragments. The algorithm and system which we present support the quick and automated classification of matching pairs of papyrus fragments as well as the geometric alignment of the pairs against each other. The algorithm consists of a series of steps and is based on deep-learning and machine learning methods. The first step is to deconstruct the problem of matching fragments into a smaller problem of finding thread continuation matches in local edge areas (squares) between pairs of fragments. This phase is solved using a convolutional neural network ingesting raw images of the edge areas and producing local matching scores. This stage yields very high recall but low precision. Thus, we use these scores to decide whether entire fragment pairs match by establishing an elaborate voting mechanism. We enhance this voting with geometric alignment techniques from which we extract additional spatial information. Finally, we feed all the data collected from these steps into a Random Forest classifier in order to produce a higher order classifier capable of predicting whether a pair of fragments is a match.
Our algorithm was trained on a batch of fragments which was excavated from the Dead Sea caves and is dated circa the 1st century BCE. The algorithm shows excellent results on a validation set which is of a similar origin and conditions. We then ran the algorithm against a real-life set of fragments for which we have no prior knowledge or labeling of matches. This test batch is considered extremely challenging due to its poor condition and the small size of its fragments. Evidently, numerous researchers have tried seeking matches within this batch with very little success. Our algorithm's performance on this batch was sub-optimal, returning a relatively large ratio of false positives. However, the algorithm was quite useful, eliminating 98% of the possible matches and thus reducing the amount of work needed for manual inspection. Indeed, experts who reviewed the results have identified some positive matches as potentially true and referred them for further investigation.

