Nonparametric Subgroup Identification by PRIM and CART: A Simulation and Application Study

2017 ◽  
Vol 2017 ◽  
pp. 1-17 ◽  
Author(s):  
Armin Ott ◽  
Alexander Hapfelmeier

Two nonparametric methods for the identification of subgroups with outstanding outcome values are described and compared to each other in a simulation study and an application to clinical data. The Patient Rule Induction Method (PRIM) searches for box-shaped areas in the given data which exceed a minimal size and average outcome. This is achieved via a combination of iterative peeling and pasting steps, where small fractions of the data are removed or added to the current box. As an alternative, Classification and Regression Trees (CART) prediction models perform sequential binary splits of the data to produce subsets which can be interpreted as subgroups of heterogeneous outcome. PRIM and CART were compared in a simulation study to investigate their strengths and weaknesses under various data settings, taking different performance measures into account. PRIM was shown to be superior in rather complex settings such as those with few observations, a smaller signal-to-noise ratio, and more than one subgroup. CART showed the best performance in simpler situations. A practical application of the two methods was illustrated using a clinical data set. For this application, both methods produced similar results but the higher amount of user involvement of PRIM became apparent. PRIM can be flexibly tuned by the user, whereas CART, although simpler to implement, is rather static.
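The core of PRIM's peeling phase can be sketched for a single predictor. The following is our own hand-rolled illustration, not the authors' implementation (the function name, peel fraction `alpha`, and `min_support` default are all ours, and the pasting phase is omitted): at each step the box sheds the small fraction of points, from whichever edge, that most raises the mean outcome inside the box, stopping once the box would fall below the minimum support.

```python
def prim_peel(x, y, alpha=0.1, min_support=0.2):
    """Greedy top-down peeling on one predictor: repeatedly trim the
    alpha-fraction of points at whichever box edge most raises the
    mean outcome, while the box keeps at least min_support of the data."""
    n = len(x)
    pts = sorted(zip(x, y))            # order points by predictor value
    lo, hi = 0, n                      # current box as index range [lo, hi)
    step = max(1, int(alpha * n))      # points removed per peeling step
    while hi - lo - step >= min_support * n:
        current = sum(yv for _, yv in pts[lo:hi]) / (hi - lo)
        mean_up = sum(yv for _, yv in pts[lo + step:hi]) / (hi - lo - step)
        mean_dn = sum(yv for _, yv in pts[lo:hi - step]) / (hi - lo - step)
        if max(mean_up, mean_dn) <= current:
            break                      # no peel improves the box mean
        if mean_up >= mean_dn:
            lo += step                 # peel away the lowest-x points
        else:
            hi -= step                 # peel away the highest-x points
    box = pts[lo:hi]
    return (box[0][0], box[-1][0]), sum(yv for _, yv in box) / len(box)
```

In the full method this one-dimensional peel is applied across all predictors at once, and pasting steps then re-add small fractions of data at the box edges.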

2019 ◽  
Vol 21 (9) ◽  
pp. 662-669 ◽  
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. By finding non-intuitive regularities in high-dimensional data sets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM) were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best one among these methods with R2=0.84, MSE=0.55 for the training set and R2=0.83, MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
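Of the four methods compared, KNN regression is the simplest to illustrate. Below is a minimal plain-Python sketch of KNN prediction together with the R2 statistic used to score the models; the function names are ours and this is not the authors' pipeline, which would normally rely on an ML library:

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict the outcome (e.g. a Ki-derived value) for `query` as the
    mean outcome of its k nearest training compounds in descriptor space."""
    nearest = sorted(
        (math.dist(row, query), yv) for row, yv in zip(train_X, train_y)
    )[:k]
    return sum(yv for _, yv in nearest) / k

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ybar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - ybar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

With 6554 raw descriptors per compound, the descriptor selection step matters as much as the regressor: distance-based methods like KNN degrade quickly in very high-dimensional spaces.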


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Philipp Rentzsch ◽  
Max Schubach ◽  
Jay Shendure ◽  
Martin Kircher

Abstract Background Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. Methods It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. Results We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. Conclusions While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.


2015 ◽  
Vol 17 (5) ◽  
pp. 719-732
Author(s):  
Dulakshi Santhusitha Kumari Karunasingha ◽  
Shie-Yui Liong

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The main purpose of the extracted subset of data is to use it to build prediction models (of the form of approximating functional relationships) instead of using the entire large data set. Such smaller subsets of data are often required in exploratory analysis stages of studies that involve resource consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method, which requires only a single parameter to be specified, yet it is shown to be as effective as the SCM. A method to find suitable values for the parameter is also proposed. Due to having only a single parameter, using the proposed clustering method is shown to be orders of magnitude more efficient than using SCM. The effectiveness of the proposed method is demonstrated on phase space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied for data extraction are identified, and the proposed method is shown to be a solution for them.
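The abstract does not spell out the proposed method, so the following is only a generic sketch of what a single-parameter representative-subset extractor can look like (a greedy coverage pass with one radius parameter; the function name and scheme are our assumption, not the paper's algorithm):

```python
import math

def representative_subset(data, radius):
    """Greedy one-parameter subset extraction: keep a point only if it
    lies farther than `radius` from every point already kept, so the
    kept points cover the data set at that resolution."""
    subset = []
    for point in data:
        if all(math.dist(point, kept) > radius for kept in subset):
            subset.append(point)
    return subset
```

A single pass like this runs in O(n * |subset|) distance checks, which illustrates why a one-parameter scheme can be far cheaper than SCM's density computations over all point pairs.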


2015 ◽  
Vol 26 (6) ◽  
pp. 2586-2602 ◽  
Author(s):  
Irantzu Barrio ◽  
Inmaculada Arostegui ◽  
María-Xosé Rodríguez-Álvarez ◽  
José-María Quintana

When developing prediction models for application in clinical practice, health practitioners usually categorise clinical variables that are continuous in nature. Although categorisation is not regarded as advisable from a statistical point of view, due to loss of information and power, it is a common practice in medical research. Consequently, providing researchers with a useful and valid categorisation method could be a relevant issue when developing prediction models. Without recommending categorisation of continuous predictors, our aim is to propose a valid way to do it whenever it is considered necessary by clinical researchers. This paper focuses on categorising a continuous predictor within a logistic regression model, in such a way that the best discriminative ability is obtained in terms of the highest area under the receiver operating characteristic curve (AUC). The proposed methodology is validated when the optimal cut points’ location is known in theory or in practice. In addition, the proposed method is applied to a real data-set of patients with an exacerbation of chronic obstructive pulmonary disease, in the context of the IRYSS-COPD study where a clinical prediction rule for severe evolution was being developed. The clinical variable PCO2 was categorised in a univariable and a multivariable setting.
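For a single cut point the search criterion is easy to state: a dichotomised predictor has AUC equal to (sensitivity + specificity) / 2, so the optimal cut maximises that quantity over candidate locations. The sketch below is our own minimal illustration of that idea, not the authors' algorithm (which handles multiple cut points inside a fitted logistic model):

```python
def best_cut_point(x, y, candidates):
    """Dichotomise continuous predictor x at the candidate cut that
    maximises the AUC of the resulting binary predictor; for a binary
    marker, AUC = (sensitivity + specificity) / 2."""
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    neg = [xi for xi, yi in zip(x, y) if yi == 0]

    def auc_at(cut):
        sens = sum(xi > cut for xi in pos) / len(pos)
        spec = sum(xi <= cut for xi in neg) / len(neg)
        return (sens + spec) / 2

    return max(candidates, key=auc_at)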


Author(s):  
Guizhou Hu ◽  
Martin M. Root

Background No methodology is currently available to allow the combining of individual risk factor information derived from different longitudinal studies for a chronic disease in a multivariate fashion. This paper introduces such a methodology, named Synthesis Analysis, which is essentially a multivariate meta-analytic technique. Design The construction and validation of statistical models using available data sets. Methods and results Two analyses are presented. (1) With the same data, Synthesis Analysis produced a similar prediction model to the conventional regression approach when using the same risk variables. Synthesis Analysis produced better prediction models when additional risk variables were added. (2) A four-variable empirical logistic model for death from coronary heart disease was developed with data from the Framingham Heart Study. A synthesized prediction model with five new variables added to this empirical model was developed using Synthesis Analysis and literature information. This model was then compared with the four-variable empirical model using the first National Health and Nutrition Examination Survey (NHANES I) Epidemiologic Follow-up Study data set. The synthesized model had significantly improved predictive power (χ2 = 43.8, P < 0.00001). Conclusions Synthesis Analysis provides a new means of developing complex disease predictive models from the medical literature.


2018 ◽  
Vol 615 ◽  
pp. A145 ◽  
Author(s):  
M. Mol Lous ◽  
E. Weenk ◽  
M. A. Kenworthy ◽  
K. Zwintz ◽  
R. Kuschnig

Context. Transiting exoplanets provide an opportunity for the characterization of their atmospheres, and finding the brightest star in the sky with a transiting planet enables high signal-to-noise ratio observations. The Kepler satellite has detected over 365 multiple transiting exoplanet systems, a large fraction of which have nearly coplanar orbits. If one planet is seen to transit the star, then it is likely that other planets in the system will transit the star too. The bright (V = 3.86) star β Pictoris is a nearby young star with a debris disk and gas giant exoplanet, β Pictoris b, in a multi-decade orbit around it. Both the planet’s orbit and disk are almost edge-on to our line of sight. Aims. We carry out a search for any transiting planets in the β Pictoris system with orbits of less than 30 days that are coplanar with the planet β Pictoris b. Methods. We search for a planetary transit using data from the BRITE-Constellation nanosatellite BRITE-Heweliusz, analyzing the photometry using the Box-Fitting Least Squares Algorithm (BLS). The sensitivity of the method is verified by injection of artificial planetary transit signals using the Bad-Ass Transit Model cAlculatioN (BATMAN) code. Results. No planet was found in the BRITE-Constellation data set. We rule out planets larger than 0.6 RJ for periods of less than 5 days, larger than 0.75 RJ for periods of less than 10 days, and larger than 1.05 RJ for periods of less than 20 days.
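The essence of a box-fitting search is phase-folding the light curve at trial periods and looking for a shallow, box-shaped dip. The toy sketch below illustrates only that core idea and is not the BLS implementation used in the paper (the function name, binning scheme, and depth statistic are ours; real BLS also fits the transit duration and phase):

```python
def box_search(times, flux, trial_periods, nbins=50):
    """Toy box-fitting search: phase-fold the light curve at each trial
    period, bin the folded fluxes, and score the period by the depth of
    the faintest phase bin relative to the overall mean flux."""
    overall = sum(flux) / len(flux)
    best_period, best_depth = None, 0.0
    for period in trial_periods:
        bins = [[] for _ in range(nbins)]
        for t, f in zip(times, flux):
            phase = (t % period) / period
            bins[min(int(phase * nbins), nbins - 1)].append(f)
        depth = max(overall - sum(b) / len(b) for b in bins if b)
        if depth > best_depth:
            best_period, best_depth = period, depth
    return best_period, best_depth
```

At the true period the transit points pile into the same phase bins, so the dip survives the folding; at wrong trial periods they smear across phase and the depth statistic collapses toward the noise level.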


PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e10681
Author(s):  
Jake Dickinson ◽  
Marcel de Matas ◽  
Paul A. Dickinson ◽  
Hitesh B. Mistry

Purpose To assess whether a model-based analysis increased statistical power over an analysis of final day volumes and provide insights into more efficient patient derived xenograft (PDX) study designs. Methods Tumour xenograft time-series data was extracted from a public PDX drug treatment database. For all 2-arm studies the percent tumour growth inhibition (TGI) at day 14, 21 and 28 was calculated. Treatment effect was analysed using an un-paired, two-tailed t-test (empirical) and a model-based analysis, likelihood ratio-test (LRT). In addition, a simulation study was performed to assess the difference in power between the two data-analysis approaches for PDX or standard cell-line derived xenografts (CDX). Results The model-based analysis had greater statistical power than the empirical approach within the PDX data-set. The model-based approach was able to detect TGI values as low as 25% whereas the empirical approach required at least 50% TGI. The simulation study confirmed the findings and highlighted that CDX studies require fewer animals than PDX studies which show the equivalent level of TGI. Conclusions The study conducted adds to the growing literature which has shown that a model-based analysis of xenograft data improves statistical power over the common empirical approach. The analysis conducted showed that a model-based approach, based on the first mathematical model of tumour growth, was able to detect smaller size of effect compared to the empirical approach which is common of such studies. A model-based analysis should allow studies to reduce animal use and experiment length providing effective insights into compound anti-tumour activity.
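The contrast between the two analyses can be sketched with a deliberately simplified model-based test: fit exponential growth to log-volumes and compare a shared growth curve (no treatment effect) against arm-specific curves with a likelihood-ratio statistic. This is our own illustration under an assumed exponential-growth model, not the tumour growth model or fitting machinery used in the paper; all names are hypothetical:

```python
import math

def fit_rss(ts, ys):
    """Ordinary least squares fit of y = a + b*t; return the residual
    sum of squares of the fitted line."""
    n = len(ts)
    tbar, ybar = sum(ts) / n, sum(ys) / n
    b = (sum((t - tbar) * (yv - ybar) for t, yv in zip(ts, ys))
         / sum((t - tbar) ** 2 for t in ts))
    a = ybar - b * tbar
    return sum((yv - (a + b * t)) ** 2 for t, yv in zip(ts, ys))

def growth_lrt_stat(t_ctrl, v_ctrl, t_trt, v_trt):
    """Likelihood-ratio statistic contrasting one shared exponential
    growth curve (null) with arm-specific curves, fitted on log-volumes;
    under the null it is roughly chi-squared with 2 degrees of freedom
    here (intercept and growth rate)."""
    logs_c = [math.log(v) for v in v_ctrl]
    logs_t = [math.log(v) for v in v_trt]
    rss1 = fit_rss(t_ctrl, logs_c) + fit_rss(t_trt, logs_t)      # separate curves
    rss0 = fit_rss(list(t_ctrl) + list(t_trt), logs_c + logs_t)  # shared curve
    n = len(logs_c) + len(logs_t)
    return n * math.log(rss0 / rss1)
```

Because this test pools the whole time course rather than comparing final-day volumes alone, it needs a smaller divergence between the arms to reach significance, which is the mechanism behind the power gain reported above.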

