Combining clinical and molecular data in regression prediction models: insights from a simulation study

2019 ◽  
Vol 21 (6) ◽  
pp. 1904-1919 ◽  
Author(s):  
Riccardo De Bin ◽  
Anne-Laure Boulesteix ◽  
Axel Benner ◽  
Natalia Becker ◽  
Willi Sauerbrei

Abstract Data integration, i.e. the use of different sources of information for data analysis, is becoming one of the most important topics in modern statistics. Especially in, but not limited to, biomedical applications, a relevant issue is the combination of low-dimensional (e.g. clinical data) and high-dimensional (e.g. molecular data such as gene expressions) data sources in a prediction model. Not only the different characteristics of the data, but also the complex correlation structure within and between the two data sources, pose challenging issues. In this paper, we investigate these issues via simulations, providing some useful insight into strategies to combine low- and high-dimensional data in a regression prediction model. In particular, we focus on the effect of the correlation structure on the results, while accounting for the influence of our specific choices in the design of the simulation study.
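One common strategy in this setting is a two-step fit: estimate the low-dimensional clinical covariates unpenalized first, then regress the residuals on the high-dimensional molecular block with a penalty. The sketch below uses simulated Gaussian data and ridge regression as the penalized step (both are illustrative assumptions, not the authors' exact simulation design):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_clin, p_mol = 100, 3, 200

# Hypothetical data: a low-dimensional clinical block and a
# high-dimensional molecular block.
X_clin = rng.normal(size=(n, p_clin))
X_mol = rng.normal(size=(n, p_mol))
y = (X_clin @ np.array([1.0, -0.5, 0.3])
     + 0.8 * X_mol[:, 0]
     + rng.normal(scale=0.5, size=n))

# Step 1: unpenalized least squares on the clinical covariates.
beta_clin, *_ = np.linalg.lstsq(X_clin, y, rcond=None)
resid = y - X_clin @ beta_clin

# Step 2: ridge regression of the residuals on the molecular block
# (closed form; lam is an arbitrary illustrative penalty).
lam = 10.0
beta_mol = np.linalg.solve(X_mol.T @ X_mol + lam * np.eye(p_mol),
                           X_mol.T @ resid)

# Combined prediction uses both blocks.
y_hat = X_clin @ beta_clin + X_mol @ beta_mol
```

This "clinical first" ordering reflects the common view that established clinical predictors should not compete with thousands of molecular features on equal footing.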

2016 ◽  
Vol 34 (2) ◽  
Author(s):  
Peter Filzmoser

Three methods for the identification of multivariate outliers (Rousseeuw and Van Zomeren, 1990; Becker and Gather, 1999; Filzmoser et al., 2005) are compared. They are based on the Mahalanobis distance, which is made resistant to outliers and model deviations by robust estimation of location and covariance. The comparison is made by means of a simulation study. Not only multivariate normally distributed data, but also heavy-tailed and asymmetric distributions are considered. The simulations focus on low-dimensional (p = 5) and high-dimensional (p = 30) data.
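A minimal sketch of the shared idea: compute Mahalanobis distances from robustly estimated location and scatter, and flag points beyond a chi-square cutoff. The crude trimming step below stands in for a proper robust estimator such as the MCD, and the simulated data are assumptions of this example:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
X = rng.normal(size=(200, p))
X[:5] += 6.0  # plant five obvious outliers

# Crude robustification (illustrative, not MCD): take the half of the
# points closest to the coordinate-wise median and estimate the
# covariance from that core only.
center = np.median(X, axis=0)
d0 = np.linalg.norm(X - center, axis=1)
core = X[d0 <= np.median(d0)]
inv_cov = np.linalg.inv(np.cov(core, rowvar=False))

# Squared Mahalanobis distances of all points from the core mean.
diff = X - core.mean(axis=0)
md2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Usual cutoff in this literature: the 0.975 quantile of chi-square
# with p degrees of freedom (12.833 for p = 5). Without a consistency
# correction the trimmed covariance over-flags somewhat.
cutoff = 12.833
outliers = np.where(md2 > cutoff)[0]
```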


Author(s):  
Baoshan Ma ◽  
Ge Yan ◽  
Bingjie Chai ◽  
Xiaoyu Hou

Abstract Motivation Survival analysis using gene expression profiles plays a crucial role in the interpretation of clinical research and the assessment of disease therapy programs. Several prediction models have been developed to explore the relationship between patients’ covariates and survival. However, high-dimensional genomic features limit the prediction performance of the survival model. Thus, an accurate and reliable prediction model is necessary for survival analysis using high-dimensional genomic data. Results In this study, we propose an improved survival prediction model based on the XGBoost framework, called XGBLC, which uses Lasso-Cox to enhance the ability to analyze high-dimensional genomic data. Novel first- and second-order gradient statistics of Lasso-Cox were defined to construct the loss function of XGBLC. We extensively tested our XGBLC algorithm on both simulated and real-world datasets, and estimated the performance of the models with 5-fold cross-validation. Based on 20 cancer datasets from The Cancer Genome Atlas (TCGA), XGBLC outperforms five state-of-the-art survival methods in terms of C-index, Brier score and AUC. The results show that XGBLC maintains good accuracy and robustness on simulated datasets of different scales. The developed prediction model would be beneficial for physicians to understand the effects of patients’ genomic characteristics on survival and to make personalized treatment decisions. Availability and implementation The implementation of the XGBLC algorithm based on R language is available at: https://github.com/lab319/XGBLC Supplementary information Supplementary data are available at Bioinformatics online.
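The key ingredient XGBoost-style boosting needs from a survival loss is the per-sample first- and second-order gradient. A minimal sketch of these statistics for the negative Cox partial log-likelihood (Breslow handling of ties, plain O(n^2) loops; an illustration of the idea, not the paper's implementation) is:

```python
import numpy as np

def cox_grad_hess(time, event, eta):
    """Per-sample gradient and (diagonal) Hessian of the negative Cox
    partial log-likelihood at linear predictor eta -- the statistics a
    gradient-boosting framework consumes in a custom objective."""
    exp_eta = np.exp(eta)
    grad = -event.astype(float)     # each event contributes -1 to its own sample
    hess = np.zeros(len(time))
    for i in range(len(time)):      # loop over event times
        if not event[i]:
            continue
        risk = time >= time[i]      # risk set at the i-th event time
        w = exp_eta / exp_eta[risk].sum()
        grad[risk] += w[risk]       # soft-max weights over the risk set
        hess[risk] += w[risk] * (1.0 - w[risk])
    return grad, hess
```

A useful sanity check is that the gradient sums to zero: every event adds -1 for itself and weights summing to +1 over its risk set.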


2020 ◽  
Vol 26 (33) ◽  
pp. 4195-4205
Author(s):  
Xiaoyu Ding ◽  
Chen Cui ◽  
Dingyan Wang ◽  
Jihui Zhao ◽  
Mingyue Zheng ◽  
...  

Background: Enhancing a compound’s biological activity is the central task of lead optimization in small-molecule drug discovery. However, performing many iterative rounds of compound synthesis and bioactivity testing is laborious. To address this issue, there is a strong demand for high-quality in silico bioactivity prediction approaches that prioritize the more active compound derivatives and reduce the trial-and-error process. Methods: Two kinds of bioactivity prediction models based on a large-scale structure-activity relationship (SAR) database were constructed. The first is based on the similarity of substituents and realized by matched molecular pair analysis, including SA, SA_BR, SR, and SR_BR. The second is based on SAR transferability and realized by matched molecular series analysis, including Single MMS pair, Full MMS series, and Multi single MMS pairs. Moreover, we also defined the applicability domain of the models using a distance-based threshold. Results: Among the seven individual models, the Multi single MMS pairs bioactivity prediction model showed the best performance (R2 = 0.828, MAE = 0.406, RMSE = 0.591), and the baseline model (SA) produced the lowest prediction accuracy (R2 = 0.798, MAE = 0.446, RMSE = 0.637). The predictive accuracy could be further improved by consensus modeling (R2 = 0.842, MAE = 0.397, RMSE = 0.563). Conclusion: An accurate bioactivity prediction model was built with a consensus method, which was superior to all individual models. Our model should be a valuable tool for lead optimization.
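The consensus effect can be sketched in a few lines: averaging individual predictions with roughly independent errors reduces error variance and so lifts R2. The activity data and the two base "models" below are hypothetical stand-ins, not the SAR database or MMP/MMS models of the paper:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R^2, MAE and RMSE, the three statistics reported in the abstract."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return (1 - ss_res / ss_tot,
            np.mean(np.abs(y_true - y_pred)),
            np.sqrt(np.mean((y_true - y_pred) ** 2)))

rng = np.random.default_rng(2)
pIC50 = rng.normal(7.0, 1.0, size=500)  # hypothetical activities

# Two hypothetical individual models with independent errors;
# the consensus prediction is their simple average.
m1 = pIC50 + rng.normal(0, 0.6, size=500)
m2 = pIC50 + rng.normal(0, 0.6, size=500)
consensus = (m1 + m2) / 2.0

r2_m1, mae_m1, rmse_m1 = regression_metrics(pIC50, m1)
r2_con, mae_con, rmse_con = regression_metrics(pIC50, consensus)
```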


2020 ◽  
Vol 10 (5) ◽  
pp. 1797 ◽  
Author(s):  
Mera Kartika Delimayanti ◽  
Bedy Purnama ◽  
Ngoc Giang Nguyen ◽  
Mohammad Reza Faisal ◽  
Kunti Robiatul Mahmudah ◽  
...  

Manual classification of sleep stages is a time-consuming but necessary step in the diagnosis and treatment of sleep disorders, and its automation has been an area of active study. Previous works have applied low-dimensional fast Fourier transform (FFT) features with a variety of machine learning algorithms. In this paper, we demonstrate that features extracted from EEG signals via the FFT improve the performance of automated sleep stage classification with machine learning methods. Unlike previous works using the FFT, we incorporated thousands of FFT features in order to classify the sleep stages into 2–6 classes. Using the expanded version of the Sleep-EDF dataset with 61 recordings, our method outperformed other state-of-the-art methods. This result indicates that high-dimensional FFT features combined with simple feature selection are effective for improving automated sleep stage classification.
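The feature extraction step can be sketched as follows, assuming a single-channel 30-second epoch at 100 Hz (the EEG sampling rate in Sleep-EDF). The point is that the full magnitude spectrum is kept as a high-dimensional feature vector, rather than a handful of band powers; the signal itself is synthetic:

```python
import numpy as np

fs = 100               # sampling rate in Hz (Sleep-EDF EEG channels are 100 Hz)
epoch = 30 * fs        # one 30-second scoring epoch = 3000 samples

# Hypothetical epoch: white noise plus a 10 Hz alpha-band burst.
rng = np.random.default_rng(3)
t = np.arange(epoch) / fs
x = rng.normal(size=epoch) + 2.0 * np.sin(2 * np.pi * 10 * t)

# High-dimensional FFT features: magnitudes of all positive-frequency
# bins (DC dropped), 1500 features per 30 s epoch.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(epoch, d=1 / fs)
features = spectrum[1:]
```

A simple filter-style feature selection (e.g. ranking bins by a univariate score) would then prune this vector before classification.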


2001 ◽  
Vol 10 (2) ◽  
pp. 241 ◽  
Author(s):  
Jon B. Marsden-Smedley ◽  
Wendy R. Catchpole

An experimental program was carried out in Tasmanian buttongrass moorlands to develop fire behaviour prediction models for improving fire management. This paper describes the results of the fuel moisture modelling section of this project. A range of previously developed fuel moisture prediction models is examined and three empirical dead fuel moisture prediction models are developed. McArthur’s grassland fuel moisture model predicted as well as a linear regression model using humidity and dew-point temperature. The regression model was preferred as a prediction model because it is inherently more robust. A prediction model based on hazard sticks was found to have strong seasonal effects which need further investigation before hazard sticks can be used operationally.
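An empirical model of this form, moisture regressed on humidity and dew-point temperature, can be sketched by ordinary least squares. The field observations and coefficients below are hypothetical, not the paper's fitted model:

```python
import numpy as np

# Hypothetical observations: relative humidity (%), dew-point
# temperature (deg C), and measured dead fuel moisture content (%).
rng = np.random.default_rng(4)
rh = rng.uniform(30, 100, size=60)
dew = rng.uniform(-2, 15, size=60)
moisture = 2.0 + 0.15 * rh + 0.4 * dew + rng.normal(0, 0.8, size=60)

# Linear regression of moisture on humidity and dew-point temperature,
# fitted by least squares (intercept column prepended).
X = np.column_stack([np.ones_like(rh), rh, dew])
coef, *_ = np.linalg.lstsq(X, moisture, rcond=None)
pred = X @ coef
```

Such a regression is "inherently more robust" in the abstract's sense because it interpolates smoothly from directly measured weather inputs rather than relying on a calibrated physical proxy.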


Electronics ◽  
2021 ◽  
Vol 10 (3) ◽  
pp. 285
Author(s):  
Kwok Tai Chui ◽  
Brij B. Gupta ◽  
Pandian Vasant

Understanding the remaining useful life (RUL) of equipment is crucial for optimal predictive maintenance (PdM). This addresses the issues of equipment downtime and unnecessary maintenance checks in run-to-failure and preventive maintenance. Both feature extraction and the prediction algorithm play crucial roles in the performance of RUL prediction models. A benchmark dataset, the Turbofan Engine Degradation Simulation Dataset, was selected for performance analysis and evaluation. The proposed combination of complete ensemble empirical mode decomposition and wavelet packet transform for feature extraction reduced the average root-mean-square error (RMSE) by 5.14–27.15% compared with six other approaches. As for the prediction algorithm, an RUL estimate may imply that the equipment needs to be repaired or replaced within either a shorter or a longer period of time; incorporating this characteristic can enhance the performance of an RUL prediction model. In this paper, we propose an RUL prediction algorithm that combines a recurrent neural network (RNN) and long short-term memory (LSTM): the former has the advantage in short-term prediction, whereas the latter manages better in long-term prediction. The weights combining the RNN and LSTM were designed by the non-dominated sorting genetic algorithm II (NSGA-II). The algorithm achieved an average RMSE of 17.2, improving the RMSE by 6.07–14.72% compared with the baseline models, stand-alone RNN and stand-alone LSTM. Compared with existing works, the RMSE improvement of the proposed work is 12.95–39.32%.
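The weighted combination can be sketched with stand-in predictors: one that is accurate at short horizons (RNN-like) and one at long horizons (LSTM-like). A 1-D grid search over the mixing weight replaces NSGA-II here, since the multi-objective machinery is beyond a short example; all data are synthetic assumptions:

```python
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

rng = np.random.default_rng(5)
rul = rng.uniform(20, 150, size=300)   # hypothetical true RUL values

# Stand-ins for the two base predictors: smaller error at short
# horizons for the RNN-like model, at long horizons for the LSTM-like.
short = rul < 80
pred_rnn = rul + rng.normal(0, np.where(short, 5.0, 20.0))
pred_lstm = rul + rng.normal(0, np.where(short, 20.0, 5.0))

# Tune the mixing weight w in pred = w * rnn + (1 - w) * lstm by
# minimizing RMSE over a grid (a stand-in for NSGA-II).
weights = np.linspace(0, 1, 101)
scores = [rmse(rul, w * pred_rnn + (1 - w) * pred_lstm) for w in weights]
best_w = float(weights[int(np.argmin(scores))])
best_rmse = min(scores)
```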


2021 ◽  
Vol 14 (7) ◽  
pp. 333
Author(s):  
Shilpa H. Shetty ◽  
Theresa Nithila Vincent

The study aimed to investigate the role of non-financial measures in predicting corporate financial distress in the Indian industrial sector. The proportion of independent directors on the board and the proportion of the promoters’ share in the ownership structure of the business were the non-financial measures analysed, along with ten financial measures. For this, the sample consisted of 82 companies that had filed for bankruptcy under the Insolvency and Bankruptcy Code (IBC). An equal number of matching financially sound companies also constituted the sample, for a total sample size of 164 companies. Data for the five years immediately preceding the bankruptcy filing were collected for the sample companies. The data of 120 companies, drawn evenly from the two groups, were used for developing the model and the remaining data were used for validating it. Two binary logistic regression models were developed, M1 and M2, where M1 was formulated with both financial and non-financial variables, and M2 had only financial variables as predictors. The diagnostic ability of the models was tested with the aid of the receiver operating characteristic (ROC) curve, the area under the curve (AUC), sensitivity, specificity and annual accuracy. The results of the study show that the inclusion of the two non-financial variables improved the efficacy of the financial distress prediction model. This study made a unique attempt to provide empirical evidence on the role played by non-financial variables in improving the efficiency of corporate distress prediction models.
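The AUC used in this kind of diagnostic test can be computed directly from ranks via the Mann-Whitney identity, without tracing the ROC curve point by point. A small self-contained sketch with made-up scores (assumes no tied scores across classes):

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) identity: the probability
    that a random distressed firm scores above a random sound one."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical predicted distress probabilities for 6 firms
# (1 = distressed, 0 = financially sound).
y = np.array([0, 0, 0, 1, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
auc = auc_score(y, s)
```

Here 7 of the 9 distressed-vs-sound pairs are ranked correctly, so the AUC is 7/9.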


Entropy ◽  
2021 ◽  
Vol 23 (6) ◽  
pp. 743
Author(s):  
Xi Liu ◽  
Shuhang Chen ◽  
Xiang Shen ◽  
Xiang Zhang ◽  
Yiwen Wang

Neural signal decoding is a critical technology in the brain-machine interface (BMI) to interpret movement intention from multi-neural activity collected from paralyzed patients. As a commonly used decoding algorithm, the Kalman filter is often applied to derive the movement states from high-dimensional neural firing observations. However, its performance is limited and less effective for noisy nonlinear neural systems with high-dimensional measurements. In this paper, we propose a nonlinear maximum correntropy information filter, aiming at better state estimation in the filtering process for a noisy high-dimensional measurement system. We reconstruct the measurement model between the high-dimensional measurements and low-dimensional states using a neural network, and derive the state estimation using the correntropy criterion to cope with the non-Gaussian noise and eliminate large initial uncertainty. Moreover, analyses of convergence and robustness are given. The effectiveness of the proposed algorithm is evaluated by applying it to multiple segments of neural spiking data from two rats to interpret the movement states while the subjects performed a two-lever discrimination task. Our results demonstrate better and more robust state estimation performance when compared with other filters.
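The robustness of the correntropy criterion can be illustrated in one dimension: Gaussian-kernel weights shrink the influence of samples far from the current estimate, so heavy-tailed outliers barely move the result. This is a toy fixed-point analogue of the idea, with synthetic data, not the proposed information filter:

```python
import numpy as np

def correntropy_mean(x, sigma=1.0, iters=30):
    """Fixed-point location estimate under the maximum correntropy
    criterion: each iteration reweights samples by a Gaussian kernel
    centered at the current estimate, down-weighting outliers."""
    mu = np.median(x)                    # robust starting point
    for _ in range(iters):
        w = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
        mu = np.sum(w * x) / np.sum(w)   # weighted mean update
    return mu

rng = np.random.default_rng(6)
clean = rng.normal(2.0, 0.3, size=200)
data = np.concatenate([clean, np.full(20, 25.0)])  # gross outliers

mcc_estimate = correntropy_mean(data)
naive_mean = data.mean()   # dragged toward the outliers
```

Under Gaussian noise and a large kernel width, the update reduces to the ordinary mean, which is why correntropy-based filters degrade gracefully to Kalman-like behavior.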


2018 ◽  
Vol 8 (4) ◽  
pp. 1-23 ◽  
Author(s):  
Deepa Godara ◽  
Amit Choudhary ◽  
Rakesh Kumar Singh

In today's world, the heart of modern technology is software. In order to keep pace with new technology, changes in software are inevitable. This article examines the association between changes and object-oriented metrics using different versions of open source software. Change prediction models can detect the probability of change in a class early in the software life cycle, which results in better effort allocation, more rigorous testing and easier maintenance of the software. Earlier researchers have used various techniques, such as statistical methods, for the prediction of change-prone classes. In this article, some new metrics, such as execution time, frequency, run-time information, popularity and class dependency, are proposed which can help in the prediction of change-prone classes. For evaluating the performance of the prediction model, the authors used sensitivity, specificity, and the ROC curve. Higher AUC values indicate that the prediction model gives highly accurate results. The proposed metrics contribute to the accurate prediction of change-prone classes.
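Sensitivity and specificity, the evaluation measures named above, come straight from the binary confusion matrix. A minimal sketch with hypothetical labels (1 = change-prone class):

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), computed
    from the confusion matrix of binary predictions."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical predictions for 10 classes.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])
sens, spec = sensitivity_specificity(y_true, y_pred)
```

Here 3 of 4 change-prone classes are caught (sensitivity 0.75) and 4 of 6 stable classes are correctly left alone (specificity about 0.67); sweeping the decision threshold over predicted probabilities traces out the ROC curve.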

