A Comparison Study of Cox Models and Machine Learning Methods for Developing Breast Cancer Prognostic Prediction Models (Preprint)

2021 ◽  
Author(s):  
Jialong Xiao ◽  
Miao Mo ◽  
Zezhou Wang ◽  
Changming Zhou ◽  
Jie Shen ◽  
...  

BACKGROUND In recent years, machine learning (ML) methods have been increasingly explored for cancer prognosis prediction as improved ML algorithms have appeared. Some of these algorithms, such as support vector machines (SVM) for survival analysis and random survival forest (RSF), can model censored data. However, it is still debated whether traditional (Cox proportional hazards regression) or ML-based prognostic prediction models have better predictive performance. OBJECTIVE This study aims to use machine learning algorithms to predict the survival of breast cancer patients and to compare their predictive performance with that of traditional Cox regression. METHODS This retrospective cohort study included all patients diagnosed with breast cancer and subsequently hospitalized at Fudan University Shanghai Cancer Center (FUSCC) between January 1, 2008 and December 31, 2016. A total of 25267 cases with 21 features were eligible for model development, and the data set was randomly split into a training set (70%) and a test set (30%) for developing four models and predicting overall survival in breast cancer patients. The discriminative ability of the models was evaluated by the concordance index (C-index) and the time-dependent area under the curve (AUC); calibration was evaluated by the Brier score. RESULTS The RSF model showed the best discriminative performance among the four models, with 3-year, 5-year, and 10-year time-dependent AUCs of 0.857, 0.838, and 0.781, respectively, and a C-index of 0.827 (0.809, 0.845), which significantly outperformed the Cox-EN model (0.816, p=0.007), the Cox model (0.814, p=0.003), and the SVM model (0.812, p<0.001). The four models' 3-year, 5-year, and 10-year Brier scores were very close, ranging from 0.027 to 0.094, which indicated that all models had good calibration.
In terms of feature importance, elastic net and RSF both indicated that TNM staging, neoadjuvant therapy, number of lymph node metastases, age, and tumor diameter were the top 5 features for predicting the prognosis of breast cancer. A final online tool was developed to predict the overall survival of breast cancer patients. CONCLUSIONS The RSF model slightly outperformed the other models in discriminative ability, showing its potential as an effective approach for survival analysis. CLINICALTRIAL ClinicalTrials.gov, registration number: NCT04996732.
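
As a rough illustration of the discrimination metric named in this abstract, the sketch below computes Harrell's C-index for right-censored data in plain Python. It is not the authors' code: the function name `concordance_index`, the tie handling (pairs with tied times are skipped, tied risk scores count as 0.5), and the toy data are assumptions made for illustration only.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data (illustrative sketch).

    times       -- observed follow-up time per patient
    events      -- 1 if death was observed, 0 if the patient was censored
    risk_scores -- model output; a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # assumption: pairs with tied times are skipped
        # a is the patient with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the third one censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.1, 0.5, 0.7]
print(concordance_index(toy_times, toy_events, toy_risk))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported C-indices (0.812 to 0.827) should be read.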

2020 ◽  
Author(s):  
Bum-Sup Jang ◽  
In Ah Kim

Abstract Background: Using machine learning algorithms, we aimed to identify a mutated gene set from whole exome sequencing (WES) data of blood that is associated with overall survival in breast cancer patients. Methods: WES data from 1,181 female breast cancer patients within the UK Biobank cohort were collected. The number of mutations for each gene was summed and defined as the blood-based mutation burden per patient. Using a long short-term memory (LSTM) machine learning algorithm and XGBoost, a gradient-boosted tree algorithm, we developed a model to predict patient overall survival. Results: In the UK Biobank breast cancer cohort, the most altered genes in blood samples were related to the TP53 pathway. In the LSTM model, a minimum of 50 genes was found to predict high vs. low mutation burden. In the XGBoost survival model, the gene set predicted overall survival with a concordance index of 0.75 and a scaled Brier score of 0.146 on the held-out testing set (20%, N=236). In older patients (≥ 56 years), the high mutation group based on this gene set showed inferior overall survival compared to the low mutation group (log-rank test, P=0.042). Conclusion: The machine learning algorithms revealed a gene signature in the UK Biobank breast cancer cohort. Mutational burden observed in blood was associated with overall survival in relatively old patients. This gene signature should be verified in a prospective setting.


2020 ◽  
Vol 22 (1) ◽  
Author(s):  
Kyung-Min Lee ◽  
Hyebin Lee ◽  
Dohyun Han ◽  
Woo Kyung Moon ◽  
Kwangsoo Kim ◽  
...  

Abstract Background Chemotherapy is the standard treatment for breast cancer; however, the response to chemotherapy is disappointingly low. Here, we investigated the alternative therapeutic efficacy of novel combination treatment with necroptosis-inducing small molecules to overcome chemotherapeutic resistance in tyrosine aminoacyl-tRNA synthetase (YARS)-positive breast cancer. Methods Pre-chemotherapeutic needle biopsy of 143 invasive ductal carcinomas undergoing the same chemotherapeutic regimen was subjected to proteomic analysis. Four different machine learning algorithms were employed to determine signature protein combinations. Immunoreactive markers were selected using three common candidate proteins from the machine-learning algorithms and verified by immunohistochemistry using 123 cases of independent needle biopsy FFPE samples. The regulation of chemotherapeutic response and necroptotic cell death was assessed using lentiviral YARS overexpression and depletion 3D spheroid formation assay, viability assays, LDH release assay, flow cytometry analysis, and transmission electron microscopy. The ROS-induced metabolic dysregulation and phosphorylation of necrosome complex by YARS were assessed using oxygen consumption rate analysis, flow cytometry analysis, and 3D cell viability assay. The therapeutic roles of SMAC mimetics (LCL161) and a pan-BCL2 inhibitor (ABT-263) were determined by 3D cell viability assay and flow cytometry analysis. Additional biologic process and protein-protein interaction pathway analysis were performed using Gene Ontology annotation and Cytoscape databases. Results YARS was selected as a potential biomarker by proteomics-based machine-learning algorithms and was exclusively associated with good response to chemotherapy by subsequent immunohistochemical validation. In 3D spheroid models of breast cancer cell lines, YARS overexpression significantly improved chemotherapy response via phosphorylation of the necrosome complex. 
YARS-induced necroptosis sequentially mediated mitochondrial dysfunction through the overproduction of ROS in breast cancer cell lines. Combination treatment with necroptosis-inducing small molecules, including a SMAC mimetic (LCL161) and a pan-BCL2 inhibitor (ABT-263), showed therapeutic efficacy in YARS-overexpressing breast cancer cells. Conclusions Our results indicate that, before chemotherapy, an initial screening of YARS protein expression should be performed, and YARS-positive breast cancer patients might consider the combined treatment with LCL161 and ABT-263; this could be a novel stepwise clinical approach to apply new targeted therapy in breast cancer patients in the future.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Bum-Joo Cho ◽  
Kyoung Min Kim ◽  
Sanchir-Erdene Bilegsaikhan ◽  
Yong Joon Suh

Abstract Febrile neutropenia (FN) is one of the most concerning complications of chemotherapy, and its prediction remains difficult. This study aimed to identify the risk factors for FN and to build prediction models of FN using machine learning algorithms. Medical records of hospitalized patients who underwent chemotherapy after surgery for breast cancer between May 2002 and September 2018 were selectively reviewed for model development. Demographic, clinical, pathological, and therapeutic data were analyzed to identify risk factors for FN. Using machine learning algorithms, prediction models were developed and evaluated for performance. Of 933 selected inpatients with a mean age of 51.8 ± 10.7 years, FN developed in 409 (43.8%) patients. There was a significant difference in FN incidence according to age, staging, taxane-based regimen, and blood count 5 days after chemotherapy. The area under the curve (AUC) of the logistic regression model built on these findings was 0.870. Machine learning improved the AUC to 0.908. Machine learning improves the prediction of FN in patients undergoing chemotherapy for breast cancer compared to the conventional statistical model. In these high-risk patients, primary prophylaxis with granulocyte colony-stimulating factor could be considered.


2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Yingyi Hao ◽  
Li He ◽  
Yifan Zhou ◽  
Yiru Zhao ◽  
Menglong Li ◽  
...  

In clinical cancer research, how to accurately stratify patients based on genomic data is a hot topic. With the development of next-generation sequencing technology, more and more types of genomic features, such as mRNA expression level, can be used to distinguish cancer patients. Previous studies commonly stratified patients by using a single type of genomic feature, which can only reflect one aspect of the cancer. In fact, multiscale genomic features will provide more information and may be helpful for clinical prediction. In addition, most conventional machine learning algorithms use a handcrafted gene set as features to construct models, which is generally selected by a statistical method with an arbitrary cut-off, e.g., p value < 0.05. The genes in the gene set are not necessarily related to the cancer and will make the model unreliable. Therefore, in our study, we thoroughly investigated the performance of different machine learning methods on stratifying breast cancer patients with a single type of genomic feature. Then, we proposed a strategy that takes into account the degree of correlation between genes and cancer patients to identify features from mRNAs and microRNAs, and evaluated the performance of the models with the new combined multiscale genomic features. The results showed that, compared with the models constructed with a single type of feature, the models with the multiscale genomic features generated by our proposed method achieved better performance on stratifying the ER status of breast cancer patients. Moreover, we found by gene set enrichment analysis that the identified multiscale genomic features were closely related to the cancer, indicating that our proposed strategy can well reflect the biological relevance of the genes to breast cancer.
In conclusion, modelling with multiscale genomic features closely related to the cancer not only can guarantee the prediction performance of the models but also can effectively provide candidate genes for interpreting the mechanisms of cancer.


2021 ◽  
Vol 39 (15_suppl) ◽  
pp. e12567-e12567
Author(s):  
Hao Yu ◽  
Fang Chen ◽  
Li Yang ◽  
Jian-Yue Jin ◽  
Feng-Ming Spring Kong

Background: Radiation-induced lymphopenia accompanying radiation therapy is associated with inferior clinical outcomes in a wide variety of solid malignancies. This study aimed to examine the potential determinants of radiation-induced lymphocyte decrease and radiation-induced lymphopenia in breast cancer patients who underwent radiotherapy. Methods: Patients with breast cancer who underwent radiotherapy at the University of Hong Kong-Shenzhen Hospital were enrolled (our cohort). Circulating lymphocyte levels were evaluated within 7 days prior to and at the end of radiation therapy. Feature groups including clinical data, tumor characteristics, radiotherapy dosimetrics, and treatment regimens were also collected. We applied a machine learning algorithm (Extreme Gradient Boosting, XGBoost) to predict the ratio of the lymphocyte level after radiotherapy to the baseline lymphocyte level and the event of lymphopenia, and compared it with Lasso regression approaches. Next, we used Shapley additive explanations (SHAP) to explore the directional contribution of each feature to lymphocyte decrease and lymphopenia. For model validation and proof-of-concept validation, an independent cohort of patients enrolled in a prospective trial was used (IP cohort). Results: A total of 589 patients were enrolled in our cohort and 203 patients in the IP cohort. XGBoost models trained in our cohort achieved a mean RMSE of 0.157 and R2 of 53.9% for the ratio of lymphocyte levels, and a mean accuracy of 0.757 and ROC-AUC of 0.733 for lymphopenia events. In the fully independent IP cohort, these models predicted the ratio of lymphocyte levels with a mean RMSE of 0.175 and R2 of 47%, and lymphopenia events with a mean accuracy of 0.739 and ROC-AUC of 0.737.
The dosimetrics feature group had the largest predictive power (RMSE: 0.192, R2: 29.8%, accuracy: 0.678, ROC-AUC: 0.667), followed by the baseline blood cell group (RMSE: 0.207, R2: 18.9%, accuracy: 0.669, ROC-AUC: 0.645). Next, SHAP value analysis indicated that the integral dose of the total body, V5 dose, mean lung dose, and V20 dose of the ipsilateral lung/bilateral lungs were important promoting factors for lymphocyte decrease and for the event of lymphopenia, while baseline monocytes, mean heart dose, and tumor size played a protective role to some extent. Conclusions: In this study, we constructed robust XGBoost models for predicting lymphocyte decrease and the event of lymphopenia in breast cancer patients who underwent radiation therapy. We also applied SHAP analysis to reveal the directional contribution of features. These results are important both for understanding the contribution of dosimetrics to the immune response and for refining radiation dosimetrics before treatment in future clinical use.
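
For readers less familiar with the regression metrics quoted in this abstract, the following minimal sketch shows how RMSE and R2 are conventionally computed for a predicted lymphocyte ratio. The function names `rmse` and `r_squared` and the toy values are hypothetical; the abstract does not describe the authors' actual evaluation code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical post/pre-radiotherapy lymphocyte ratios for four patients.
observed = [0.8, 0.6, 0.7, 0.5]
predicted = [0.75, 0.65, 0.6, 0.55]
print(rmse(observed, predicted), r_squared(observed, predicted))
```

An R2 reported as a percentage, such as 53.9%, corresponds to 0.539 on the scale returned by `r_squared`.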


2021 ◽  
Vol 12 (4) ◽  
pp. 117-137
Author(s):  
Mazen Mobtasem El-Lamey ◽  
Mohab Mohammed Eid ◽  
Muhammad Gamal ◽  
Nour-Elhoda Mohamed Bishady ◽  
Ali Wagdy Mohamed

Breast cancer is the most common type of cancer, and the number of breast cancer patients is very large. Because of this, many breast cancer-focused hospitals are unable to process all of their patients, which may leave some women undiagnosed until the late stages of cancer. Automating the process can therefore help these hospitals speed up cancer detection. In this paper, the authors test several machine learning models, such as k-nearest neighbours (KNN), support vector machine (SVM), and artificial neural network (ANN). They then compare the models' accuracies and losses with one another and with models developed by other researchers to see whether their approach is efficient and to decide which machine learning algorithm is best to use.


2021 ◽  
Author(s):  
Nuno Moniz ◽  
Susana Barbosa

The Dansgaard-Oeschger (DO) events are one of the most striking examples of abrupt climate change in the Earth's history, representing temperature oscillations of about 8 to 16 degrees Celsius within a few decades. DO events have been studied extensively in paleoclimatic records, particularly in ice core proxies. Examples include the Greenland NGRIP record of oxygen isotopic composition. This work addresses the anticipation of DO events using machine learning algorithms. We consider the NGRIP time series from 20 to 60 kyr b2k with the GICC05 timescale and 20-year temporal resolution. Forecasting horizons range from 0 (nowcasting) to 400 years. We adopt three different machine learning algorithms (random forests, support vector machines, and logistic regression) in training windows of 5 kyr. We perform validation on subsequent test windows of 5 kyr, based on timestamps of previous DO events' classification in Greenland by Rasmussen et al. (2014). We perform experiments with both sliding and growing windows. Results show that predictions on sliding windows are better overall, indicating that modelling is affected by non-stationary characteristics of the time series. The three algorithms' predictive performance is similar, with a slightly better performance of random forest models for shorter forecast horizons. The prediction models' predictive capability decreases as the forecasting horizon lengthens but remains reasonable up to 120 years. Model performance degradation is mostly related to imprecision in determining the start and end time of events and to identifying some periods as DO events when they are not.
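
The difference between sliding and growing training windows can be sketched as index ranges over the time series. The example below is an assumption-laden illustration, not the authors' procedure: the function name `window_splits` is hypothetical, the window is assumed to advance by one full window length per step (the abstract does not state the step size), and the sample counts are derived only from the figures given (40 kyr at 20-year resolution gives 2000 samples; a 5 kyr window gives 250 samples).

```python
def window_splits(n_samples, window, growing=False):
    """Yield (train_start, train_end, test_start, test_end) half-open ranges.

    Sliding mode trains on the most recent `window` samples only.
    Growing mode trains on everything from index 0 up to the test window.
    Assumption: the window advances by exactly `window` samples per step.
    """
    start = 0
    while start + 2 * window <= n_samples:
        train_start = 0 if growing else start
        train_end = start + window
        yield (train_start, train_end, train_end, train_end + window)
        start += window


N_SAMPLES = 2000  # 40 kyr of record at 20-year resolution
WINDOW = 250      # 5 kyr at 20-year resolution

sliding = list(window_splits(N_SAMPLES, WINDOW))
growing = list(window_splits(N_SAMPLES, WINDOW, growing=True))
for s, g in zip(sliding, growing):
    print("sliding", s, "growing", g)
```

In both modes the test window always follows the training data in time, so no future samples leak into training; only the amount of history used for fitting differs.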


2019 ◽  
Author(s):  
Herdiantri Sufriyana ◽  
Atina Husnayain ◽  
Ya-Lin Chen ◽  
Chao-Yang Kuo ◽  
Onkar Singh ◽  
...  

BACKGROUND Predictions in pregnancy care are complex because of interactions among multiple factors. Hence, pregnancy outcomes are not easily predicted by a single predictor using only one algorithm or modeling method. OBJECTIVE This study aims to review and compare the predictive performances between logistic regression (LR) and other machine learning algorithms for developing or validating a multivariable prognostic prediction model for pregnancy care to inform clinicians’ decision making. METHODS Research articles from MEDLINE, Scopus, Web of Science, and Google Scholar were reviewed following several guidelines for a prognostic prediction study, including a risk of bias (ROB) assessment. We report the results based on the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. Studies were primarily framed as PICOTS (population, index, comparator, outcomes, timing, and setting): Population: men or women in procreative management, pregnant women, and fetuses or newborns; Index: multivariable prognostic prediction models using non-LR algorithms for risk classification to inform clinicians’ decision making; Comparator: the models applying an LR; Outcomes: pregnancy-related outcomes of procreation or pregnancy outcomes for pregnant women and fetuses or newborns; Timing: pre-, inter-, and peripregnancy periods (predictors), at the pregnancy, delivery, and either puerperal or neonatal period (outcome), and either short- or long-term prognoses (time interval); and Setting: primary care or hospital. The results were synthesized by reporting study characteristics and ROBs and by random effects modeling of the difference of the logit area under the receiver operating characteristic curve of each non-LR model compared with the LR model for the same pregnancy outcomes. We also reported between-study heterogeneity by using τ² and I².
RESULTS Of the 2093 records, we included 142 studies for the systematic review and 62 studies for a meta-analysis. Most prediction models used LR (92/142, 64.8%) and, among non-LR algorithms, artificial neural networks (20/142, 14.1%). Only 16.9% (24/142) of studies had a low ROB. A total of 2 non-LR algorithms from low ROB studies significantly outperformed LR. The first algorithm was a random forest for preterm delivery (logit AUROC 2.51, 95% CI 1.49-3.53; I²=86%; τ²=0.77) and pre-eclampsia (logit AUROC 1.2, 95% CI 0.72-1.67; I²=75%; τ²=0.09). The second algorithm was gradient boosting for cesarean section (logit AUROC 2.26, 95% CI 1.39-3.13; I²=75%; τ²=0.43) and gestational diabetes (logit AUROC 1.03, 95% CI 0.69-1.37; I²=83%; τ²=0.07). CONCLUSIONS Prediction models with the best performances across studies were not necessarily those that used LR; models using random forest and gradient boosting also performed well. We recommend a reanalysis of existing LR models for several pregnancy outcomes by comparing them with those algorithms that apply standard guidelines. CLINICALTRIAL PROSPERO (International Prospective Register of Systematic Reviews) CRD42019136106; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=136106
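
To make the synthesis step more concrete, the sketch below pools per-study differences in logit AUROC with a random-effects model and reports τ² and I². It assumes the DerSimonian-Laird estimator, which is a common choice but is not named in the abstract; the function names `logit` and `random_effects_pool` and all input values are hypothetical.

```python
import math


def logit(p):
    """Log-odds transform, applied to an AUROC strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooling (assumed estimator).

    effects   -- per-study effect, e.g. logit(AUROC_nonLR) - logit(AUROC_LR)
    variances -- per-study sampling variance of that effect
    Returns (pooled_effect, standard_error, tau2, i2_percent).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]              # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0    # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]     # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2, i2


# Hypothetical AUROC pairs (non-LR model, LR model) and assumed variances.
auroc_pairs = [(0.85, 0.78), (0.90, 0.80), (0.74, 0.73)]
diffs = [logit(ml) - logit(lr) for ml, lr in auroc_pairs]
pooled, se, tau2, i2 = random_effects_pool(diffs, [0.04, 0.05, 0.03])
print(pooled, pooled - 1.96 * se, pooled + 1.96 * se, tau2, i2)
```

A pooled difference whose 95% interval excludes zero would indicate that the non-LR algorithm outperforms LR on the logit AUROC scale, while a high I² signals that much of the variation across studies reflects heterogeneity rather than sampling error.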


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Roqia Saleem Awad Maabreh ◽  
Malik Bader Alazzam ◽  
Ahmed S. AlGhamdi

Today, cancer is the second leading cause of death worldwide, and the number of people diagnosed with the disease is expected to rise. Breast cancer is the most commonly diagnosed cancer in women, and it has one of the highest survival rates when treated properly. Because the effectiveness of treatment and, as a result, the survival of the patient depend on each case, it is critical to model survival ahead of time. Artificial intelligence is a rapidly expanding field, and its clinical applications are following suit (having surpassed humans in many evidence-based medical tasks). Since the first stable risk estimator based on statistical methods appeared in survival analysis, numerous versions of it have been created, with machine learning used in only a few of them. Nonlinear relationships between variables and the impact they have on the variable to be predicted are very easy to evaluate using statistical methods. However, because they are just mathematical equations, they have flaws that limit the quality of their output. The main goal of this study is to find the best machine learning algorithms for predicting the individualised survival of breast cancer patients, as well as the most appropriate treatment, and to propose new numerical variable stratifications. These stratifications will be carried out using unsupervised machine learning methods that divide patients into groups based on their risk in each dataset. We will compare them to standard groupings to see whether they have more significance. Knowing that the greatest challenge in dealing with clinical data is its quantity and quality, we have gone to great lengths to ensure the quality of the data before replicating them. We used the Cox statistical method in conjunction with other statistical methods and tests to find the best possible dataset with which to train our model, despite its ease of multivariate analysis.

