Deep Neural Network for Somatic Mutation Classification

Scientific Programming ◽

10.1155/2021/5529202 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Haifeng Wang ◽

Chengche Wang ◽

Hongchun Qu

Keyword(s):

Random Forest ◽

Somatic Mutation ◽

Somatic Mutations ◽

Area Under The Curve ◽

Gradient Boosting ◽

Support Vector ◽

Breast Cancers ◽

Sequencing Data ◽

Feed Forward ◽

Feed Forward Network

The detection and characterization of somatic mutations have become the important means to analyze the occurrence and development of cancer and, ultimately, will help to select effective and precise treatment for specific cancer patients. It is very difficult to detect somatic mutations accurately from the massive sequencing data. In this paper, a forest-graph-embedded deep feed-forward network (forgeNet) is utilized to detect somatic mutations from the sequencing data. In forgeNet, the random forest (RF) or Gradient Boosting Machine (GBM) and graph-embedded deep feed-forward network (GEDFN) are utilized to extract features and implement classification, respectively. Three real somatic mutation datasets collected from 48 triple-negative breast cancers are utilized to test the somatic mutation detection performances of forgeNet. The detection results show that forgeNet could make the 0.05%–0.424% improvements in terms of area under the curve (AUC) compared with support vector machines and random forest.

Download Full-text

802 An electronic health record-based approach to identify and characterize patients with immune checkpoint inhibitor-associated arthritis

Journal for ImmunoTherapy of Cancer ◽

10.1136/jitc-2021-sitc2021.802 ◽

2021 ◽

Vol 9 (Suppl 3) ◽

pp. A838-A839

Author(s):

Steven Tran ◽

Luke Rasmussen ◽

Jennifer Pacheco ◽

Carlos Galvez ◽

Kyle Tegtmeyer ◽

...

Keyword(s):

Neural Network ◽

Support Vector Machine ◽

Logistic Regression ◽

Random Forest ◽

Adverse Events ◽

Operating Characteristic ◽

Area Under The Curve ◽

Gradient Boosting ◽

Support Vector ◽

Health Record

BackgroundImmune checkpoint inhibitors (ICIs) are a pillar of cancer therapy with demonstrated efficacy in a variety of malignancies. However, they are associated with immune-related adverse events (irAEs) that affect many organ systems with varying severity, inhibiting patient quality of life and in some cases the ability to continue immunotherapy. Research into irAEs is nascent, and identifying patients with adverse events poses a critical challenge for future research efforts and patient care. This study's objective was to develop an electronic health record (EHR)-based model to identify and characterize patients with ICI-associated arthritis (checkpoint arthritis).MethodsForty-two patients with checkpoint arthritis were chart abstracted from a cohort of all patients who received checkpoint therapy for cancer (n=2,612) in a single-center retrospective study. All EHR clinical codes (N=32,198) were extracted including International Classification of Diseases (ICD)-9 and ICD-10, Logical Observation Identifiers Names and Codes (LOINC), RxNorm, and Current Procedural Terminology (CPT). Logistic regression, random forest, gradient boosting, support vector machine, K-nearest neighbors, and neural network machine learning models were trained to identify checkpoint arthritis patients using these clinical codes. Models were evaluated using receiver operating characteristic area under the curve (ROC-AUC), and the most important variables were determined from the logistic regression model. Models were retrained on smaller fractions of the important variables to determine the minimum variable set necessary to achieve accurate identification of checkpoint arthritis.ResultsLogistic regression and random forest were the highest performing models on the full variable set of 32,198 clinical codes (AUCs: 0.911, 0.894, respectively) (table 1). Retraining the models on smaller fractions of the most important variables demonstrated peak performance using the top 31 clinical codes, or 0.1% of the total variables (figure 1). The most important features included presence of ESR, CRP, rheumatoid factor lab, prednisone, joint pain, creatine kinase lab, thyroid labs, and immunization, all positively associated with checkpoint arthritis (figure 2).ConclusionsOur study demonstrates that a data-driven, EHR based approach can robustly identify checkpoint arthritis patients. The high performance of the models using only the 0.1% most important variables suggests that only a small number of clinical attributes are needed to identify these patients. The variables most important for identifying checkpoint arthritis included several unexpected clinical features, such as thyroid labs and immunization, indicating potential underlying irAE associations that warrant further exploration. Finally, the flexibility of this approach and its demonstrated effectiveness could be applied to identify and characterize other irAEs.Ethics ApprovalThis study was approved by the Northwestern University Institutional Review Board, ID STU00210502, with a granted waiver of consentAbstract 802 Table 1Model performance metricsAUC was calculated from the ROC curve. Sensitivity, specificity, PPV, and NPV were determined at the threshold maximizing the F1-score. AUC = area under the curve, ROC = receiver operating characteristic, PPV = positive predictive value, NPV = negative predictive valueAbstract 802 Figure 1Model AUC trained on decreasing fractions of the most important variables, determined by the random forest model. 100% = 32,198 clinical codes. LReg = logistic regression, RF = random forest, GB = gradient boosting, NN = neural network, KNN = K-nearest neighbor, SVM = support vector machine, SVMAnom = SVM anomaly detectionAbstract 802 Figure 2The 31 most important variables determined by the logistic regression (A, coefficients) and random forest (B, relative importance) models

Download Full-text

Investigating the use of random forest, gradient boosting machine, support vector machine and their ensemble applied to fault detection

10.26678/abcm.cobem2017.cob17-1600 ◽

2017 ◽

Author(s):

Luis Felipe Nogoseke ◽

Gabriel Herman Bernardim Andrade ◽

Marco Boaretto ◽

Leandro Coelho

Keyword(s):

Support Vector Machine ◽

Random Forest ◽

Fault Detection ◽

Gradient Boosting ◽

Support Vector ◽

Gradient Boosting Machine

Download Full-text

Utilizing Data-Driven Models to Predict Brittleness in Tuscaloosa Marine Shale: A Machine Learning Approach

10.2118/208628-stu ◽

2021 ◽

Author(s):

Jamal Ahmadov

Keyword(s):

Machine Learning ◽

Random Forest ◽

Brittleness Index ◽

Estimation Methods ◽

Gradient Boosting ◽

Average Error ◽

Support Vector ◽

Marine Shale ◽

Effective Manner ◽

Selection Of

Abstract The Tuscaloosa Marine Shale (TMS) formation is a clay- and liquid-rich emerging shale play across central Louisiana and southwest Mississippi with recoverable resources of 1.5 billion barrels of oil and 4.6 trillion cubic feet of gas. The formation poses numerous challenges due to its high average clay content (50 wt%) and rapidly changing mineralogy, making the selection of fracturing candidates a difficult task. While brittleness plays an important role in screening potential intervals for hydraulic fracturing, typical brittleness estimation methods require the use of geomechanical and mineralogical properties from costly laboratory tests. Machine Learning (ML) can be employed to generate synthetic brittleness logs and therefore, may serve as an inexpensive and fast alternative to the current techniques. In this paper, we propose the use of machine learning to predict the brittleness index of Tuscaloosa Marine Shale from conventional well logs. We trained ML models on a dataset containing conventional and brittleness index logs from 8 wells. The latter were estimated either from geomechanical logs or log-derived mineralogy. Moreover, to ensure mechanical data reliability, dynamic-to-static conversion ratios were applied to Young's modulus and Poisson's ratio. The predictor features included neutron porosity, density and compressional slowness logs to account for the petrophysical and mineralogical character of TMS. The brittleness index was predicted using algorithms such as Linear, Ridge and Lasso Regression, K-Nearest Neighbors, Support Vector Machine (SVM), Decision Tree, Random Forest, AdaBoost and Gradient Boosting. Models were shortlisted based on the Root Mean Square Error (RMSE) value and fine-tuned using the Grid Search method with a specific set of hyperparameters for each model. Overall, Gradient Boosting and Random Forest outperformed other algorithms and showed an average error reduction of 5 %, a normalized RMSE of 0.06 and a R-squared value of 0.89. The Gradient Boosting was chosen to evaluate the test set and successfully predicted the brittleness index with a normalized RMSE of 0.07 and R-squared value of 0.83. This paper presents the practical use of machine learning to evaluate brittleness in a cost and time effective manner and can further provide valuable insights into the optimization of completion in TMS. The proposed ML model can be used as a tool for initial screening of fracturing candidates and selection of fracturing intervals in other clay-rich and heterogeneous shale formations.

Download Full-text

Predictive models for stage and risk classification in head and neck squamous cell carcinoma (HNSCC)

PeerJ ◽

10.7717/peerj.9656 ◽

2020 ◽

Vol 8 ◽

pp. e9656

Author(s):

Sugandh Kumar ◽

Srinivas Patnaik ◽

Anshuman Dixit

Keyword(s):

Machine Learning ◽

Expression Profiles ◽

Disease Process ◽

Penalized Regression ◽

Functional Enrichment ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Sequencing Data ◽

Therapeutic Modalities

Machine learning techniques are increasingly used in the analysis of high throughput genome sequencing data to better understand the disease process and design of therapeutic modalities. In the current study, we have applied state of the art machine learning (ML) algorithms (Random Forest (RF), Support Vector Machine Radial Kernel (svmR), Adaptive Boost (AdaBoost), averaged Neural Network (avNNet), and Gradient Boosting Machine (GBM)) to stratify the HNSCC patients in early and late clinical stages (TNM) and to predict the risk using miRNAs expression profiles. A six miRNA signature was identified that can stratify patients in the early and late stages. The mean accuracy, sensitivity, specificity, and area under the curve (AUC) was found to be 0.84, 0.87, 0.78, and 0.82, respectively indicating the robust performance of the generated model. The prognostic signature of eight miRNAs was identified using LASSO (least absolute shrinkage and selection operator) penalized regression. These miRNAs were found to be significantly associated with overall survival of the patients. The pathway and functional enrichment analysis of the identified biomarkers revealed their involvement in important cancer pathways such as GP6 signalling, Wnt signalling, p53 signalling, granulocyte adhesion, and dipedesis. To the best of our knowledge, this is the first such study and we hope that these signature miRNAs will be useful for the risk stratification of patients and the design of therapeutic modalities.

Download Full-text

Algorithmic and data modeling: Will algorithmic modeling improve predictions of traits evaluated on ordinal scales?

10.1101/2020.10.07.329466 ◽

2020 ◽

Author(s):

Zhanyou Xu ◽

Andreomar Kurek ◽

Steven B. Cannon ◽

Williams D. Beavis

Keyword(s):

Support Vector Machine ◽

Random Forest ◽

Ridge Regression ◽

Genomic Prediction ◽

Ordinal Data ◽

Prediction Models ◽

Characteristic Curve ◽

Gradient Boosting ◽

Support Vector ◽

Data Types

AbstractSelection of markers linked to alleles at quantitative trait loci (QTL) for tolerance to Iron Deficiency Chlorosis (IDC) has not been successful. Genomic selection has been advocated for continuous numeric traits such as yield and plant height. For ordinal data types such as IDC, genomic prediction models have not been systematically compared. The objectives of research reported in this manuscript were to evaluate the most commonly used genomic prediction method, ridge regression and it’s equivalent logistic ridge regression method, with algorithmic modeling methods including random forest, gradient boosting, support vector machine, K-nearest neighbors, Naïve Bayes, and artificial neural network using the usual comparator metric of prediction accuracy. In addition we compared the methods using metrics of greater importance for decisions about selecting and culling lines for use in variety development and genetic improvement projects. These metrics include specificity, sensitivity, precision, decision accuracy, and area under the receiver operating characteristic curve. We found that Support Vector Machine provided the best specificity for culling IDC susceptible lines, while Random Forest GP models provided the best combined set of decision metrics for retaining IDC tolerant and culling IDC susceptible lines.

Download Full-text

Machine learning as a successful approach for predicting complex spatio–temporal patterns in animal species abundance

Animal Biodiversity and Conservation ◽

10.32800/abc.2021.44.0289 ◽

2021 ◽

pp. 289-301

Author(s):

B. Martín ◽

J. González–Arias ◽

J. A. Vicente–Vírseda

Keyword(s):

Machine Learning ◽

Random Forest ◽

Animal Species ◽

Temporal Patterns ◽

Additive Models ◽

Gradient Boosting ◽

Support Vector ◽

Stochastic Gradient Boosting ◽

Extreme Gradient Boosting ◽

Spatio Temporal

Our aim was to identify an optimal analytical approach for accurately predicting complex spatio–temporal patterns in animal species distribution. We compared the performance of eight modelling techniques (generalized additive models, regression trees, bagged CART, k–nearest neighbors, stochastic gradient boosting, support vector machines, neural network, and random forest –enhanced form of bootstrap. We also performed extreme gradient boosting –an enhanced form of radiant boosting– to predict spatial patterns in abundance of migrating Balearic shearwaters based on data gathered within eBird. Derived from open–source datasets, proxies of frontal systems and ocean productivity domains that have been previously used to characterize the oceanographic habitats of seabirds were quantified, and then used as predictors in the models. The random forest model showed the best performance according to the parameters assessed (RMSE value and R2). The correlation between observed and predicted abundance with this model was also considerably high. This study shows that the combination of machine learning techniques and massive data provided by open data sources is a useful approach for identifying the long–term spatial–temporal distribution of species at regional spatial scales.

Download Full-text

Techniques for Detecting Malware Traffic: A Comprehensive Approach to Feature Selection and Classification

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39088 ◽

2021 ◽

Vol 9 (12) ◽

pp. 1-10

Author(s):

Harsha A K

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Learning Algorithms ◽

Malware Detection ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Steady Increase ◽

Extreme Gradient Boosting

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detect malware like packet content analysis are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features like packet size, arrival time, source and destination addresses and other such metadata to detect malware. Such information can be used to train machine learning classifiers in order to classify malicious and benign packets. In this paper, we offer an efficient malware detection approach using classification algorithms in machine learning such as support vector machine, random forest and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyper parameters of the algorithms, in order to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998 respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.

Download Full-text

Sequence-Based Discovery of Antibacterial Peptides Using Ensemble Gradient Boosting

Proceedings ◽

10.3390/proceedings2020066006 ◽

2020 ◽

Vol 66 (1) ◽

pp. 6

Author(s):

Ehdieh Khaledian ◽

Shira L. Broschat

Keyword(s):

Operating Characteristic ◽

Area Under The Curve ◽

Roc Curves ◽

Gradient Boosting ◽

Support Vector ◽

Antibacterial Peptides ◽

Therapeutic Approaches ◽

Laboratory Procedures ◽

Feature Selection Techniques ◽

Better Than

Antimicrobial resistance is driving pharmaceutical companies to investigate different therapeutic approaches. One approach that has garnered growing consideration in drug development is the use of antimicrobial peptides (AMPs). Antibacterial peptides (ABPs), which occur naturally as part of the immune response, can serve as powerful, broad-spectrum antibiotics. However, conventional laboratory procedures for screening and discovering ABPs are expensive and time-consuming. Identification of ABPs can be significantly improved using computational methods. In this paper, we introduce a machine learning method for the fast and accurate prediction of ABPs. We gathered more than 6000 peptides from publicly available datasets and extracted 1209 features (peptide characteristics) from these sequences. We selected the set of optimal features by applying correlation-based and random forest feature selection techniques. Finally, we designed an ensemble gradient boosting model (GBM) to predict putative ABPs. We evaluated our model using receiver operating characteristic (ROC) curves, calculating the area under the curve (AUC) for several different models for comparison, including a recurrent neural network, a support vector machine, and iAMPpred. The AUC for the GBM was ~0.98, more than 3% better than any of the other models.

Download Full-text

Machine Learning Framework to Predict Chronic Kidney Disease using Ensemble Algorithm

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.d9107.069520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 1-6

Keyword(s):

Machine Learning ◽

Chronic Kidney Disease ◽

Random Forest ◽

Kidney Disease ◽

Decision Tree ◽

Performance Metrics ◽

Weighted Average ◽

Gradient Boosting ◽

Support Vector ◽

The Individual

Chronic Kidney Disease (CKD) is a worldwide concern that influences roughly 10% of the grown-up population on the world. For most of the people the early diagnosis of CKD is often not possible. Therefore, the utilization of present-day Computer aided supported strategies is important to help the conventional CKD finding framework to be progressively effective and precise. In this project, six modern machine learning techniques namely Multilayer Perceptron Neural Network, Support Vector Machine, Naïve Bayes, K-Nearest Neighbor, Decision Tree, Logistic regression were used and then to enhance the performance of the model Ensemble Algorithms such as ADABoost, Gradient Boosting, Random Forest, Majority Voting, Bagging and Weighted Average were used on the Chronic Kidney Disease dataset from the UCI Repository. The model was tuned finely to get the best hyper parameters to train the model. The performance metrics used to evaluate the model was measured using Accuracy, Precision, Recall, F1-score, Mathew`s Correlation Coefficient and ROC-AUC curve. The experiment was first performed on the individual classifiers and then on the Ensemble classifiers. The ensemble classifier like Random Forest and ADABoost performed better with 100% Accuracy, Precision and Recall when compared to the individual classifiers with 99.16% accuracy, 98.8% Precision and 100% Recall obtained from Decision Tree Algorithm

Download Full-text

Modelos de machine learning para predição do sucesso de startups

Revista de Gestão e Projetos ◽

10.5585/gep.v12i2.18942 ◽

2021 ◽

Vol 12 (2) ◽

pp. 28-55

Author(s):

Fabiano Rodrigues ◽

Francisco Aparecido Rodrigues ◽

Thelma Valéria Rocha Rodrigues

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Decision Tree ◽

Initial Public Offering ◽

Gradient Boosting ◽

Support Vector ◽

Trade Offs ◽

Extreme Gradient Boosting ◽

Public Offering

Este estudo analisa resultados obtidos com modelos de machine learning para predição do sucesso de startups. Como proxy de sucesso considera-se a perspectiva do investidor, na qual a aquisição da startup ou realização de IPO (Initial Public Offering) são formas de recuperação do investimento. A revisão da literatura aborda startups e veículos de financiamento, estudos anteriores sobre predição do sucesso de startups via modelos de machine learning, e trade-offs entre técnicas de machine learning. Na parte empírica, foi realizada uma pesquisa quantitativa baseada em dados secundários oriundos da plataforma americana Crunchbase, com startups de 171 países. O design de pesquisa estabeleceu como filtro startups fundadas entre junho/2010 e junho/2015, e uma janela de predição entre junho/2015 e junho/2020 para prever o sucesso das startups. A amostra utilizada, após etapa de pré-processamento dos dados, foi de 18.571 startups. Foram utilizados seis modelos de classificação binária para a predição: Regressão Logística, Decision Tree, Random Forest, Extreme Gradiente Boosting, Support Vector Machine e Rede Neural. Ao final, os modelos Random Forest e Extreme Gradient Boosting apresentaram os melhores desempenhos na tarefa de classificação. Este artigo, envolvendo machine learning e startups, contribui para áreas de pesquisa híbridas ao mesclar os campos da Administração e Ciência de Dados. Além disso, contribui para investidores com uma ferramenta de mapeamento inicial de startups na busca de targets com maior probabilidade de sucesso.

Download Full-text