Splitting on categorical predictors in random forests

PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e6339 ◽  
Author(s):  
Marvin N. Wright ◽  
Inke R. König

One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical methods and machine learning approaches, no recoding such as dummy coding is required to handle ordinal and nominal predictors. The standard approach for nominal predictors is to consider all $2^{k-1} - 1$ 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k − 1 splits have to be considered for a nominal predictor with k categories. For multiclass classification and survival prediction, no ordering method producing equivalent splits exists. We therefore propose to use a heuristic which orders the categories according to the first principal component of the weighted covariance matrix in multiclass classification and by log-rank scores in survival prediction. This ordering of categories can be done either in every split or a priori, that is, just once before growing the forest. With this approach, the nominal predictor can be treated as ordinal in the entire RF procedure, speeding up the computation and avoiding category limits. We compare the proposed methods with the standard approach, dummy coding and simply ignoring the nominal nature of the predictors in several simulation settings and on real data in terms of prediction performance and computational efficiency. We show that ordering the categories a priori is at least as good as the standard approach of considering all 2-partitions in all datasets considered, while being computationally faster. We recommend using this approach as the default in RFs.
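The regression case described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy data, and the choice of within-node sum of squared errors as the split criterion are assumptions made for illustration. The sketch compares an exhaustive search over all $2^{k-1} - 1$ two-partitions with a scan over the k − 1 splits obtained after ordering the categories by their mean response.

```python
from itertools import combinations


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_cost(data, left_categories):
    """Total within-node SSE when `left_categories` go left and the rest go right."""
    left = [y for c, y in data if c in left_categories]
    right = [y for c, y in data if c not in left_categories]
    return sse(left) + sse(right)


def best_split_exhaustive(data):
    """Evaluate all 2^(k-1) - 1 two-partitions of the k categories."""
    cats = sorted({c for c, _ in data})
    first, rest = cats[0], cats[1:]
    candidates = []
    # Keep the first category on the left so mirror-image partitions are not counted twice.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            if len(left) < len(cats):  # the right node must not be empty
                candidates.append(left)
    best = min(candidates, key=lambda s: split_cost(data, s))
    return split_cost(data, best), len(candidates)


def best_split_ordered(data):
    """Order categories by mean response, then evaluate only the k - 1 ordered splits."""
    cats = sorted({c for c, _ in data})
    means = {}
    for cat in cats:
        ys = [y for c, y in data if c == cat]
        means[cat] = sum(ys) / len(ys)
    ordered = sorted(cats, key=lambda c: means[c])
    candidates = [set(ordered[:i]) for i in range(1, len(ordered))]
    best = min(candidates, key=lambda s: split_cost(data, s))
    return split_cost(data, best), len(candidates)


# Hypothetical toy data: (category, response) pairs with k = 5 categories.
data = [
    ("a", 1.0), ("a", 1.2), ("b", 5.0), ("b", 5.5), ("c", 0.9),
    ("c", 1.1), ("d", 4.8), ("d", 5.1), ("e", 3.0), ("e", 2.8),
]
cost_all, n_all = best_split_exhaustive(data)
cost_ord, n_ord = best_split_ordered(data)
print(n_all, n_ord, abs(cost_all - cost_ord) < 1e-9)
```

With five categories the exhaustive search evaluates 15 candidate splits and the ordered scan evaluates 4, and on this toy data both reach the same minimum cost. The heuristics proposed for multiclass classification and survival prediction are not shown here.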

2021 ◽  
Vol 12 ◽  
Author(s):  
Angeliki G. Vittoraki ◽  
Asimina Fylaktou ◽  
Katerina Tarassi ◽  
Zafeiris Tsinaris ◽  
Alexandra Siorenta ◽  
...  

Detection of alloreactive anti-HLA antibodies is a frequent and mandatory test before and after organ transplantation to determine the antigenic targets of the antibodies. Nowadays, this test involves the measurement of fluorescent signals generated through antibody–antigen reactions on multi-bead flow cytometers. In this study, in a cohort of 1,066 patients from one country, anti-HLA class I responses were analyzed on a panel of 98 different antigens. Knowing that the immune system typically responds to “shared” antigenic targets, we studied the clustering patterns of antibody responses against HLA class I antigens without any a priori hypothesis, applying two unsupervised machine learning approaches. First, the principal component analysis (PCA) projections of intra-locus specific responses showed that anti-HLA-A and anti-HLA-C were the most distantly projected responses in the population, with the anti-HLA-B responses projected between them. When PCA was applied to the responses against antigens belonging to a single locus, some already known groupings were confirmed while several new cross-reactive patterns of alloreactivity were detected. Anti-HLA-A responses projected through PCA suggested that three cross-reactive groups accounted for about 70% of the variance observed in the population, while anti-HLA-B responses were mainly characterized by a distinction between the previously described Bw4 and Bw6 cross-reactive groups, followed by several as yet undocumented or poorly described ones. Furthermore, anti-HLA-C responses could be explained by two major cross-reactive groups completely overlapping with the previously described C1 and C2 allelic groups. A second, feature-based analysis of all antigenic specificities, projected as a dendrogram, generated a robust measure of allelic antigenic distances depicting bead-array-defined cross-reactive groups. Finally, amino acid combinations explaining major population-specific cross-reactive groups were described. The interpretation of the results was based on the current knowledge of the antigenic targets of the antibodies, as they have been characterized either experimentally or computationally and appear in the HLA Epitope Registry.


Author(s):  
Michael Withnall ◽  
Edvard Lindelöf ◽  
Ola Engkvist ◽  
Hongming Chen

We introduce Attention and Edge Memory schemes to the existing Message Passing Neural Network framework for graph convolution, and benchmark our approaches against eight different physical-chemical and bioactivity datasets from the literature. We remove the need to introduce a priori knowledge of the task and chemical descriptor calculation by using only fundamental graph-derived properties. Our results consistently perform on par with other state-of-the-art machine learning approaches, and set a new standard on sparse multi-task virtual screening targets. We also investigate model performance as a function of dataset preprocessing, and make some suggestions regarding hyperparameter selection.
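As a rough illustration of what attention adds to message passing, the sketch below performs one round of attention-weighted neighbour aggregation on a tiny graph. It is a generic, untrained toy and not the architecture from this paper: the dot-product attention score, the function names, and the three-node example graph are all assumptions, and the Edge Memory scheme is not represented.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_message_pass(features, neighbors):
    """One round: each node's new state is an attention-weighted mean of its neighbours' states."""
    updated = {}
    for node, nbrs in neighbors.items():
        if not nbrs:
            # Isolated nodes keep their current state.
            updated[node] = list(features[node])
            continue
        # Assumed scoring rule: dot product between the node and each neighbour.
        weights = softmax([dot(features[node], features[n]) for n in nbrs])
        dim = len(features[node])
        updated[node] = [
            sum(w * features[n][d] for w, n in zip(weights, nbrs))
            for d in range(dim)
        ]
    return updated


# Hypothetical three-atom chain C1 - C2 - O with one-hot element features.
features = {"C1": [1.0, 0.0], "C2": [1.0, 0.0], "O": [0.0, 1.0]}
neighbors = {"C1": ["C2"], "C2": ["C1", "O"], "O": ["C2"]}
updated = attention_message_pass(features, neighbors)
print(updated["C2"])
```

In this toy graph the middle atom attends more strongly to the neighbour whose features resemble its own. A trained network would learn the scoring and update functions instead of using a fixed dot product.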


2019 ◽  
Author(s):  
Oskar Flygare ◽  
Jesper Enander ◽  
Erik Andersson ◽  
Brjánn Ljótsson ◽  
Volen Z Ivanov ◽  
...  

**Background:** Previous attempts to identify predictors of treatment outcomes in body dysmorphic disorder (BDD) have yielded inconsistent findings. One way to increase precision and clinical utility could be to use machine learning methods, which can incorporate multiple non-linear associations in prediction models. **Methods:** This study used a random forests machine learning approach to test if it is possible to reliably predict remission from BDD in a sample of 88 individuals that had received internet-delivered cognitive behavioral therapy for BDD. The random forest models were compared to traditional logistic regression analyses. **Results:** Random forests correctly identified 78% of participants as remitters or non-remitters at post-treatment. The accuracy of prediction was lower in subsequent follow-ups (68%, 66% and 61% correctly classified at 3-, 12- and 24-month follow-ups, respectively). Depressive symptoms, treatment credibility, working alliance, and initial severity of BDD were among the most important predictors at the beginning of treatment. By contrast, the logistic regression models did not identify consistent and strong predictors of remission from BDD. **Conclusions:** The results provide initial support for the clinical utility of machine learning approaches in the prediction of outcomes of patients with BDD. **Trial registration:** ClinicalTrials.gov ID: NCT02010619.


2018 ◽  
Author(s):  
Peter De Wolf ◽  
Zhuangqun Huang ◽  
Bede Pittenger

Methods are available to measure conductivity, charge, surface potential, carrier density, piezo-electric and other electrical properties with nanometer scale resolution. One of these methods, scanning microwave impedance microscopy (sMIM), has gained interest due to its capability to measure the full impedance (capacitance and resistive part) with high sensitivity and high spatial resolution. This paper introduces a novel data-cube approach that combines sMIM imaging and sMIM point spectroscopy, producing an integrated and complete 3D data set. This approach replaces the subjective approach of guessing locations of interest (for single point spectroscopy) with a big data approach resulting in higher dimensional data that can be sliced along any axis or plane and is conducive to principal component analysis or other machine learning approaches to data reduction. The data-cube approach is also applicable to other AFM-based electrical characterization modes.


2021 ◽  
Author(s):  
Bobin Ning ◽  
Yonggan Xue ◽  
Hongyi Liu ◽  
Hongyu Sun ◽  
Baoqing Jia

Although substantial advances in understanding the tumor microenvironment (TME) of hepatocellular carcinoma (HCC) have led to fundamental improvements in both basic research and clinical management, the potential mechanisms and regulatory relationships between m6A regulators and the TME are still unknown. We first conducted unsupervised clustering on the samples according to the core m6A expression, then compared the signaling pathways, differentially expressed genes (DEGs), and TME between the m6A phenotypes, and re-validated the relationship between m6A regulators and the TME by single-cell sequencing. Then, the geneClusters were obtained by another unsupervised clustering of the DEGs, and the clinical as well as TME traits were evaluated among the geneClusters. Finally, the m6A scores of individual patients were calculated by principal component analysis (PCA) to verify the correlation from multiple perspectives, including survival, clinical characteristics, mutations, TME, immunotherapy, and chemotherapy. Through a comprehensive analysis of 729 samples, we classified HCC patients into three m6A clusters and three geneClusters. Each group exhibited remarkable variations in terms of signaling pathways, clinical traits, and survival expectations. Notably, the m6A phenotypes corresponded to three different types of TME, namely immune-inflamed, immune-excluded, and immune-desert, respectively. In addition, the m6A regulators accurately reflected the individualized microenvironment in HCC and showed their highest expression levels in the stromal microenvironment. Moreover, the m6A score system made accurate predictions not only of the clinical traits, survival, and TME mentioned above, but also of the sensitivity of HCC patients to immunotherapy and chemotherapy. This study revealed the uniqueness and pluripotency of m6A regulators in the TME of HCC by combining single-cell sequencing and bulk sequencing. The quantified m6A modification indices were able to accurately predict patient survival expectations, clinical traits, TME, and sensitivity to immunotherapy and chemotherapy.


Sensors ◽  
2021 ◽  
Vol 21 (23) ◽  
pp. 8017
Author(s):  
Nurfazrina M. Zamry ◽  
Anazida Zainal ◽  
Murad A. Rassam ◽  
Eman H. Alkhammash ◽  
Fuad A. Ghaleb ◽  
...  

Wireless sensor networks (WSNs) have received significant attention from research and development because they are used to collect data in fields such as smart cities, power grids, transportation systems, medical sectors, military, and rural areas. Accurate and reliable measurements for insightful data analysis and decision-making are the ultimate goals of sensor networks for critical domains. However, the raw data collected by WSNs are usually unreliable and inaccurate due to the imperfect nature of WSNs. Identifying misbehaviours or anomalies in the network is important for providing reliable and secure functioning of the network. However, due to resource constraints, a lightweight detection scheme is a major design challenge in sensor networks. This paper aims at designing and developing a lightweight anomaly detection scheme that improves efficiency by reducing computational complexity and communication overhead and improving memory utilization while maintaining high accuracy. To achieve this aim, one-class learning and dimension reduction concepts were used in the design. The One-Class Support Vector Machine (OCSVM) with hyper-ellipsoid variance was used for anomaly detection due to its advantage in classifying unlabelled and multivariate data. Various OCSVM formulations were investigated, and the Centred-Ellipsoid formulation was adopted in this study because it was the most effective among the formulations studied. To decrease the computational complexity and improve memory utilization, the dimensions of the data were reduced using the Candid Covariance-Free Incremental Principal Component Analysis (CCIPCA) algorithm. Extensive experiments were conducted to evaluate the proposed lightweight anomaly detection scheme. Results in terms of detection accuracy, memory utilization, computational complexity, and communication overhead show that the proposed scheme is effective and efficient compared with the few existing schemes evaluated. The proposed anomaly detection scheme achieved an accuracy higher than 98%, with $O(nd)$ memory utilization and no communication overhead.
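The memory argument for incremental PCA can be made concrete with a minimal sketch of the CCIPCA update for the first principal component only. The sketch makes several simplifying assumptions that are not taken from the paper: the samples are already centred, no amnesic weighting is used, only one component is tracked, and the data stream is synthetic. The one-class SVM stage is not shown.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def ccipca_first_component(samples):
    """Estimate the first principal direction one sample at a time.

    Only the current estimate `v` is stored; no covariance matrix is built.
    Update rule (no amnesic parameter):
        v_n = ((n - 1) / n) * v_{n-1} + (1 / n) * u_n * (u_n . v_{n-1} / ||v_{n-1}||)
    """
    v = None
    for n, u in enumerate(samples, start=1):
        if v is None:
            v = list(u)  # initialise with the first sample
            continue
        length = norm(v)
        projection = sum(a * b for a, b in zip(u, v)) / length
        v = [((n - 1) / n) * vi + (1 / n) * ui * projection for vi, ui in zip(v, u)]
    return v


# Hypothetical centred 2-D stream: large variance along (0.6, 0.8), small across it.
main_dir = (0.6, 0.8)
cross_dir = (-0.8, 0.6)
samples = []
for i in range(1, 201):
    t = 5.0 * math.sin(i)        # spread along the main direction
    s = 0.1 * math.cos(3 * i)    # small perpendicular noise
    samples.append([t * main_dir[0] + s * cross_dir[0],
                    t * main_dir[1] + s * cross_dir[1]])

v = ccipca_first_component(samples)
cosine = abs(sum(a * b for a, b in zip(v, main_dir))) / norm(v)
print(round(cosine, 4))
```

The estimate aligns closely with the dominant direction of the stream while keeping only a single vector in memory per component, which is the property that makes this family of algorithms attractive on resource-constrained sensor nodes.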


Author(s):  
Mohsen Moshki ◽  
Mehran Garmehi ◽  
Peyman Kabiri

In this chapter, the application of Principal Component Analysis (PCA) and one of its extensions to intrusion detection is investigated. This extended version of PCA is modified to cover an important shortcoming of traditional PCA. To evaluate these modifications, it is mathematically proved that they are beneficial, and a known dataset, the DARPA99 dataset, is then used to verify the results experimentally. To verify this approach, the traditional PCA is initially used to preprocess the dataset. Later on, using a simple classifier such as KNN, the effectiveness of the multiclass classification is studied. In the reported work, instead of traditional PCA, a revised version of PCA named Weighted PCA (WPCA) is used for feature extraction. The results from applying the aforementioned method to the DARPA99 dataset show that this approach achieves better accuracy than traditional PCA when the number of features is limited, the number of classes is large, and the class populations are unbalanced. In some situations WPCA outperforms traditional PCA by more than 1% in accuracy.

