Comparing Boosting and Bagging for Decision Trees of Rankings

Author(s):  
Antonella Plaia ◽  
Simona Buscemi ◽  
Johannes Fürnkranz ◽  
Eneldo Loza Mencía

Abstract Decision tree learning is among the most popular and most traditional families of machine learning algorithms. While these techniques excel in being quite intuitive and interpretable, they also suffer from instability: small perturbations in the training data may result in big changes in the predictions. The so-called ensemble methods combine the output of multiple trees, which makes the decision more reliable and stable. They have been primarily applied to numeric prediction problems and to classification tasks. In recent years, some attempts to extend ensemble methods to ordinal data can be found in the literature, but no concrete methodology has been provided for preference data. In this paper, we extend decision trees, and subsequently ensemble methods, to ranking data. In particular, we propose a theoretical and computational definition of bagging and boosting, two of the best-known ensemble methods. In an experimental study using simulated data and real-world datasets, our results confirm that known results from classification, such as boosting outperforming bagging, can be successfully carried over to the ranking case.
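
Bagging, one of the two ensemble methods extended here, is easy to illustrate in the plain classification setting the paper starts from. The sketch below is a minimal, stdlib-only illustration using an invented toy 1-D dataset and depth-1 threshold "stumps" as the unstable base learners; it is not the paper's ranking-specific method.

```python
import random
from collections import Counter

random.seed(0)

# Invented toy 1-D binary data: label 1 above a threshold of 0.5.
X = [i / 20 for i in range(40)]
y = [1 if x > 0.5 else 0 for x in X]

def train_stump(data):
    """Fit a depth-1 'tree': the threshold with the fewest errors."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in data}):
        err = sum((1 if x > t else 0) != label for x, label in data)
        err = min(err, len(data) - err)  # allow flipped polarity
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    raw = sum((1 if x > best_t else 0) != label for x, label in data)
    flipped = raw > len(data) / 2
    t = best_t
    return lambda x: (0 if x > t else 1) if flipped else (1 if x > t else 0)

def bagging(X, y, n_estimators=5):
    """Train each stump on a bootstrap resample; predict by majority vote."""
    data = list(zip(X, y))
    stumps = [train_stump([random.choice(data) for _ in data])
              for _ in range(n_estimators)]
    return lambda x: Counter(s(x) for s in stumps).most_common(1)[0][0]

model = bagging(X, y)
preds = [model(x) for x in X]
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
```

Boosting differs in that resampling is replaced by reweighting: each successive stump concentrates on the examples its predecessors got wrong.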

Sensors ◽  
2021 ◽  
Vol 21 (8) ◽  
pp. 2849
Author(s):  
Sungbum Jun

Due to the recent advance of the industrial Internet of Things (IoT) in manufacturing, the vast amount of data from sensors has triggered the need to leverage such big data for fault detection. In particular, interpretable machine learning techniques, such as tree-based algorithms, have drawn attention for implementing reliable manufacturing systems and identifying the root causes of faults. However, despite the high interpretability of decision trees, tree-based models make a trade-off between accuracy and interpretability. In order to improve the tree's performance while maintaining its interpretability, an evolutionary algorithm for the discretization of multiple attributes, called Decision tree Improved by Multiple sPLits with Evolutionary algorithm for Discretization (DIMPLED), is proposed. The experimental results with two real-world datasets from sensors showed that the decision tree improved by DIMPLED outperformed the single-decision-tree models (C4.5 and CART) that are widely used in practice, and proved competitive with ensemble methods, which combine multiple decision trees. Even though the ensemble methods could produce slightly better performance, the proposed DIMPLED has a more interpretable structure while maintaining an appropriate performance level.
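
The abstract does not give DIMPLED's exact encoding, but the general shape of evolutionary discretization can be sketched with the standard library alone. Below, a chromosome is a list of candidate cut points for one numeric attribute, and the fitness function (bin purity, an assumption made here for illustration) rewards cut points that separate the classes; the toy data are invented.

```python
import random
from collections import Counter

random.seed(1)

# Invented toy data: one numeric attribute, three class regions.
X = [random.uniform(0, 3) for _ in range(300)]
y = [int(x) for x in X]  # class 0 on [0,1), 1 on [1,2), 2 on [2,3)

def discretize(x, cuts):
    """Bin index of x under the (sorted) cut points."""
    return sum(x > c for c in sorted(cuts))

def fitness(cuts):
    """Fraction of points matching the majority class of their bin."""
    bins = {}
    for x, label in zip(X, y):
        bins.setdefault(discretize(x, cuts), []).append(label)
    hits = sum(Counter(labels).most_common(1)[0][1] for labels in bins.values())
    return hits / len(X)

def evolve(n_cuts=2, pop_size=20, generations=40):
    """Keep the fitter half each generation; children are jittered parents."""
    pop = [[random.uniform(0, 3) for _ in range(n_cuts)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = [[c + random.gauss(0, 0.1) for c in random.choice(parents)]
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()  # should land near the true cut points 1.0 and 2.0
```

A decision tree built on the discretized attribute then needs only one split level per bin boundary, which is where the interpretability gain comes from.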


2021 ◽  
Author(s):  
Alena Orlenko ◽  
Jason H Moore

Abstract Background: Non-additive interactions among genes are frequently associated with a number of phenotypes, including known complex diseases such as Alzheimer's, diabetes, and cardiovascular disease. Detecting interactions requires careful selection of analytical methods, and some machine learning algorithms are unable or underpowered to detect or model feature interactions that exhibit non-additivity. The Random Forest method is often employed in these efforts due to its ability to detect and model non-additive interactions. In addition, Random Forest has the built-in ability to estimate feature importance scores, a characteristic that allows the model to be interpreted with the order and effect size of the feature association with the outcome. This characteristic is very important for epidemiological and clinical studies, where the results of predictive modeling could be used to define the future direction of research efforts. Alternative ways to interpret the model include the permutation feature importance metric, which calculates a feature contribution coefficient in units of the decrease in the model's performance, and Shapley additive explanations, which employ a cooperative game theory approach. Currently, it is unclear which Random Forest feature importance metric provides a superior estimation of the true informative contribution of features in genetic association analysis. Results: To address this issue, and to improve the interpretability of Random Forest predictions, we compared different methods for feature importance estimation on real and simulated datasets with non-additive interactions. We detected a discrepancy between the metrics for the real-world datasets and further established that the permutation feature importance metric provides more precise feature importance rank estimation for the simulated datasets with non-additive interactions.
Conclusions: By analyzing both real and simulated data, we established that the permutation feature importance metric provides more precise feature importance rank estimation in the presence of non-additive interactions.
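
The permutation feature importance metric described above needs no ML library to demonstrate: shuffle one feature column, re-score the model, and report the mean drop in performance. The two-feature toy dataset and the stand-in "fitted model" below are invented for illustration.

```python
import random

random.seed(2)

# Invented toy data: the outcome depends on feature 0 only; feature 1 is noise.
X = [[random.random(), random.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]

def model(row):
    """Stand-in for a fitted model that has learned the true rule."""
    return 1 if row[0] > 0.5 else 0

def accuracy(rows, labels):
    return sum(model(r) == t for r, t in zip(rows, labels)) / len(labels)

def permutation_importance(rows, labels, feature, n_repeats=10):
    """Mean decrease in accuracy when one feature column is shuffled."""
    base = accuracy(rows, labels)
    drops = []
    for _ in range(n_repeats):
        col = [row[feature] for row in rows]
        random.shuffle(col)
        permuted = [row[:feature] + [v] + row[feature + 1:]
                    for row, v in zip(rows, col)]
        drops.append(base - accuracy(permuted, labels))
    return sum(drops) / n_repeats

imp0 = permutation_importance(X, y, 0)  # large: shuffling breaks the model
imp1 = permutation_importance(X, y, 1)  # zero: the model ignores feature 1
```

Impurity-based importances, by contrast, are read off the fitted trees themselves and can be biased toward high-cardinality features, which is one reason comparisons like this one matter.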


2019 ◽  
Author(s):  
Michael Schneider ◽  
Lichao Wang ◽  
Carsten Marr

Abstract Most machine learning algorithms require that training data are identically distributed to ensure effective learning. In biological studies, however, even small variations in the experimental setup can lead to substantial deviations. Domain adaptation offers tools to deal with this problem. It is particularly useful for cases where only a small amount of training data is available in the domain of interest, while a large amount of training data is available in a different, but relevant, domain. We investigated to what extent domain adaptation was able to improve prediction accuracy for complex biological data. To that end, we used simulated data and time-lapse movies of differentiating blood stem cells in different cell cycle stages from multiple experiments, and compared three commonly used domain adaptation approaches. EasyAdapt, a simple technique of structured pooling of related data sets, was able to improve accuracy when classifying the simulated data and cell cycle stages from microscopic images. Moreover, the technique proved robust to the negative impact on classification accuracy that is common in other techniques that build models with heterogeneous data. Despite its implementation simplicity, EasyAdapt consistently produced more accurate predictions than conventional techniques. Domain adaptation can therefore substantially reduce the effort required to create large amounts of annotated training data in the domain of interest whenever the domain changes even slightly, which is common not only in biological experiments but in almost all data collection routines.
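
EasyAdapt is commonly identified with the "frustratingly easy" feature-augmentation scheme of Daumé III (2007); assuming that formulation is the one meant here, each feature vector gets a shared copy plus one domain-specific copy per domain, and any ordinary classifier is then trained on the pooled, augmented data.

```python
def easyadapt(x, domain, domains=("source", "target")):
    """Feature augmentation a la Daume III (2007): one shared copy of the
    feature vector plus one domain-specific copy per domain, with the
    copies for all other domains zeroed out."""
    zero = [0.0] * len(x)
    out = list(x)  # shared ("general") copy
    for d in domains:
        out += list(x) if d == domain else zero
    return out

# A source point fills the shared and source slots, zeroing the target slot.
easyadapt([1.0, 2.0], "source")  # [1.0, 2.0, 1.0, 2.0, 0.0, 0.0]
easyadapt([1.0, 2.0], "target")  # [1.0, 2.0, 0.0, 0.0, 1.0, 2.0]
```

The shared slot lets the classifier learn what the domains have in common, while the per-domain slots absorb domain-specific deviations, which is why pooling heterogeneous experiments this way tends not to hurt accuracy.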


Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Iftikhar Ahmad ◽  
Muhammad Yousaf ◽  
Suhail Yousaf ◽  
Muhammad Ovais Ahmad

The advent of the World Wide Web and the rapid adoption of social media platforms (such as Facebook and Twitter) paved the way for information dissemination on a scale never before witnessed in human history. With the current usage of social media platforms, consumers are creating and sharing more information than ever before, some of which is misleading, with no relevance to reality. Automated classification of a text article as misinformation or disinformation is a challenging task. Even an expert in a particular domain has to explore multiple aspects before giving a verdict on the truthfulness of an article. In this work, we propose a machine learning ensemble approach for the automated classification of news articles. Our study explores different textual properties that can be used to distinguish fake content from real. Using those properties, we train a combination of different machine learning algorithms with various ensemble methods and evaluate their performance on four real-world datasets. Experimental evaluation confirms the superior performance of our proposed ensemble learner approach in comparison to individual learners.
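
One common way to realize the ensemble step described above is simple majority voting over the base learners' verdicts; the three hypothetical classifiers and their per-article outputs below are invented for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists by per-article majority vote."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions)]

# Three hypothetical base learners' verdicts on four articles:
clf_a = ["fake", "real", "fake", "real"]
clf_b = ["fake", "fake", "fake", "real"]
clf_c = ["real", "real", "fake", "real"]

verdict = majority_vote([clf_a, clf_b, clf_c])
# ['fake', 'real', 'fake', 'real']
```

Weighted voting or stacking (training a meta-learner on the base outputs) are the usual refinements when the base learners differ markedly in accuracy.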


2018 ◽  
Vol 6 (2) ◽  
pp. 283-286
Author(s):  
M. Samba Siva Rao ◽  
M.Yaswanth ◽  
K. Raghavendra Swamy ◽  
...  

2021 ◽  
Vol 13 (2) ◽  
pp. 238
Author(s):  
Zhice Fang ◽  
Yi Wang ◽  
Gonghao Duan ◽  
Ling Peng

This study presents a new ensemble framework to predict landslide susceptibility by integrating decision trees (DTs) with the rotation forest (RF) ensemble technique. The proposed framework mainly includes four steps. First, training and validation sets are randomly selected according to historical landslide locations. Then, landslide conditioning factors are selected and screened by the gain ratio method. Next, several training subsets are produced from the training set, and a series of trained DTs is obtained by using a DT as a base classifier coupled with the different training subsets. Finally, the resultant landslide susceptibility map is produced by combining all the DT classification results using the RF ensemble technique. Experimental results demonstrate that the performance of all the DTs can be effectively improved by integrating them with the RF ensemble technique. Specifically, the proposed ensemble methods achieved predictive values 0.012–0.121 higher than the individual DTs in terms of area under the curve (AUC). Furthermore, the proposed ensemble methods outperformed the most popular ensemble methods by 0.005–0.083 in terms of AUC. Therefore, the proposed ensemble framework is effective for further improving the spatial prediction of landslides.
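
The distinctive step of rotation forest is that each base tree is trained on a rotated copy of the feature space. A much-simplified, stdlib-only sketch appears below: it rotates disjoint feature pairs by fixed angles, whereas the actual method derives its rotation matrices from PCA on random feature subsets and bootstrap samples.

```python
import math

def rotate_pair(x1, x2, theta):
    """Apply a 2-D rotation to one feature pair."""
    return (x1 * math.cos(theta) - x2 * math.sin(theta),
            x1 * math.sin(theta) + x2 * math.cos(theta))

def rotation_view(rows, theta_by_pair):
    """One 'rotated' copy of the dataset: each disjoint feature pair gets
    its own angle, standing in for the per-subset PCA rotations of the
    real method; a base DT would then be trained on this view."""
    out = []
    for row in rows:
        new = []
        for i, theta in zip(range(0, len(row), 2), theta_by_pair):
            new.extend(rotate_pair(row[i], row[i + 1], theta))
        out.append(new)
    return out

# First pair rotated 90 degrees, second pair left unchanged.
view = rotation_view([[1.0, 0.0, 0.0, 1.0]], [math.pi / 2, 0.0])
```

Each base learner gets a different rotation, so the ensemble's trees see decorrelated views of the same data, which is the diversity rotation forest relies on.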

