Stacked Generalization: An Introduction to Super Learning

2017 ◽  
Author(s):  
Ashley I. Naimi ◽  
Laura B. Balzer

Abstract Stacked generalization is an ensemble method that allows researchers to combine several different prediction algorithms into one. Since its introduction in the early 1990s, the method has evolved several times into what is now known as “Super Learner”. Super Learner uses V-fold cross-validation to build the optimal weighted combination of predictions from a library of candidate algorithms. Optimality is defined by a user-specified objective function, such as minimizing mean squared error or maximizing the area under the receiver operating characteristic curve. Although relatively simple in nature, use of the Super Learner by epidemiologists has been hampered by limitations in understanding conceptual and technical details. We work step-by-step through two examples to illustrate concepts and address common concerns.
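
As a rough illustration of the idea described in this abstract, the following minimal Python sketch builds out-of-fold predictions with V-fold cross-validation and then picks the convex weight that minimizes cross-validated mean squared error. The two candidate learners (`fit_mean`, `fit_line`), the grid search over a single weight, and the simulated data are all assumptions made for illustration; they are not the authors' worked examples or the SuperLearner software.

```python
import random

def fit_mean(xs, ys):
    """Candidate 1 (hypothetical): ignore x and predict the training mean."""
    mu = sum(ys) / len(ys)
    return lambda x: mu

def fit_line(xs, ys):
    """Candidate 2 (hypothetical): simple least-squares line y = a + b*x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def cv_predictions(xs, ys, learners, V=5):
    """Out-of-fold predictions: one row per observation, one column per learner."""
    n = len(xs)
    Z = [[0.0] * len(learners) for _ in range(n)]
    for v in range(V):
        val = [i for i in range(n) if i % V == v]   # validation fold v
        trn = [i for i in range(n) if i % V != v]   # remaining V-1 folds
        for k, fit in enumerate(learners):
            model = fit([xs[i] for i in trn], [ys[i] for i in trn])
            for i in val:
                Z[i][k] = model(xs[i])
    return Z

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def super_learner_weight(ys, Z, steps=100):
    """Grid-search the weight w on learner 2 (1 - w on learner 1) minimizing CV MSE."""
    best_w, best_risk = 0.0, float("inf")
    for s in range(steps + 1):
        w = s / steps
        risk = mse(ys, [(1 - w) * z[0] + w * z[1] for z in Z])
        if risk < best_risk:
            best_w, best_risk = w, risk
    return best_w, best_risk

# Simulated data (an assumption for the sketch): a noisy straight line.
random.seed(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + random.gauss(0, 0.5) for x in xs]

Z = cv_predictions(xs, ys, [fit_mean, fit_line], V=5)
w, cv_risk = super_learner_weight(ys, Z)
risk_mean = mse(ys, [z[0] for z in Z])
risk_line = mse(ys, [z[1] for z in Z])
```

Because the grid includes the weights 0 and 1, the weighted combination can never have a higher cross-validated risk than either candidate alone. A fuller implementation would handle more than two learners and refit each learner on the full data before combining them.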

2020 ◽  
Author(s):  
Rafael Massahiro Yassue ◽  
José Felipe Gonzaga Sabadin ◽  
Giovanni Galli ◽  
Filipe Couto Alves ◽  
Roberto Fritsche-Neto

Abstract Usually, genomic prediction models are compared using validation schemes such as Repeated Random Subsampling (RRS) or K-fold cross-validation. Nevertheless, the design of the training and validation sets strongly affects how, and how subjectively, models are compared. The procedures cited above overlap across replicates, which can lead to overestimation and a lack of residual independence due to resampling, and thus to less accurate results. Furthermore, post hoc tests such as ANOVA are not recommended because the assumption of residual independence is not met. Thus, we propose a new way to sample observations to build training and validation sets based on a cross-validation alpha-based design (CV-α). The CV-α was meant to create several validation scenarios (replicates × folds), regardless of the number of treatments. Using CV-α, the number of genotypes in the same fold across replicates was much lower than with K-fold, indicating higher residual independence. Therefore, based on the CV-α results, as a proof of concept, we could use ANOVA to compare the proposed methodology to RRS and K-fold, applying four genomic prediction models to a simulated and a real dataset. Concerning predictive ability and bias, all validation methods showed similar performance. However, regarding the mean squared error and coefficient of variation, the CV-α method presented the best performance under the evaluated scenarios. Moreover, as it adds neither cost nor complexity, it is more reliable and allows the use of non-subjective methods to compare models and factors. Therefore, CV-α can be considered a more precise validation methodology for model selection.
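
The overlap this abstract refers to can be made concrete by counting how often two genotypes land in the same fold across replicates. The sketch below does this for plain repeated random K-fold only; it does not implement the CV-α design, and the helper names (`random_kfold`, `cooccurrence`) and the toy sizes are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def random_kfold(n_items, k, rng):
    """Randomly assign items 0..n_items-1 to k folds of (near-)equal size."""
    order = list(range(n_items))
    rng.shuffle(order)
    return [order[f::k] for f in range(k)]

def cooccurrence(n_items, k, replicates, seed=0):
    """For every pair of items, count the replicates that put both in one fold."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        for fold in random_kfold(n_items, k, rng):
            for pair in combinations(sorted(fold), 2):
                counts[pair] += 1
    return counts

# Toy setting (an assumption): 12 genotypes, 3 folds, 4 replicates.
counts = cooccurrence(n_items=12, k=3, replicates=4)
total_pairs = sum(counts.values())                    # co-occurrences over all replicates
repeated = sum(1 for c in counts.values() if c > 1)   # pairs sharing a fold more than once
max_overlap = max(counts.values())
```

In this toy setting there are 72 within-fold pairings spread over only 66 distinct pairs, so at least one pair must share a fold in more than one replicate. A design that spreads pairs more evenly across replicates would keep `max_overlap` and `repeated` low.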


Author(s):  
Mark J. van der Laan ◽  
Eric C Polley ◽  
Alan E. Hubbard

When trying to learn a model for the prediction of an outcome given a set of covariates, a statistician has many estimation procedures in their toolbox. A few examples of these candidate learners are: least squares, least angle regression, random forests, and spline regression. Previous articles (van der Laan and Dudoit (2003); van der Laan et al. (2006); Sinisi et al. (2007)) theoretically validated the use of cross validation to select an optimal learner among many candidate learners. Motivated by this use of cross validation, we propose a new prediction method for creating a weighted combination of many candidate learners to build the super learner. This article proposes a fast algorithm for constructing a super learner in prediction which uses V-fold cross-validation to select weights to combine an initial set of candidate learners. In addition, this paper contains a practical demonstration of the adaptivity of this so called super learner to various true data generating distributions. This approach for construction of a super learner generalizes to any parameter which can be defined as a minimizer of a loss function.
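
The weighted combination described in this abstract can be written compactly as follows. The notation is an assumption introduced here rather than taken from the abstract: $\hat{\psi}_{k,-v}$ is candidate learner $k$ fit without validation fold $v$, $\mathrm{Val}(v)$ is the index set of fold $v$, $L$ is the chosen loss function, and $\mathcal{A}$ is the set of allowed weight vectors (for example, non-negative weights summing to one).

```latex
\hat{\alpha} \;=\; \arg\min_{\alpha \in \mathcal{A}}
  \sum_{v=1}^{V} \sum_{i \in \mathrm{Val}(v)}
  L\!\Big( Y_i,\; \sum_{k=1}^{K} \alpha_k\, \hat{\psi}_{k,-v}(X_i) \Big),
\qquad
\hat{\psi}_{\mathrm{SL}}(x) \;=\; \sum_{k=1}^{K} \hat{\alpha}_k\, \hat{\psi}_k(x)
```

Here $\hat{\psi}_k$ denotes learner $k$ refit on the full data set. With squared-error loss, the first expression reduces to a regression of the outcomes on the cross-validated predictions.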


F1000Research ◽  
2020 ◽  
Vol 9 ◽  
pp. 113 ◽  
Author(s):  
Marcel Baltruschat ◽  
Paul Czodrowski

We present a small molecule pKa prediction tool entirely written in Python. It predicts the macroscopic pKa value and is trained on a literature compilation of monoprotic compounds. Different machine learning models were tested, and random forest performed best in a five-fold cross-validation (mean absolute error = 0.682, root mean squared error = 1.032, correlation coefficient r² = 0.82). We test our model on two external validation sets, where it performs comparably to Marvin and better than a recently published open source model. Our Python tool and all data are freely available at https://github.com/czodrowskilab/Machine-learning-meets-pKa.
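
For readers unfamiliar with the three metrics quoted above, here is a minimal standard-library sketch of how they are typically computed from held-out predictions. The pKa values are made up for illustration and are not from the paper; treating "r²" as the squared Pearson correlation is also an assumption, and the abstract does not say whether the metrics were pooled or averaged across folds.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def squared_pearson_r(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    vt = sum((t - mt) ** 2 for t in y_true)
    vp = sum((p - mp) ** 2 for p in y_pred)
    return cov * cov / (vt * vp)

# Hypothetical pKa values, for illustration only.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 4.0, 5.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
r2 = squared_pearson_r(y_true, y_pred)
```

RMSE penalizes large errors more heavily than MAE, which is why the reported RMSE (1.032) exceeds the reported MAE (0.682).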


Author(s):  
Yu Zhang ◽  
Cangzhi Jia ◽  
Chee Keong Kwoh

Abstract Long noncoding RNAs (lncRNAs) play significant roles in various physiological and pathological processes via their interactions with biomolecules like DNA, RNA and protein. The existing in silico methods used for predicting the functions of lncRNA mainly rely on calculating the similarity of lncRNA or investigating whether an lncRNA can interact with a specific biomolecule or disease. In this work, we explored the functions of lncRNA from a different perspective: we presented a tool for predicting the interaction biomolecule type for a given lncRNA. For this purpose, we first investigated the main molecular mechanisms of the interactions of lncRNA–RNA, lncRNA–protein and lncRNA–DNA. Then, we developed an ensemble deep learning model: lncIBTP (lncRNA Interaction Biomolecule Type Prediction). This model predicted the interactions between lncRNA and different types of biomolecules. On the 5-fold cross-validation, the lncIBTP achieves average values of 0.7042 in accuracy, 0.7903 and 0.6421 in macro-average area under receiver operating characteristic curve and precision–recall curve, respectively, which illustrates the model effectiveness. Besides, based on the analysis of the collected published data and prediction results, we hypothesized that the characteristics of lncRNAs that interacted with DNA may be different from those that interacted with only RNA.


2021 ◽  
Vol 22 (24) ◽  
pp. 13607
Author(s):  
Zhou Huang ◽  
Yu Han ◽  
Leibo Liu ◽  
Qinghua Cui ◽  
Yuan Zhou

MicroRNAs (miRNAs) are associated with various complex human diseases and some miRNAs can be directly involved in the mechanisms of disease. Identifying disease-causative miRNAs can provide novel insight into disease pathogenesis from a miRNA perspective and facilitate disease treatment. To date, various computational models have been developed to predict general miRNA–disease associations, but few models are available to further prioritize causal miRNA–disease associations from non-causal associations. Therefore, in this study, we constructed a Levenshtein-Distance-Enhanced miRNA–Disease Causal Association Predictor (LE-MDCAP) to predict potential causal miRNA–disease associations. Specifically, Levenshtein distance matrices covering the sequence, expression and functional miRNA similarities were introduced to enhance the previous Gaussian interaction profile kernel-based similarity matrix. LE-MDCAP integrated miRNA similarity matrices, the disease semantic similarity matrix and known causal miRNA–disease associations to make predictions. For the regular causal vs. non-disease association discrimination task, LE-MDCAP achieved an area under the receiver operating characteristic curve (AUROC) of 0.911 and 0.906 in 10-fold cross-validation and independent test, respectively. More importantly, LE-MDCAP prominently outperformed the previous MDCAP model in distinguishing causal versus non-causal miRNA–disease associations (AUROC 0.820 vs. 0.695). Case studies performed on diabetic retinopathy and hsa-mir-361 also validated the accuracy of our model. In summary, LE-MDCAP could be useful for screening causal miRNA–disease associations from general miRNA–disease associations.
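
The AUROC values reported here can be read as the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair. The following minimal sketch computes AUROC from that pairwise definition; the labels and scores are invented for illustration and have nothing to do with the LE-MDCAP data.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 1/2."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = causal association) and prediction scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
example_auc = auroc(labels, scores)
```

This quadratic-time version is only meant to make the definition concrete; rank-based implementations give the same value far more efficiently on large data sets.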


Author(s):  
Felipe Guimarães Teixeira ◽  
Paulo Tadeu Cardozo Ribeiro Rosa ◽  
Roger Gomes Tavares Mello ◽  
Jurandir Nadal

Purpose: The study aimed to identify the variables that differentiate judo athletes at national and regional levels. Multivariable analysis was applied to biomechanical, anthropometric, and Special Judo Fitness Test (SJFT) data. Method: Forty-two male judo athletes from 2 competitive groups (14 national and 28 state levels) performed the following measurements and tests: (1) skinfold thickness, (2) circumference, (3) bone width, (4) longitudinal length, (5) stabilometric tests, (6) dynamometric tests, and (7) SJFT. The variables with significant differences in the Wilcoxon rank-sum test were used in stepwise logistic regression to select those that best separate the groups. The authors considered models with a maximum of 3 variables to avoid overfitting. They used 7-fold cross validation to calculate optimism-corrected measures of model performance. Results: The 3 variables that best differentiated the groups were the epicondylar humerus width, the total number of throws on the SJFT, and the stabilometric mean velocity of the center of pressure in the mediolateral direction. The area under the receiver-operating-characteristic curve for the model (based on 7-fold cross validation) was 0.95. Conclusion: This study suggests that a reduced set of anthropometric, biomechanical, and SJFT variables can differentiate judo athletes’ levels.


2019 ◽  
Vol 35 (23) ◽  
pp. 4922-4929 ◽  
Author(s):  
Zhao-Chun Xu ◽  
Peng-Mian Feng ◽  
Hui Yang ◽  
Wang-Ren Qiu ◽  
Wei Chen ◽  
...  

Abstract Motivation Dihydrouridine (D) is a common RNA post-transcriptional modification found in eukaryotes, bacteria and a few archaea. The modification can promote the conformational flexibility of individual nucleotide bases, and its levels are increased in cancerous tissues. Therefore, it is necessary to detect D in RNA to further understand its functional roles. Since wet-experimental techniques for this purpose are time-consuming and laborious, there is an urgent need for computational models that identify D modification sites in RNA. Results We constructed a predictor, called iRNAD, for identifying D modification sites in RNA sequences. In this predictor, the RNA samples derived from five species were encoded by nucleotide chemical property and nucleotide density. A support vector machine was used to perform the classification. The final model achieved an overall accuracy of 96.18% with an area under the receiver operating characteristic curve of 0.9839 in the jackknife cross-validation test. Furthermore, we performed a series of validations from several aspects and demonstrated the robustness and reliability of the proposed model. Availability and implementation A user-friendly web-server called iRNAD is freely accessible at http://lin-group.cn/server/iRNAD, which will provide convenience and guidance to users for further studying D modification.
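
A jackknife cross-validation test, as the term is commonly used in this literature, holds out each sample in turn, trains on the remainder, and scores the held-out sample (leave-one-out). The minimal sketch below shows that loop; to stay dependency-free it swaps in a 1-nearest-neighbour classifier where the paper uses a support vector machine, and the feature vectors are invented for illustration.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Stand-in classifier (1-nearest neighbour); the paper itself uses an SVM."""
    best = min(range(len(train_X)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(train_X[j], x)))
    return train_y[best]

def jackknife_accuracy(X, y, predict=nearest_neighbor_predict):
    """Hold out each sample once, train on the rest, and score the held-out sample."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)

# Hypothetical two-feature encodings; 1 = modification site, 0 = non-site.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 1.1], [0.9, 1.0], [0.4, 0.45]]
y = [0, 0, 0, 1, 1, 1]
acc = jackknife_accuracy(X, y)
```

The procedure trains one model per sample, so its cost grows linearly with the data set size, but it leaves no randomness in how the data are split.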


2005 ◽  
Vol 22 (2) ◽  
pp. 198-206 ◽  
Author(s):  
Phillip C. Usera ◽  
John T. Foley ◽  
Joonkoo Yun

The purpose of this study was to cross-validate skinfold and anthropometric measurements for individuals with Down syndrome (DS). Estimated body fat of 14 individuals with DS and 13 individuals without DS was compared between a criterion measurement (BOD POD®) and three prediction equations. Correlations between criterion and field-based tests for the non-DS and DS groups ranged from .81 – .94 and .11 – .54, respectively. Root-Mean-Squared-Error was employed to examine the amount of error in the field-based measurements. A MANOVA indicated significant differences in accuracy between groups for Jackson’s equation and Lohman’s equation. Based on the results, efforts should now be directed toward developing new equations that can assess the body composition of individuals with DS in a clinically feasible way.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Yemei Liu ◽  
Pei Yang ◽  
Yong Pi ◽  
Lisha Jiang ◽  
Xiao Zhong ◽  
...  

Abstract Background We aimed to construct an artificial intelligence (AI) guided method for identifying suspicious bone metastatic lesions in whole-body bone scintigraphy (WBS) images using convolutional neural networks (CNNs). Methods We retrospectively collected the 99mTc-MDP WBS images with confirmed bone lesions from 3352 patients with malignancy. 14,972 bone lesions were delineated manually by physicians and annotated as benign or malignant. The lesion-based differentiating performance of the proposed network was evaluated by fivefold cross validation, and compared with three other popular CNN architectures for medical imaging. The average sensitivity, specificity, accuracy and the area under receiver operating characteristic curve (AUC) were calculated. To explore the outcomes of this study further, we conducted subgroup analyses of the classifying ability of the CNN, including lesion burden number and tumor type. Results In the fivefold cross validation, our proposed network reached the best average accuracy (81.23%) in identifying suspicious bone lesions compared with InceptionV3 (80.61%), VGG16 (81.13%) and DenseNet169 (76.71%). Additionally, the CNN model's lesion-based average sensitivity and specificity were 81.30% and 81.14%, respectively. Based on the lesion burden numbers of each image, the AUC was 0.847 in the few group (lesion number n ≤ 3), 0.838 in the medium group (n = 4–6), and 0.862 in the extensive group (n > 6). For the three major primary tumor types, the CNN-based lesion identifying AUC value was 0.870 for lung cancer, 0.900 for prostate cancer, and 0.899 for breast cancer. Conclusion The CNN model shows potential for distinguishing suspicious benign and malignant bone lesions in whole-body bone scintigraphic images.
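
Sensitivity, specificity and accuracy, as reported above, all derive from the four cells of a confusion matrix. The sketch below computes them for a single set of predictions; the labels are invented for illustration, and the study's reported figures are averages over five folds, which this sketch does not reproduce.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for binary labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def sensitivity_specificity_accuracy(y_true, y_pred, positive=1):
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), accuracy = (TP+TN)/N."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / (tp + fp + tn + fn)

# Hypothetical lesion labels (1 = malignant, 0 = benign) and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec, acc = sensitivity_specificity_accuracy(y_true, y_pred)
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are imbalanced, because accuracy alone can hide poor performance on the rarer class.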

