0294 Assessing genomic prediction accuracy for Holstein sires using bootstrap aggregation sampling and leave-one-out cross validation

A. Mikshowsky; K. A. Weigel; D. Gianola

doi:10.2527/jam2016-0294

Assessing genomic prediction accuracy for Holstein sires using bootstrap aggregation sampling and leave-one-out cross validation

Journal of Dairy Science ◽

10.3168/jds.2016-11496 ◽

2017 ◽

Vol 100 (1) ◽

pp. 453-464 ◽

Cited By ~ 6

Author(s):

Ashley A. Mikshowsky ◽

Daniel Gianola ◽

Kent A. Weigel

Keyword(s):

Genomic Prediction ◽

Prediction Accuracy ◽

Cross Validation ◽

Leave One Out ◽

Bootstrap Aggregation

Pitfalls and Remedies for Cross Validation with Multi-trait Genomic Prediction Methods

10.1101/595397 ◽

2019 ◽

Author(s):

Daniel Runcie ◽

Hao Cheng

Keyword(s):

Genomic Prediction ◽

Prediction Accuracy ◽

Cross Validation ◽

Prediction Models ◽

Selection Index ◽

Parametric Method ◽

Multiple Traits ◽

Gold Standard Method ◽

Secondary Traits ◽

Validation Strategy

ABSTRACT Incorporating measurements on correlated traits into genomic prediction models can increase prediction accuracy and selection gain. However, multi-trait genomic prediction models are complex and prone to overfitting, which may result in a loss of prediction accuracy relative to single-trait genomic prediction. Cross-validation is considered the gold standard method for selecting and tuning models for genomic prediction in both plant and animal breeding. When used appropriately, cross-validation gives an accurate estimate of the prediction accuracy of a genomic prediction model, and can effectively choose among disparate models based on their expected performance in real data. However, we show that a naive cross-validation strategy applied to the multi-trait prediction problem can be severely biased and lead to sub-optimal choices between single- and multi-trait models when secondary traits are used to aid in the prediction of focal traits and these secondary traits are measured on the individuals to be tested. We use simulations to demonstrate the extent of the problem and propose three partial solutions: 1) a parametric solution from selection index theory, 2) a semi-parametric method for correcting the cross-validation estimates of prediction accuracy, and 3) a fully non-parametric method which we call CV2*: validating model predictions against focal trait measurements from genetically related individuals. The current excitement over high-throughput phenotyping suggests that more comprehensive phenotype measurements will be useful for accelerating breeding programs. Using an appropriate cross-validation strategy should more reliably determine if and when combining information across multiple traits is useful.

Comparison of LDA and SPRT on Clinical Dataset Classifications

Biomedical Informatics Insights ◽

10.4137/bii.s6935 ◽

2011 ◽

Vol 4 ◽

pp. BII.S6935 ◽

Cited By ~ 2

Author(s):

Chih Lee ◽

Brittany Nkounkou ◽

Chun-Hsi Huang

Keyword(s):

Learning Community ◽

Prediction Accuracy ◽

Cross Validation ◽

Error Rates ◽

Close Relative ◽

Classification Error ◽

Class Label ◽

Normality Assumption ◽

Clinical Dataset ◽

Leave One Out

In this work, we investigate the well-known classification algorithm LDA as well as its close relative SPRT. SPRT affords many theoretical advantages over LDA. It allows specification of desired classification error rates α and β and is expected to be faster in predicting the class label of a new instance. However, SPRT is not as widely used as LDA in the pattern recognition and machine learning community. For this reason, we investigate LDA, SPRT and a modified SPRT (MSPRT) empirically using clinical datasets from Parkinson's disease, colon cancer, and breast cancer. We make the same normality assumption as LDA and propose variants of the two SPRT algorithms based on the order in which the components of an instance are sampled. Leave-one-out cross-validation is used to assess and compare the performance of the methods. The results indicate that two variants, SPRT-ordered and MSPRT-ordered, are superior to LDA in terms of prediction accuracy. Moreover, on average SPRT-ordered and MSPRT-ordered examine fewer components than LDA before arriving at a decision. These advantages imply that SPRT-ordered and MSPRT-ordered are preferable to LDA when the normality assumption can be justified for a dataset.

How Population Structure Impacts Genomic Selection Accuracy in Cross-Validation: Implications for Practical Breeding

Frontiers in Plant Science ◽

10.3389/fpls.2020.592977 ◽

2020 ◽

Vol 11 ◽

Author(s):

Christian R. Werner ◽

R. Chris Gaynor ◽

Gregor Gorjanc ◽

John M. Hickey ◽

Tobias Kox ◽

...

Keyword(s):

Family Structure ◽

Genomic Selection ◽

Genomic Prediction ◽

Prediction Accuracy ◽

Cross Validation ◽

Careful Analysis ◽

Critical Approach ◽

Crop Species ◽

Breeding Programs ◽

Mendelian Sampling

Over the last two decades, the application of genomic selection has been extensively studied in various crop species, and it has become a common practice to report prediction accuracies using cross validation. However, genomic prediction accuracies obtained from random cross validation can be strongly inflated due to population or family structure, a characteristic shared by many breeding populations. An understanding of the effect of population and family structure on prediction accuracy is essential for the successful application of genomic selection in plant breeding programs. The objective of this study was to make this effect and its implications for practical breeding programs comprehensible for breeders and scientists with a limited background in quantitative genetics and genomic selection theory. We, therefore, compared genomic prediction accuracies obtained from different random cross validation approaches and within-family prediction in three different prediction scenarios. We used a highly structured population of 940 Brassica napus hybrids coming from 46 testcross families and two subpopulations. Our demonstrations show how genomic prediction accuracies obtained from among-family predictions in random cross validation and within-family predictions capture different measures of prediction accuracy. While among-family prediction accuracy measures prediction accuracy of both the parent average component and the Mendelian sampling term, within-family prediction only measures how accurately the Mendelian sampling term can be predicted. With this paper we aim to foster a critical approach to different measures of genomic prediction accuracy and a careful analysis of values observed in genomic selection experiments and reported in literature.
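
The distinction between the two accuracy measures can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' analysis: the family size of 20, the unit variances, and the hypothetical predictor (which recovers family means but carries no within-family signal) are all invented for illustration; only the count of 46 families is taken from the abstract.

```python
import numpy as np

def pooled_accuracy(y_true, y_pred):
    """Correlation of predicted and observed values across all individuals."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def within_family_accuracy(y_true, y_pred, families):
    """Mean of the per-family correlations between predicted and observed values."""
    families = np.asarray(families)
    cors = []
    for fam in np.unique(families):
        m = families == fam
        # Skip families too small or too uniform for a correlation to be defined.
        if m.sum() > 2 and np.std(y_true[m]) > 0 and np.std(y_pred[m]) > 0:
            cors.append(np.corrcoef(y_true[m], y_pred[m])[0, 1])
    return float(np.mean(cors))

rng = np.random.default_rng(1)
n_fam, fam_size = 46, 20                      # fam_size is an assumption
families = np.repeat(np.arange(n_fam), fam_size)

family_mean = rng.normal(0.0, 1.0, n_fam)[families]   # parent-average-like component
mendelian = rng.normal(0.0, 1.0, families.size)       # within-family deviation
y_true = family_mean + mendelian

# Hypothetical predictor: knows the family mean, adds only noise within families.
y_pred = family_mean + rng.normal(0.0, 0.5, families.size)

pooled = pooled_accuracy(y_true, y_pred)
within = within_family_accuracy(y_true, y_pred, families)
print(f"pooled accuracy: {pooled:.2f}, within-family accuracy: {within:.2f}")
```

In this toy setting the pooled correlation is substantial while the within-family correlation is close to zero, because the predictor captures only the between-family component.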

Optimization of treatment strategy by using a machine learning model to predict survival time of patients with malignant glioma after radiotherapy

Journal of Radiation Research ◽

10.1093/jrr/rrz066 ◽

2019 ◽

Vol 60 (6) ◽

pp. 818-824 ◽

Cited By ~ 2

Author(s):

Takuya Mizutani ◽

Taiki Magome ◽

Hiroshi Igaki ◽

Akihiro Haga ◽

Kanabu Nawa ◽

...

Keyword(s):

Machine Learning ◽

Malignant Glioma ◽

Survival Time ◽

Treatment Duration ◽

Prediction Accuracy ◽

Cross Validation ◽

Learning Model ◽

Machine Learning Model ◽

Prescription Dose ◽

Leave One Out

ABSTRACT The purpose of this study was to predict the survival time of patients with malignant glioma after radiotherapy with high accuracy by considering additional clinical factors, and to optimize the prescription dose and treatment duration for individual patients by using a machine learning model. A total of 35 patients with malignant glioma were included in this study. The candidate features included 12 clinical features and 192 dose–volume histogram (DVH) features. The input features and parameters of the support vector machine (SVM) were selected using a genetic algorithm based on Akaike’s information criterion for three feature sets: clinical only, DVH only, and clinical and DVH combined. The prediction accuracy of the SVM models was evaluated through a leave-one-out cross-validation test with residual error, which was defined as the absolute difference between the actual and predicted survival times after radiotherapy. Moreover, the influences of various values of prescription dose and treatment duration on the predicted survival time were evaluated. The prediction accuracy was significantly improved with the combined use of clinical and DVH features compared with the use of either feature set alone (P < 0.01, Wilcoxon signed rank test). The mean ± standard deviation of the leave-one-out residual error using the combined clinical and DVH features, only clinical features, and only DVH features was 104.7 ± 96.5, 144.2 ± 126.1 and 204.5 ± 186.0 days, respectively. The prediction accuracy could be improved with the combination of clinical and DVH features, and our results show the potential to optimize the treatment strategy for individual patients based on a machine learning model.
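
The evaluation scheme described above (leave-one-out cross-validation scored by the absolute difference between actual and predicted values) can be sketched as follows. This is a minimal illustration, not the study's pipeline: an ordinary least-squares model stands in for the SVM, and the function names `loo_absolute_errors`, `fit_linear`, and `predict_linear` are hypothetical.

```python
import numpy as np

def loo_absolute_errors(X, y, fit, predict):
    """Leave-one-out CV: refit without sample i, then score |y_i - prediction_i|."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i              # boolean mask excluding sample i
        model = fit(X[train], y[train])
        errors[i] = abs(y[i] - predict(model, X[i:i + 1])[0])
    return errors

def fit_linear(X, y):
    """Least-squares linear model with intercept (a stand-in for the SVM)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Predict with the coefficients returned by fit_linear."""
    return np.column_stack([np.ones(len(X)), X]) @ coef
```

A summary such as `errors.mean()` and `errors.std()` then gives a "mean ± standard deviation" figure of the kind reported in the abstract; any model exposing a fit/predict pair can be passed in place of the linear stand-in.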

Genome-wide mapping and prediction of plant architecture in a sorghum nested association mapping population

10.1101/2020.01.28.923540 ◽

2020 ◽

Cited By ~ 1

Author(s):

Marcus O. Olatoye ◽

Zhenbin Hu ◽

Geoffrey P. Morris

Keyword(s):

Association Mapping ◽

Genomic Prediction ◽

Prediction Accuracy ◽

Plant Architecture ◽

Cross Validation ◽

Mapping Population ◽

Nested Association Mapping ◽

Nested Association Mapping Population ◽

Association Mapping Population ◽

Fold Cross Validation

Abstract Modifying plant architecture is often necessary for yield improvement and climate adaptation, but we lack understanding of the genotype-phenotype map for plant morphology in sorghum. Here, we use a nested association mapping (NAM) population that captures global allelic diversity of sorghum to characterize the genetics of leaf erectness, leaf width (at two stages), and stem diameter. Recombinant inbred lines (n = 2200) were phenotyped in multiple environments (35,200 observations) and joint linkage mapping was performed with ∼93,000 markers. Fifty-four QTL of small to large effect were identified for trait BLUPs (9–16 per trait) each explaining 0.4–4% of variation across the NAM population. While some of these QTL colocalize with sorghum homologs of grass genes [e.g. involved in hormone synthesis (maize spi1), floral transition (SbCN8), and transcriptional regulation of development (rice Ideal plant architecture1)], most QTL did not colocalize with an a priori candidate gene (82%). Genomic prediction accuracy was generally high in five-fold cross-validation (0.65–0.83), and varied from low to high in leave-one-family-out cross-validation (0.04–0.61). The findings provide a foundation to identify the molecular basis of architecture variation in sorghum and establish genomic-enabled breeding for improved plant architecture.

Core ideas: Understanding the genetics of plant architecture could facilitate the development of crop ideotypes for yield and adaptation. The genetics of plant architecture traits was characterized in sorghum using multi-environment phenotyping in a global nested association mapping population. Fifty-five quantitative trait loci were identified; some colocalize with homologs of known developmental regulators but most do not. Genomic prediction accuracy was consistently high in five-fold cross-validation, but accuracy varied considerably in leave-one-family-out predictions.
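
The two validation schemes contrasted here differ only in how the test sets are formed. The sketch below shows one way to generate both kinds of split; the function names and the `family` label array are hypothetical and not taken from the paper.

```python
import numpy as np

def k_fold_splits(n, k=5, seed=0):
    """Random k-fold: every fold mixes individuals from all families."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)   # all indices not in the test fold
        yield train, np.sort(test)

def leave_one_family_out_splits(family):
    """Hold out one whole family at a time, so no relatives from it are in training."""
    family = np.asarray(family)
    for fam in np.unique(family):
        yield np.flatnonzero(family != fam), np.flatnonzero(family == fam)
```

Under random folds, close relatives of each test individual usually remain in the training set; leave-one-family-out removes them, which is one plausible reason the reported accuracies differ so widely between the two schemes.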

Static and non-linguistic quantitative indicators to evaluate Japanese comic dialogues of Manzai

Humor - International Journal of Humor Research ◽

10.1515/humor-2017-0111 ◽

2018 ◽

Vol 31 (1) ◽

pp. 39-64

Author(s):

Tetsuya Maeshiro

Keyword(s):

Prediction Accuracy ◽

Semantic Processing ◽

Cross Validation ◽

National Level ◽

Rank Correlation ◽

Time Sequence ◽

Sequence Information ◽

The Mean ◽

Leave One Out ◽

Quantitative Indicators

Abstract This paper proposes the use of quantitative indicators to evaluate the comedic success of Japanese “Manzai” performances without using semantic processing or time sequence information. The validity of the proposed indicators was verified by predicting the rankings of the final rounds and decision matches of ten M1 Grand Prix, a national-level humor contest in Japan, using leave-one-out cross validation. The results demonstrate that the proposed indicators are able to predict the ranking of Manzai championships, as the mean prediction precision was 0.58 (rank correlation) for final rounds, and 0.70 (champion prediction accuracy) for the decision matches.

Estimation of suspended sediment and dissolved solid load in a Mediterranean semiarid karst stream using log-linear models

Hydrology Research ◽

10.2166/nh.2018.062 ◽

2018 ◽

Vol 50 (1) ◽

pp. 43-59 ◽

Cited By ~ 3

Author(s):

Alberto Martínez-Salvador ◽

Carmelo Conesa-García

Keyword(s):

Sediment Transport ◽

Water Level ◽

Prediction Accuracy ◽

Cross Validation ◽

Linear Models ◽

Base Flow ◽

Solid Load ◽

Log Linear ◽

Leave One Out ◽

Southeast Spain

Abstract Many models have been developed to predict sediment transport in watercourses. This paper tests the effectiveness of log-linear models (LLM) for estimating the suspended (S-LLM), dissolved (D-LLM), and total suspended (T-LLM) load in a Mediterranean semiarid karst stream (the Argos River basin, in southeast Spain). An assessment of the supposed validity of each model and a leave-one-out cross-validation were carried out to determine their degree of statistical robustness. The T-LLM model showed higher prediction accuracy (R² = 0.98, RMSE = 0.15, and PE = ±5.4–6.6%) than the D-LLM model (R² = 0.97, RMSE = 0.16, and PE = ±5.5–6.8%) or the S-LLM model (R² = 0.77, RMSE = 0.71, and PE = ±101–493%). In addition, different model variants, according to two flow patterns (FP1 = base flow and FP2 = rising water level), were developed. The FP2-SLLM model provided a very good fit (R² = 0.94, RMSE = 0.34, and PE = ±25.3–61.5%), substantially improving on the results of the S-LLM model.
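
A log-linear model with leave-one-out validation can be sketched as below. The abstract does not state the predictor or the exact model form, so this assumes a single predictor x (for example, discharge) and the form log10(y) = a + b·log10(x); the function names are hypothetical.

```python
import numpy as np

def fit_log_linear(x, y):
    """Least-squares fit of log10(y) = a + b * log10(x); returns (a, b)."""
    b, a = np.polyfit(np.log10(x), np.log10(y), 1)   # polyfit returns slope first
    return a, b

def loo_log_predictions(x, y):
    """Leave-one-out predictions of log10(y), refitting without each point in turn."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        a, b = fit_log_linear(x[keep], y[keep])
        preds[i] = a + b * np.log10(x[i])
    return preds

def r2_rmse(obs, pred):
    """Coefficient of determination and root-mean-square error (here in log units)."""
    resid = obs - pred
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((obs - obs.mean()) ** 2)
    rmse = np.sqrt(np.mean(resid ** 2))
    return float(r2), float(rmse)
```

Comparing `r2_rmse(np.log10(y), loo_log_predictions(x, y))` with the same statistics from a fit to all data gives a rough indication of how much the fit depends on individual observations.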

Genomic prediction for malting quality traits in practical barley breeding programs

10.1101/2020.07.30.228007 ◽

2020 ◽

Cited By ~ 1

Author(s):

Pernille Sarup ◽

Vahid Edriss ◽

Nanna Hellum Kristensen ◽

Jens Due Jensen ◽

Jihad Orabi ◽

...

Keyword(s):

Genomic Prediction ◽

Prediction Accuracy ◽

Cross Validation ◽

Spring Barley ◽

Malting Quality ◽

Breeding Cycle ◽

Quality Traits ◽

Training Population ◽

Barley Breeding ◽

Breeding Cycles

Abstract Genomic prediction can be advantageous in barley breeding for traits such as yield and malting quality to increase selection accuracy and minimize expensive phenotyping. In this paper, we investigate the possibilities of genomic selection for malting quality traits using a limited training population. The size of the training population is an important factor in determining the prediction accuracy of a trait. We investigated the potential for genomic prediction of malting quality within breeding cycles with leave one out (LOO) cross-validation, and across breeding cycles with leave set out (LSO) cross-validation. In addition, we investigated the effect of training population size on prediction accuracy by random two-, four-, and ten-fold cross-validation. The material used in this study was a population of 1329 spring barley lines from four breeding cycles. We found medium to high narrow sense heritabilities of the malting traits (0.31 to 0.65). Accuracies of predicting breeding values from LOO tests ranged from 0.6 to 0.9, making it worth the effort to use genomic prediction within breeding cycles. Accuracies from LSO tests ranged from 0.39 to 0.70, showing that genomic prediction across breeding cycles was possible as well. Accuracy of prediction increased when the size of the training population increased. Therefore, prediction accuracy might be increased both within and across breeding cycles by increasing the size of the training population.

Identifying human microRNA–disease associations by a new diffusion-based method

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720015500146 ◽

2015 ◽

Vol 13 (04) ◽

pp. 1550014 ◽

Cited By ~ 10

Author(s):

Bo Liao ◽

Sumei Ding ◽

Haowen Chen ◽

Zejun Li ◽

Lijun Cai

Keyword(s):

Biomedical Research ◽

Prediction Accuracy ◽

Information Sources ◽

Cross Validation ◽

Disease Association ◽

Global Network ◽

Disease Similarity ◽

Disease Associations ◽

Network Similarity ◽

Leave One Out

Identifying the microRNA–disease relationship is vital for investigating the pathogenesis of various diseases. However, experimental verification of disease-related microRNAs remains a considerable challenge to many researchers, particularly because numerous new microRNAs are discovered every year. As such, development of computational methods for disease-related microRNA prediction has recently gained considerable attention. In this paper, first, we construct a miRNA functional network and a disease similarity network by integrating different information sources. Then, we introduce a new diffusion-based method (NDBM) to explore global network similarity for miRNA–disease association inference. Even though known miRNA–disease associations in the database are rare, NDBM still achieves an area under the ROC curve (AUC) of 85.62% in leave-one-out cross-validation, significantly improving on the prediction accuracy of previous methods. Moreover, our method is applicable to diseases with no known related miRNAs as well as new miRNAs with unknown target diseases. Some associations strongly predicted by our method are confirmed by public databases. These superior performances suggest that NDBM could be an effective and important tool for biomedical research.
