Making the most of Clumping and Thresholding for polygenic scores

2019
Author(s):
Florian Privé
Bjarni J. Vilhjálmsson
Hugues Aschard
Michael G.B. Blum

Abstract: Polygenic prediction has the potential to contribute to precision medicine. Clumping and Thresholding (C+T) is a widely used method to derive polygenic scores. When using C+T, it is common to test several p-value thresholds to maximize the predictive ability of the derived polygenic scores. Along with this p-value threshold, we propose to tune three other hyper-parameters for C+T. We implement an efficient way to derive thousands of different C+T polygenic scores corresponding to a grid over four hyper-parameters. For example, it takes a few hours to derive 123,200 different C+T scores for 300K individuals and 1M variants on a single node with 16 cores. We find that optimizing over these four hyper-parameters improves the predictive performance of C+T in both simulations and real data applications as compared to tuning only the p-value threshold. A particularly large increase can be noted when predicting depression status, from an AUC of 0.557 (95% CI: [0.544-0.569]) when tuning only the p-value threshold in C+T to an AUC of 0.592 (95% CI: [0.580-0.604]) when tuning all four hyper-parameters we propose for C+T. We further propose Stacked Clumping and Thresholding (SCT), a polygenic score that results from stacking all derived C+T scores. Instead of choosing one set of hyper-parameters that maximizes prediction in some training set, SCT learns an optimal linear combination of all C+T scores by using an efficient penalized regression. We apply SCT to 8 different case-control diseases in the UK Biobank data and find that SCT substantially improves prediction accuracy, with an average AUC increase of 0.035 over standard C+T.
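As a rough illustration of the kind of grid this abstract describes, the following Python sketch derives C+T scores over two hyper-parameters: a p-value threshold and a squared-correlation (r²) clumping threshold. It is a toy, pure-Python sketch and not the authors' implementation; the function names `clump` and `ct_score`, the choice of r² as the second hyper-parameter, and all data values are assumptions made for illustration.

```python
def clump(pvals, corr, r2_max):
    """Greedy clumping: visit variants by increasing p-value and keep a
    variant only if its squared correlation with every variant already
    kept is at most r2_max."""
    kept = []
    for j in sorted(range(len(pvals)), key=lambda i: pvals[i]):
        if all(corr[j][k] ** 2 <= r2_max for k in kept):
            kept.append(j)
    return kept


def ct_score(genotypes, betas, pvals, corr, r2_max, p_max):
    """C+T score: clump, then threshold on p-value, then sum the
    effect-weighted allele counts of the remaining variants."""
    idx = [j for j in clump(pvals, corr, r2_max) if pvals[j] <= p_max]
    return [sum(row[j] * betas[j] for j in idx) for row in genotypes]


# Hypothetical toy data: 4 individuals, 3 variants.
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]
betas = [0.5, 0.4, -0.2]       # GWAS effect sizes (made up)
pvals = [1e-8, 1e-6, 0.03]     # GWAS p-values (made up)
corr = [[1.0, 0.9, 0.1],
        [0.9, 1.0, 0.2],
        [0.1, 0.2, 1.0]]       # variant-variant correlations (made up)

# One score vector per point of the hyper-parameter grid.
grid = {
    (r2_max, p_max): ct_score(genotypes, betas, pvals, corr, r2_max, p_max)
    for r2_max in (0.2, 0.95)
    for p_max in (1e-5, 0.05)
}
```

Each entry of `grid` is one candidate score. Choosing the best entry on a training set corresponds to standard tuning, whereas a stacking approach would instead fit a regression on all entries together.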

Author(s):
Jiwoo Lee
Tuomo Kiiskinen
Nina Mars
Sakari Jukarainen
Erik Ingelsson
...

Background - Acute coronary syndrome (ACS) is a clinically significant presentation of coronary heart disease (CHD). Genetic information has been proposed to improve prediction beyond well-established clinical risk factors. While polygenic scores (PS) can capture an individual's genetic risk for ACS, their prediction performance may vary in the context of diverse correlated clinical conditions. Here, we aimed to test whether clinical conditions impact the association between PS and ACS. Methods - We explored the association between 405 clinical conditions diagnosed before baseline and 9,080 incident cases of ACS in 387,832 individuals from the UK Biobank. Results were replicated in 6,430 incident cases of ACS in 177,876 individuals from FinnGen. Results - We identified 80 conventional (e.g., stable angina pectoris (SAP), type 2 diabetes mellitus) and unconventional (e.g., diaphragmatic hernia, inguinal hernia) associations with ACS. The association between PS and ACS was consistent in individuals with and without most clinical conditions. However, a diagnosis of SAP yielded a differential association between PS and ACS. PS was associated with a significantly reduced (interaction p-value=2.87×10⁻⁸) risk for ACS in individuals with SAP (HR=1.163 [95% CI: 1.082-1.251]) compared to individuals without SAP (HR=1.531 [95% CI: 1.497-1.565]). These findings were replicated in FinnGen (interaction p-value=1.38×10⁻⁶). Conclusions - In summary, while most clinical conditions did not impact the utility of PS for prediction of ACS, we found that PS was substantially less predictive of ACS in individuals with prevalent stable CHD. PS may be more appropriate for prediction of ACS in asymptomatic individuals than in symptomatic individuals with clinical suspicion of CHD.


2018
Author(s):
Florian Privé
Hugues Aschard
Michael G.B. Blum

Abstract: Polygenic Risk Scores (PRS) combine information across many single-nucleotide polymorphisms (SNPs) into a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify individuals at high genetic risk for a given disease. The “Clumping+Thresholding” (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association study (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve the predictive performance of PRS as compared to C+T. In this paper, we present an efficient method to jointly estimate SNP effects, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. The choice of hyper-parameters for a predictive model is very important since it can dramatically impact its predictive performance. As an example, AUC values range from less than 60% to 90% in a model with 30 causal SNPs, depending on the p-value threshold in C+T. We compare the performance of PLR, C+T and a derivation of random forests using both real and simulated data. PLR consistently achieves higher predictive performance than the two other methods while being as fast as C+T. We find that improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, AUC values increase from 83% with the best prediction of C+T to 92.5% with PLR.
We confirm these results in a data analysis of a case-control study for celiac disease, where PLR and the standard C+T method achieve AUCs of 89% and 82.5%, respectively. In conclusion, our study demonstrates that penalized logistic regression can achieve more discriminative polygenic risk scores, while being applicable to large-scale individual-level data thanks to the implementation we provide in the R package bigstatsr.


Author(s):
Florian Privé
Julyan Arbel
Bjarni J. Vilhjálmsson

Abstract: Polygenic scores have become a central tool in human genetics research. LDpred is a popular method for deriving polygenic scores based on summary statistics and a matrix of correlation between genetic variants. However, LDpred has limitations that may reduce its predictive performance. Here we present LDpred2, a new version of LDpred that addresses these issues. We also provide two new options in LDpred2: a “sparse” option that can learn effects that are exactly 0, and an “auto” option that directly learns the two LDpred parameters from data. We benchmark predictive performance of LDpred2 against the previous version on simulated and real data, demonstrating substantial improvements in robustness and predictive accuracy compared to LDpred1. We then show that LDpred2 also outperforms other recently developed polygenic score methods, with a mean AUC over the 8 real traits analyzed here of 65.1%, compared to 63.8% for lassosum, 62.9% for PRS-CS and 61.5% for SBayesR. Note that, in contrast to what was recommended in the first version of this paper, we now recommend running LDpred2 genome-wide instead of per chromosome. LDpred2 is implemented in the R package bigsnpr.


2021
Author(s):
Jiacheng Miao
Yupei Lin
Yuchang Wu
Boyan Zheng
Lauren L. Schmitz
...

Detecting genetic variants associated with the variance of complex traits, i.e. variance quantitative trait loci (vQTL), can provide crucial insights into the interplay between genes and environments and how they jointly shape human phenotypes in the population. We propose a quantile integral linear model (QUAIL) to estimate genetic effects on trait variability. Through extensive simulations and analyses of real data, we demonstrate that QUAIL provides computationally efficient and statistically powerful vQTL mapping that is robust to non-Gaussian phenotypes and confounding effects on phenotypic variability. Applied to UK Biobank (N=375,791), QUAIL identified 11 novel vQTL for body mass index (BMI). Top vQTL findings showed substantial enrichment for interactions with physical activities and sedentary behavior. Further, variance polygenic scores (vPGS) based on QUAIL effect estimates showed superior predictive performance on both population-level and within-individual BMI variability compared to existing approaches. Overall, QUAIL is a unified framework to quantify genetic effects on the phenotypic variability at both single-variant and vPGS levels. It addresses critical limitations in existing approaches and may have broad applications in future gene-environment interaction studies.


Author(s):
Ying Wang
Jing Guo
Guiyan Ni
Jian Yang
Peter M. Visscher
...

Abstract: Polygenic scores (PGS) have been widely used to predict complex traits and risk of diseases using variants identified from genome-wide association studies (GWASs). To date, most GWASs have been conducted in populations of European ancestry, which limits the use of GWAS-derived PGS in non-European populations. Here, we develop a new theory to predict the relative accuracy (RA, relative to the accuracy in populations of the same ancestry as the discovery population) of PGS across ancestries. We used simulations and real data from the UK Biobank to evaluate our results. We found across various simulation scenarios that the RA of PGS based on trait-associated SNPs can be predicted accurately by modelling linkage disequilibrium (LD), minor allele frequencies (MAF), cross-population correlations of SNP effect sizes and heritability. Altogether, we find that LD and MAF differences between ancestries alone explain up to ~70% of the loss of RA when using European-based PGS in African ancestry for traits like body mass index and height. Our results suggest that causal variants underlying common genetic variation identified in European ancestry GWASs are mostly shared across continents.


Author(s):
Florian Privé
Julyan Arbel
Bjarni J Vilhjálmsson

Abstract. Motivation: Polygenic scores have become a central tool in human genetics research. LDpred is a popular method for deriving polygenic scores based on summary statistics and a matrix of correlation between genetic variants. However, LDpred has limitations that may reduce its predictive performance. Results: Here, we present LDpred2, a new version of LDpred that addresses these issues. We also provide two new options in LDpred2: a ‘sparse’ option that can learn effects that are exactly 0, and an ‘auto’ option that directly learns the two LDpred parameters from data. We benchmark predictive performance of LDpred2 against the previous version on simulated and real data, demonstrating substantial improvements in robustness and predictive accuracy compared to LDpred1. We then show that LDpred2 also outperforms other polygenic score methods recently developed, with a mean AUC over the 8 real traits analyzed here of 65.1%, compared to 63.8% for lassosum, 62.9% for PRS-CS and 61.5% for SBayesR. Note that LDpred2 provides more accurate polygenic scores when run genome-wide, instead of per chromosome. Availability and implementation: LDpred2 is implemented in R package bigsnpr. Supplementary information: Supplementary data are available at Bioinformatics online.


2020
Author(s):
Jiwoo Lee
Tuomo Kiiskinen
Nina Mars
Sakari Jukarainen
Erik Ingelsson
...

Early prediction of acute coronary syndrome (ACS) is a major goal for prevention of coronary heart disease (CHD). Genetic information has been proposed to improve prediction beyond well-established clinical risk factors. While polygenic scores (PS) can capture an individual's genetic risk for ACS, their prediction performance may vary in the context of diverse correlated clinical conditions. Here, we aimed to test whether clinical conditions impact the association between PS and ACS. We explored the association between 405 clinical conditions diagnosed before baseline and 9,080 incident cases of ACS in 387,832 individuals from the UK Biobank. We identified 80 conventional (e.g., stable angina pectoris (SAP), type 2 diabetes mellitus) and unconventional (e.g., diaphragmatic hernia, inguinal hernia) associations with ACS. Results were replicated in 6,430 incident cases of ACS in 177,876 individuals from FinnGen. The association between PS and ACS was consistent in individuals with and without most clinical conditions. However, a diagnosis of SAP yielded a differential association between PS and ACS. PS was associated with a significantly reduced (interaction p-value=2.87×10⁻⁸) risk for ACS in individuals with SAP (HR=1.163 [95% CI: 1.082-1.251]) compared to individuals without SAP (HR=1.531 [95% CI: 1.497-1.565]). These findings were replicated in FinnGen (interaction p-value=1.38×10⁻⁶). In summary, while most clinical conditions did not impact the utility of PS for prediction of ACS, we found that PS was substantially less predictive of ACS in individuals with prevalent stable CHD. PS for ACS may be more appropriate for asymptomatic individuals than for symptomatic individuals with clinical suspicion of CHD.


2020
Author(s):
John E. McGeary
Chelsie Benca-Bachman
Victoria Risner
Christopher G Beevers
Brandon Gibb
...

Twin studies indicate that 30-40% of the disease liability for depression can be attributed to genetic differences. Here, we assess the explanatory ability of polygenic scores (PGS) based on broad- (PGS_BD) and clinical- (PGS_MDD) depression summary statistics from the UK Biobank using independent cohorts of adults (N=210; 100% European Ancestry) and children (N=728; 70% European Ancestry) who have been extensively phenotyped for depression and related neurocognitive phenotypes. PGS associations with depression severity and diagnosis were generally modest, and larger in adults than children. Polygenic prediction of depression-related phenotypes was mixed and varied by PGS. Higher PGS_BD, in adults, was associated with a higher likelihood of having suicidal ideation, increased brooding and anhedonia, and lower levels of cognitive reappraisal; PGS_MDD was positively associated with brooding and negatively related to cognitive reappraisal. Overall, PGS based on both broad and clinical depression phenotypes have modest utility in adult and child samples of depression.


2021
Vol 12 (1)
Author(s):
Janhavi R. Raut
Ben Schöttker
Bernd Holleczek
Feng Guo
Megha Bhardwaj
...

Abstract: Circulating microRNAs (miRNAs) could improve colorectal cancer (CRC) risk prediction. Here, we derive a blood-based miRNA panel and evaluate its ability to predict CRC occurrence in a population-based cohort of adults aged 50–75 years. Forty-one miRNAs are preselected from independent studies and measured by quantitative real-time polymerase chain reaction in serum collected at baseline from 198 participants who develop CRC during 14 years of follow-up and 178 randomly selected controls. A 7-miRNA score is derived by logistic regression. Its predictive ability, quantified by the optimism-corrected area under the receiver operating characteristic curve (AUC) using the .632+ bootstrap, is 0.794. Predictive ability is compared to that of an environmental risk score (ERS) based on known risk factors and a polygenic risk score (PRS) based on 140 previously identified single-nucleotide polymorphisms. In participants with all scores available, the optimism-corrected AUC is 0.802 for the 7-miRNA score, while the AUC (95% CI) is 0.557 (0.498–0.616) for the ERS and 0.622 (0.564–0.681) for the PRS.
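Since predictive ability here is reported as an AUC, a short sketch of what that number measures may help. The Python function below computes the plain (apparent) AUC as the fraction of case-control pairs in which the case has the higher score, counting ties as one half. It does not implement the .632+ bootstrap optimism correction used in the study; the function name `auc` and the toy scores are assumptions for illustration.

```python
def auc(scores, labels):
    """Apparent AUC: probability that a randomly chosen case (label 1)
    has a higher score than a randomly chosen control (label 0),
    with ties counted as 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical risk scores for two cases and two controls.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)  # 3 of 4 case-control pairs are ordered correctly
```

An AUC of 0.5 corresponds to a score that ranks cases and controls no better than chance, which puts values such as 0.557 and 0.794 in context.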


2021
Vol 11 (15)
pp. 6998
Author(s):
Qiuying Li
Hoang Pham

Many NHPP software reliability growth models (SRGMs) have been proposed over the past 40 years to assess software reliability. Most of them focus on modeling the fault detection process (FDP) and treat the fault correction process (FCP) in one of two ways. The first is to ignore the FCP, i.e., faults are assumed to be removed instantaneously once the failure they cause is detected. In real software development, however, this assumption is not always realistic, because fault removal usually takes time: the faults causing failures cannot always be removed at once, and detected failures become increasingly difficult to correct as testing progresses. The second is to consider the time delay between fault detection and fault correction, which has been assumed to be constant, a function of time, or a random variable following some distribution. In this paper, some useful approaches to modeling the dual fault detection and correction processes are discussed. The dependencies between the fault amounts of the two processes are considered instead of the fault correction time delay. A model is proposed that integrates the fault detection and fault correction processes and incorporates a fault introduction rate and a testing coverage rate into the software reliability evaluation. The model parameters are estimated using the Least Squares Estimation (LSE) method. The descriptive and predictive performance of the proposed model and of other existing NHPP SRGMs is investigated on three real data sets using four criteria. The results show that the new model can be effective in yielding better reliability estimation and prediction.
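To make the least-squares estimation step concrete, the sketch below fits a classic NHPP SRGM, the Goel-Okumoto model with mean value function m(t) = a(1 - e^(-bt)), to cumulative fault counts. This is a swapped-in textbook model, not the model proposed in the paper, and the fitting strategy (a grid over b with the closed-form least-squares a for each b) is a simplification chosen for illustration. The function name `fit_goel_okumoto` and the synthetic data are assumptions.

```python
import math


def fit_goel_okumoto(times, counts, b_grid):
    """Least-squares fit of m(t) = a * (1 - exp(-b * t)).
    For each candidate b, the a minimizing the squared error has the
    closed form sum(y * f) / sum(f * f), with f = 1 - exp(-b * t).
    Returns the (a, b, sse) triple with the smallest squared error."""
    best = None
    for b in b_grid:
        f = [1.0 - math.exp(-b * t) for t in times]
        a = sum(y * fi for y, fi in zip(counts, f)) / sum(fi * fi for fi in f)
        sse = sum((y - a * fi) ** 2 for y, fi in zip(counts, f))
        if best is None or sse < best[2]:
            best = (a, b, sse)
    return best


# Synthetic, noise-free cumulative fault counts generated with a=100, b=0.1.
times = list(range(1, 11))
counts = [100.0 * (1.0 - math.exp(-0.1 * t)) for t in times]

a_hat, b_hat, sse = fit_goel_okumoto(times, counts, [i / 100 for i in range(1, 51)])
```

On this noise-free data the grid contains the true b, so the fit recovers the generating parameters. With real failure data a finer grid or a numerical optimizer would be the more usual choice.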

