Polygenic scores via penalized regression on summary statistics

Timothy Shin Heng Mak; Robert Milan Porsch; Shing Wan Choi; Xueya Zhou; Pak Chung Sham

doi:10.1002/gepi.22050

Polygenic scores via penalized regression on summary statistics

10.1101/058214 ◽

2016 ◽

Author(s):

Timothy Shin Heng Mak ◽

Robert Milan Porsch ◽

Shing Wan Choi ◽

Xueya Zhou ◽

Pak Chung Sham

Keyword(s):

Prediction Accuracy ◽

Penalized Regression ◽

Tuning Parameter ◽

Summary Statistics ◽

Validation Data ◽

Polygenic Scores ◽

Pertinent Question ◽

Almost All ◽

Risk Categories ◽

General Method

AbstractPolygenic scores (PGS) summarize the genetic contribution of a person’s genotype to a disease or phenotype. They can be used to group participants into different risk categories for diseases, and are also used as covariates in epidemiological analyses. A number of possible ways of calculating polygenic scores have been proposed, and recently there is much interest in methods that incorporate information available in published summary statistics. As there is no inherent information on linkage disequilibrium (LD) in summary statistics, a pertinent question is how we can make use of LD information available elsewhere to supplement such analyses. To answer this question we propose a method for constructing PGS using summary statistics and a reference panel in a penalized regression framework, which we call lassosum. We also propose a general method for choosing the value of the tuning parameter in the absence of validation data. In our simulations, we showed that pseudovalidation often resulted in prediction accuracy that is comparable to using a dataset with validation phenotype and was clearly superior to the conservative option of setting the tuning parameter of lassosum to its lowest value. We also showed that lassosum achieved better prediction accuracy than simple clumping and p-value thresholding in almost all scenarios. It was also substantially faster and more accurate than the recently proposed LDpred.

Download Full-text

Faculty Opinions recommendation of Polygenic scores via penalized regression on summary statistics.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.727593680.793563010 ◽

2019 ◽

Author(s):

John Nurnberger

Keyword(s):

Penalized Regression ◽

Summary Statistics ◽

Polygenic Scores

Download Full-text

Penalized regression and model selection methods for polygenic scores on summary statistics

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008271 ◽

2020 ◽

Vol 16 (10) ◽

pp. e1008271

Author(s):

Jack Pattee ◽

Wei Pan

Keyword(s):

Model Selection ◽

Penalized Regression ◽

Summary Statistics ◽

Selection Methods ◽

Polygenic Scores

Download Full-text

Contrasting Broad- and Clinically- defined Polygenic Indicators of Depression and Depression-related Phenotypes in Adults and Children

10.31234/osf.io/pn9vb ◽

2020 ◽

Author(s):

John E. McGeary ◽

Chelsie Benca-Bachman ◽

Victoria Risner ◽

Christopher G Beevers ◽

Brandon Gibb ◽

...

Keyword(s):

Suicidal Ideation ◽

Cognitive Reappraisal ◽

Twin Studies ◽

European Ancestry ◽

Summary Statistics ◽

Depression Severity ◽

Uk Biobank ◽

Polygenic Scores ◽

Adults And Children ◽

The Uk

Twin studies indicate that 30-40% of the disease liability for depression can be attributed to genetic differences. Here, we assess the explanatory ability of polygenic scores (PGS) based on broad- (PGSBD) and clinical- (PGSMDD) depression summary statistics from the UK Biobank using independent cohorts of adults (N=210; 100% European Ancestry) and children (N=728; 70% European Ancestry) who have been extensively phenotyped for depression and related neurocognitive phenotypes. PGS associations with depression severity and diagnosis were generally modest, and larger in adults than children. Polygenic prediction of depression-related phenotypes was mixed and varied by PGS. Higher PGSBD, in adults, was associated with a higher likelihood of having suicidal ideation, increased brooding and anhedonia, and lower levels of cognitive reappraisal; PGSMDD was positively associated with brooding and negatively related to cognitive reappraisal. Overall, PGS based on both broad and clinical depression phenotypes have modest utility in adult and child samples of depression.

Download Full-text

Polygenic scores for UK Biobank scale data

10.1101/252270 ◽

2018 ◽

Cited By ~ 5

Author(s):

Timothy Shin Heng Mak ◽

Robert Milan Porsch ◽

Shing Wan Choi ◽

Pak Chung Sham

Keyword(s):

External Source ◽

Summary Statistics ◽

Uk Biobank ◽

Validation Data ◽

Raw Data ◽

Cross Prediction ◽

Polygenic Scores ◽

The Difference ◽

The Uk ◽

Meta Analyses

AbstractPolygenic scores (PGS) are estimated scores representing the genetic tendency of an individual for a disease or trait and have become an indispensible tool in a variety of analyses. Typically they are linear combination of the genotypes of a large number of SNPs, with the weights calculated from an external source, such as summary statistics from large meta-analyses. Recently cohorts with genetic data have become very large, such that it would be a waste if the raw data were not made use of in constructing PGS. Making use of raw data in calculating PGS, however, presents us with problems of overfitting. Here we discuss the essence of overfitting as applied in PGS calculations and highlight the difference between overfitting due to the overlap between the target and the discovery data (OTD), and overfitting due to the overlap between the target the the validation data (OTV). We propose two methods — cross prediction and split validation — to overcome OTD and OTV respectively. Using these two methods, PGS can be calculated using raw data without overfitting. We show that PGSs thus calculated have better predictive power than those using summary statistics alone for six phenotypes in the UK Biobank data.

Download Full-text

MTAG: Multi-Trait Analysis of GWAS

10.1101/118810 ◽

2017 ◽

Cited By ~ 19

Author(s):

Patrick Turley ◽

Raymond K. Walters ◽

Omeed Maghzian ◽

Aysu Okbay ◽

James J. Lee ◽

...

Keyword(s):

Depressive Symptoms ◽

Well Being ◽

Joint Analysis ◽

Summary Statistics ◽

Subjective Well Being ◽

Bioinformatics Analyses ◽

Trait Analysis ◽

Genome Wide ◽

Polygenic Scores ◽

Variance Explained

ABSTRACTWe introduce Multi-Trait Analysis of GWAS (MTAG), a method for joint analysis of summary statistics from GWASs of different traits, possibly from overlapping samples. We apply MTAG to summary statistics for depressive symptoms (Neff = 354,862), neuroticism (N = 168,105), and subjective well-being (N = 388,538). Compared to 32, 9, and 13 genome-wide significant loci in the single-trait GWASs (most of which are themselves novel), MTAG increases the number of loci to 64, 37, and 49, respectively. Moreover, association statistics from MTAG yield more informative bioinformatics analyses and increase variance explained by polygenic scores by approximately 25%, matching theoretical expectations.

Download Full-text

Making the most of Clumping and Thresholding for polygenic scores

10.1101/653204 ◽

2019 ◽

Cited By ~ 1

Author(s):

Florian Privé ◽

Bjarni J. Vilhjálmsson ◽

Hugues Aschard ◽

Michael G.B. Blum

Keyword(s):

Predictive Ability ◽

Predictive Performance ◽

Real Data ◽

Penalized Regression ◽

P Value ◽

Single Node ◽

Depression Status ◽

Polygenic Scores ◽

Optimal Linear ◽

The Uk

AbstractPolygenic prediction has the potential to contribute to precision medicine. Clumping and Thresh-olding (C+T) is a widely used method to derive polygenic scores. When using C+T, it is common to test several p-value thresholds to maximize predictive ability of the derived polygenic scores. Along with this p-value threshold, we propose to tune three other hyper-parameters for C+T. We implement an efficient way to derive thousands of different C+T polygenic scores corresponding to a grid over four hyper-parameters. For example, it takes a few hours to derive 123,200 different C+T scores for 300K individuals and 1M variants on a single node with 16 cores.We find that optimizing over these four hyper-parameters improves the predictive performance of C+T in both simulations and real data applications as compared to tuning only the p-value threshold. A particularly large increase can be noted when predicting depression status, from an AUC of 0.557 (95% CI: [0.544-0.569]) when tuning only the p-value threshold in C+T to an AUC of 0.592 (95% CI: [0.580-0.604]) when tuning all four hyper-parameters we propose for C+T.We further propose Stacked Clumping and Thresholding (SCT), a polygenic score that results from stacking all derived C+T scores. Instead of choosing one set of hyper-parameters that maximizes prediction in some training set, SCT learns an optimal linear combination of all C+T scores by using an efficient penalized regression. We apply SCT to 8 different case-control diseases in the UK biobank data and find that SCT substantially improves prediction accuracy with an average AUC increase of 0.035 over standard C+T.

Download Full-text

Genomic Prediction of Complex Disease Risk

10.1101/506600 ◽

2018 ◽

Cited By ~ 1

Author(s):

Louis Lello ◽

Timothy G. Raben ◽

Soke Yuen Yong ◽

Laurent CAM Tellier ◽

Stephen D.H. Hsu

Keyword(s):

Genomic Prediction ◽

Complex Disease ◽

Disease Risk ◽

Penalized Regression ◽

Case Control ◽

Nucleotide Polymorphisms ◽

Uk Biobank ◽

Rapid Improvement ◽

Control Data ◽

Polygenic Scores

AbstractWe construct risk predictors using polygenic scores (PGS) computed from common Single Nucleotide Polymorphisms (SNPs) for a number of complex disease conditions, using L1-penalized regression (also known as LASSO) on case-control data from UK Biobank. Among the disease conditions studied are Hypothyroidism, (Resistant) Hypertension, Type 1 and 2 Diabetes, Breast Cancer, Prostate Cancer, Testicular Cancer, Gallstones, Glaucoma, Gout, Atrial Fibrillation, High Cholesterol, Asthma, Basal Cell Carcinoma, Malignant Melanoma, and Heart Attack. We obtain values for the area under the receiver operating characteristic curves (AUC) in the range ~ 0.58 – 0.71 using SNP data alone. Substantially higher predictor AUCs are obtained when incorporating additional variables such as age and sex. Some SNP predictors alone are sufficient to identify outliers (e.g., in the 99th percentile of PGS) with 3 – 8 times higher risk than typical individuals. We validate predictors out-of-sample using the eMERGE dataset, and also with different ancestry subgroups within the UK Biobank population. Our results indicate that substantial improvements in predictive power are attainable using training sets with larger case populations. We anticipate rapid improvement in genomic prediction as more case-control data become available for analysis.

Download Full-text

LDpred2: better, faster, stronger

10.1101/2020.04.28.066720 ◽

2020 ◽

Cited By ~ 3

Author(s):

Florian Privé ◽

Julyan Arbel ◽

Bjarni J. Vilhjálmsson

Keyword(s):

Human Genetics ◽

Predictive Accuracy ◽

Predictive Performance ◽

Real Data ◽

R Package ◽

Summary Statistics ◽

Genetics Research ◽

Genome Wide ◽

Polygenic Scores ◽

Central Tool

AbstractPolygenic scores have become a central tool in human genetics research. LDpred is a popular method for deriving polygenic scores based on summary statistics and a matrix of correlation between genetic variants. However, LDpred has limitations that may reduce its predictive performance. Here we present LDpred2, a new version of LDpred that addresses these issues. We also provide two new options in LDpred2: a “sparse” option that can learn effects that are exactly 0, and an “auto” option that directly learns the two LDpred parameters from data. We benchmark predictive performance of LDpred2 against the previous version on simulated and real data, demonstrating substantial improvements in robustness and predictive accuracy compared to LDpred1. We then show that LDpred2 also outperforms other polygenic score methods recently developed, with a mean AUC over the 8 real traits analyzed here of 65.1%, compared to 63.8% for lassosum, 62.9% for PRS-CS and 61.5% for SBayesR. Note that, in contrast to what was recommended in the first version of this paper, we now recommend to run LDpred2 genome-wide instead of per chromosome. LDpred2 is implemented in R package bigsnpr.

Download Full-text

A tool for translating polygenic scores onto the absolute scale using summary statistics

European Journal of Human Genetics ◽

10.1038/s41431-021-01028-z ◽

2022 ◽

Author(s):

Oliver Pain ◽

Alexandra C. Gillett ◽

Jehannine C. Austin ◽

Lasse Folkersen ◽

Cathryn M. Lewis

Keyword(s):

Genome Wide Association Study ◽

Absolute Risk ◽

Summary Statistics ◽

Clinical Implementation ◽

Polygenic Score ◽

Absolute Scale ◽

Polygenic Scores ◽

The Absolute ◽

The Uk ◽

Normally Distributed

AbstractThere is growing interest in the clinical application of polygenic scores as their predictive utility increases for a range of health-related phenotypes. However, providing polygenic score predictions on the absolute scale is an important step for their safe interpretation. We have developed a method to convert polygenic scores to the absolute scale for binary and normally distributed phenotypes. This method uses summary statistics, requiring only the area-under-the-ROC curve (AUC) or variance explained (R2) by the polygenic score, and the prevalence of binary phenotypes, or mean and standard deviation of normally distributed phenotypes. Polygenic scores are converted using normal distribution theory. We also evaluate methods for estimating polygenic score AUC/R2 from genome-wide association study (GWAS) summary statistics alone. We validate the absolute risk conversion and AUC/R2 estimation using data for eight binary and three continuous phenotypes in the UK Biobank sample. When the AUC/R2 of the polygenic score is known, the observed and estimated absolute values were highly concordant. Estimates of AUC/R2 from the lassosum pseudovalidation method were most similar to the observed AUC/R2 values, though estimated values deviated substantially from the observed for autoimmune disorders. This study enables accurate interpretation of polygenic scores using only summary statistics, providing a useful tool for educational and clinical purposes. Furthermore, we have created interactive webtools implementing the conversion to the absolute (https://opain.github.io/GenoPred/PRS_to_Abs_tool.html). Several further barriers must be addressed before clinical implementation of polygenic scores, such as ensuring target individuals are well represented by the GWAS sample.

Download Full-text