Improved polygenic prediction by Bayesian multiple regression on summary statistics

2019
Author(s): Luke R. Lloyd-Jones, Jian Zeng, Julia Sidorenko, Loïc Yengo, Gerhard Moser, ...

Abstract The capacity to accurately predict an individual’s phenotype from their DNA sequence is one of the great promises of genomics and precision medicine. Recently, Bayesian methods for generating polygenic predictors have been successfully applied in human genomics, but they require individual-level data, which are often limited in access due to privacy or logistical concerns, and they are computationally very intensive. This has motivated methodological frameworks that utilise publicly available genome-wide association studies (GWAS) summary data, which now for some traits include results from more than a million individuals. In this study, we extend the established summary statistics methodological framework to include a class of point-normal mixture prior Bayesian regression models, which have been shown to generate optimal genetic predictions and can perform heritability estimation and variant mapping and estimate the distribution of the genetic effects. In a wide range of simulations and cross-validation using 10 real quantitative traits and 1.1 million variants on 350,000 individuals from the UK Biobank (UKB), we establish that our summary-based method, SBayesR, performs similarly to methods that use the individual-level data and outperforms other state-of-the-art summary statistics methods in terms of prediction accuracy and heritability estimation at a fraction of the computational resources. We generate polygenic predictors for body mass index and height in two independent data sets and show that, by exploiting summary statistics on 1.1 million variants from the largest GWAS meta-analysis (n ≈ 700,000), the SBayesR prediction R² improved on average across traits by 6.8% relative to that estimated from an individual-level data BayesR analysis of data from the UKB (n ≈ 450,000). Compared with commonly used state-of-the-art summary-based methods, SBayesR improved the prediction R² by 4.1% relative to LDpred and by 28.7% relative to clumping and p-value thresholding. SBayesR gave comparable prediction accuracy to the recent RSS method, which has a similar model, but at a computational time that is two orders of magnitude smaller. The methodology is implemented in a very efficient and user-friendly software tool titled GCTB.

2016
Author(s): Xiang Zhu, Matthew Stephens

Bayesian methods for large-scale multiple regression provide attractive approaches to the analysis of genome-wide association studies (GWAS). For example, they can estimate heritability of complex traits, allowing for both polygenic and sparse models; and by incorporating external genomic data into the priors they can increase power and yield new biological insights. However, these methods require access to individual genotypes and phenotypes, which are often not easily available. Here we provide a framework for performing these analyses without individual-level data. Specifically, we introduce a “Regression with Summary Statistics” (RSS) likelihood, which relates the multiple regression coefficients to univariate regression results that are often easily available. The RSS likelihood requires estimates of correlations among covariates (SNPs), which also can be obtained from public databases. We perform Bayesian multiple regression analysis by combining the RSS likelihood with previously proposed prior distributions, sampling posteriors by Markov chain Monte Carlo. In a wide range of simulations RSS performs similarly to analyses using the individual data, both for estimating heritability and detecting associations. We apply RSS to a GWAS of human height that contains 253,288 individuals typed at 1.06 million SNPs, for which analyses of individual-level data are practically impossible. Estimates of heritability (52%) are consistent with, but more precise than, previous results using subsets of these data. We also identify many previously unreported loci that show evidence for association with height in our analyses. Software is available at https://github.com/stephenslab/rss.
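To make the RSS likelihood concrete, the sketch below evaluates its log-density in the form it is usually written: the vector of univariate estimates is modelled as normal with mean S R S⁻¹ β and covariance S R S, where S is the diagonal matrix of standard errors and R is the SNP correlation (LD) matrix. The function name `rss_log_likelihood` and its argument names are illustrative assumptions, not part of the authors' software, and the dense linear algebra here is only practical for a small block of SNPs.

```python
import numpy as np


def rss_log_likelihood(beta, beta_hat, se, ld):
    """Log-density of N(beta_hat; S R S^-1 beta, S R S).

    beta     : multiple-regression effects (length p)
    beta_hat : univariate (marginal) effect estimates (length p)
    se       : standard errors of beta_hat (length p), the diagonal of S
    ld       : p x p SNP correlation matrix R
    Names are hypothetical; this is a small dense illustration only.
    """
    beta = np.asarray(beta, dtype=float)
    beta_hat = np.asarray(beta_hat, dtype=float)
    se = np.asarray(se, dtype=float)
    ld = np.asarray(ld, dtype=float)

    mean = se * (ld @ (beta / se))   # S R S^-1 beta
    cov = ld * np.outer(se, se)      # S R S
    resid = beta_hat - mean

    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("S R S must be positive definite")
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (beta_hat.size * np.log(2.0 * np.pi) + logdet + quad)
```

With an identity LD matrix the expression reduces to independent normal densities centred on `beta`, which is a convenient sanity check before plugging in a reference-panel LD estimate.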


2019, Vol 10 (1)
Author(s): Luke R. Lloyd-Jones, Jian Zeng, Julia Sidorenko, Loïc Yengo, Gerhard Moser, ...

Abstract Accurate prediction of an individual’s phenotype from their DNA sequence is one of the great promises of genomics and precision medicine. We extend a powerful individual-level data Bayesian multiple regression model (BayesR) to one that utilises summary statistics from genome-wide association studies (GWAS), SBayesR. In simulation and cross-validation using 12 real traits and 1.1 million variants on 350,000 individuals from the UK Biobank, SBayesR improves prediction accuracy relative to commonly used state-of-the-art summary statistics methods at a fraction of the computational resources. Furthermore, using summary statistics for variants from the largest GWAS meta-analysis (n ≈ 700,000) on height and BMI, we show that, on average across traits and two independent data sets, SBayesR improves prediction R² by 5.2% relative to LDpred and by 26.5% relative to clumping and p value thresholding.


2020
Author(s): Clara Albiñana, Jakob Grove, John J. McGrath, Esben Agerbo, Naomi R. Wray, ...

Abstract The accuracy of polygenic risk scores (PRSs) to predict complex diseases increases with the training sample size. PRSs are generally derived based on summary statistics from large meta-analyses of multiple genome-wide association studies (GWAS). However, it is now common for researchers to have access to large individual-level data as well, such as the UK Biobank data. To the best of our knowledge, it has not yet been explored how to best combine both types of data (summary statistics and individual-level data) to optimize polygenic prediction. The most widely used approach to combine data is the meta-analysis of GWAS summary statistics (Meta-GWAS), but we show that it does not always provide the most accurate PRS. Through simulations and using twelve real case-control and quantitative traits from both iPSYCH and UK Biobank along with external GWAS summary statistics, we compare Meta-GWAS with two alternative data-combining approaches, stacked clumping and thresholding (SCT) and Meta-PRS. We find that, when large individual-level data are available, the linear combination of PRSs (Meta-PRS) is both a simple alternative to Meta-GWAS and often more accurate.
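The idea of a linear combination of PRSs can be sketched as follows. The abstract does not say how the combination weights are chosen, so this example assumes they are fitted by ordinary least squares on a quantitative phenotype in a tuning sample; the function names `fit_meta_prs` and `apply_meta_prs` are hypothetical and do not come from the paper.

```python
import numpy as np


def fit_meta_prs(prs_external, prs_internal, phenotype):
    """Fit intercept and weights for combining two PRSs by least squares.

    Assumption: OLS on a quantitative phenotype in a tuning sample.
    Returns an array [intercept, weight_external, weight_internal].
    """
    prs_external = np.asarray(prs_external, dtype=float)
    prs_internal = np.asarray(prs_internal, dtype=float)
    phenotype = np.asarray(phenotype, dtype=float)
    design = np.column_stack(
        [np.ones_like(prs_external), prs_external, prs_internal]
    )
    coef = np.linalg.lstsq(design, phenotype, rcond=None)[0]
    return coef


def apply_meta_prs(coef, prs_external, prs_internal):
    """Combined score for new individuals using previously fitted weights."""
    prs_external = np.asarray(prs_external, dtype=float)
    prs_internal = np.asarray(prs_internal, dtype=float)
    return coef[0] + coef[1] * prs_external + coef[2] * prs_internal
```

In practice the weights would be fitted in one sample and the combined score evaluated in a separate one, since in-sample accuracy of a fitted combination can never be lower than that of either input score.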


Author(s): David M. Wineroither, Rudolf Metz

Abstract This report surveys four approaches that are pivotal to the study of preference formation: (a) the range, validity, and theoretical foundations of explanations of political preferences at the individual and mass levels, (b) the exploration of key objects of preference formation attached to the democratic political process (i.e., voting in competitive elections), (c) the top-down vs. bottom-up character of preference formation as addressed in leader–follower studies, and (d) gene–environment interaction and the explanatory weight of genetic predisposition against the cumulative weight of social experiences. In recent years, our understanding of sites and processes of (individual) political-preference formation has substantially improved. First, this applies to a greater variety of objects that provide fresh insight into the functioning and stability of contemporary democracy. Second, we observe the reaffirmation of pivotal theories and key concepts in adapted form against widespread challenge. This applies to the role played by social stratification, group awareness, and individual-level economic considerations. Most of these findings converge in recognising economics-based explanations. Third, research into gene–environment interplay rapidly increases the number of testable hypotheses and promises to benefit a wide range of approaches already taken and advanced in the study of political-preference formation.


2021, pp. 003329412110268
Author(s): Jaime Ballard, Adeya Richmond, Suzanne van den Hoogenhof, Lynne Borden, Daniel Francis Perkins

Background: Multilevel data can be missing at the individual level or at a nested level, such as family, classroom, or program site. Increased knowledge of higher-level missing data is necessary to develop evaluation design and statistical methods to address it. Methods: Participants included 9,514 individuals participating in 47 youth and family programs nationwide who completed multiple self-report measures before and after program participation. Data were marked as missing or not missing at the item, scale, and wave levels for both individuals and program sites. Results: Site-level missing data represented a substantial portion of missing data, ranging from 0–46% of missing data at pre-test and 35–71% of missing data at post-test. Youth were the most likely to be missing data, although site-level data did not differ by the age of participants served. In this dataset youth had the most surveys to complete, so their missing data could be due to survey fatigue. Conclusions: Much of the missing data for individuals can be explained by the site not administering those questions or scales. These results suggest a need for statistical methods that account for site-level missing data, and for research design methods to reduce the prevalence of site-level missing data or reduce its impact. Researchers can generate buy-in with sites during the community collaboration stage, assessing problematic items for revision or removal and the need for ongoing site support, particularly at post-test. We recommend that researchers working with multilevel data report the amount and mechanism of missing data at each level.


Author(s): Michele J. Gelfand, Nava Caluori, Sarah Gordon, Jana Raver, Lisa Nishii, ...

Research on culture has generally ignored social situations, and research on social situations has generally ignored culture. In bringing together these two traditions, we show that nations vary considerably in the strength of social situations, and this is a key conceptual and empirical bridge between macro and distal cultural processes and micro and proximal psychological processes. The model thus illustrates some of the intervening mechanisms through which distal societal factors affect individual processes. It also helps to illuminate why cultural differences persist at the individual level, as they are adaptive to chronic differences in the strength of social situations. The strength of situations across cultures can provide new insights into cultural differences in a wide range of psychological processes.


2018, Vol 49 (13), pp. 2197-2205
Author(s): Hannah M. Sallis, George Davey Smith, Marcus R. Munafò

Abstract Background: Despite the well-documented association between smoking and personality traits such as neuroticism and extraversion, little is known about the potential causal nature of these findings. If it were possible to unpick the association between personality and smoking, it may be possible to develop tailored smoking interventions that could lead to both improved uptake and efficacy. Methods: Recent genome-wide association studies (GWAS) have identified variants robustly associated with both smoking phenotypes and personality traits. Here we use publicly available GWAS summary statistics in addition to individual-level data from UK Biobank to investigate the link between smoking and personality. We first estimate genetic overlap between traits using LD score regression and then use bidirectional Mendelian randomisation methods to unpick the nature of this relationship. Results: We found clear evidence of a modest genetic correlation between smoking behaviours and both neuroticism and extraversion. We found some evidence that personality traits are causally linked to certain smoking phenotypes: among current smokers each additional neuroticism risk allele was associated with smoking an additional 0.07 cigarettes per day (95% CI 0.02–0.12, p = 0.009), and each additional extraversion effect allele was associated with an elevated odds of smoking initiation (OR 1.015, 95% CI 1.01–1.02, p = 9.6 × 10⁻⁷). Conclusion: We found some evidence for specific causal pathways from personality to smoking phenotypes, and weaker evidence of an association from smoking initiation to personality. These findings could be used to inform future smoking interventions or to tailor existing schemes.
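As an illustration of how Mendelian randomisation can be run on GWAS summary statistics, the sketch below computes the fixed-effect inverse-variance weighted (IVW) estimate, a standard two-sample estimator. The abstract does not state which estimators the authors used, so treating IVW as the method is an assumption, and the function name `ivw_estimate` is hypothetical.

```python
import numpy as np


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-variant summary statistics.

    beta_exposure : variant effects on the exposure (e.g. a personality trait)
    beta_outcome  : effects of the same variants on the outcome (e.g. smoking)
    se_outcome    : standard errors of beta_outcome
    Returns (estimate, standard_error).
    """
    bx = np.asarray(beta_exposure, dtype=float)
    by = np.asarray(beta_outcome, dtype=float)
    se = np.asarray(se_outcome, dtype=float)

    ratios = by / bx               # per-variant Wald ratio estimates
    weights = bx ** 2 / se ** 2    # inverse variance of each ratio
    estimate = np.sum(weights * ratios) / np.sum(weights)
    std_error = np.sqrt(1.0 / np.sum(weights))
    return estimate, std_error
```

A bidirectional analysis would apply the same function twice, swapping which trait supplies the instruments and which is treated as the outcome.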


2017
Author(s): Ronald de Vlaming, Magnus Johannesson, Patrik K.E. Magnusson, M. Arfan Ikram, Peter M. Visscher

Abstract LD-score (LDSC) regression disentangles the contribution of polygenic signal, in terms of SNP-based heritability, and population stratification, in terms of a so-called intercept, to GWAS test statistics. Whereas LDSC regression uses summary statistics, methods like Haseman-Elston (HE) regression and genomic-relatedness-matrix (GRM) restricted maximum likelihood infer parameters such as SNP-based heritability from individual-level data directly. Therefore, these two types of methods are typically considered to be profoundly different. Nevertheless, recent work has revealed that LDSC and HE regression yield near-identical SNP-based heritability estimates when confounding stratification is absent. We now extend the equivalence; under the stratification assumed by LDSC regression, we show that the intercept can be estimated from individual-level data by transforming the coefficients of a regression of the phenotype on the leading principal components from the GRM. Using simulations, considering various degrees and forms of population stratification, we find that intercept estimates obtained from individual-level data are nearly equivalent to estimates from LDSC regression (R² > 99%). An empirical application corroborates these findings. Hence, LDSC regression is not profoundly different from methods using individual-level data; parameters that are identified by LDSC regression are also identified by methods using individual-level data. In addition, our results indicate that, under strong stratification, there is misattribution of stratification to the slope of LDSC regression, inflating estimates of SNP-based heritability from LDSC regression ceteris paribus. Hence, the intercept is not a panacea for population stratification. Consequently, LDSC-regression estimates should be interpreted with caution, especially when the intercept estimate is significantly greater than one.
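The slope and intercept discussed here come from the standard LDSC relation, in which the expected association chi-square statistic of SNP j is (N h² / M) ℓ_j plus an intercept, with N the sample size, M the number of SNPs, ℓ_j the LD score and h² the SNP-based heritability. The sketch below fits that line by unweighted least squares; this is a simplifying assumption, since LDSC as commonly implemented uses regression weights and a block jackknife for standard errors. The function name `ldsc_unweighted` is hypothetical.

```python
import numpy as np


def ldsc_unweighted(chi2, ld_scores, n_samples, n_snps):
    """Estimate SNP heritability and the LDSC intercept by plain OLS.

    Model: E[chi2_j] = (n_samples * h2 / n_snps) * ld_scores_j + intercept.
    Unweighted fit for illustration only (no weights, no jackknife).
    Returns (h2, intercept).
    """
    chi2 = np.asarray(chi2, dtype=float)
    ld_scores = np.asarray(ld_scores, dtype=float)
    design = np.column_stack([np.ones_like(ld_scores), ld_scores])
    coef = np.linalg.lstsq(design, chi2, rcond=None)[0]
    intercept, slope = coef[0], coef[1]
    h2 = slope * n_snps / n_samples
    return h2, intercept
```

An intercept well above one in such a fit is the quantity the abstract warns about: it signals stratification, and under strong stratification part of that signal may also leak into the slope and inflate the heritability estimate.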


2020
Author(s): Xing Zhao, Feng Hong, Jianzhong Yin, Wenge Tang, Gang Zhang, ...

Abstract Cohort purpose: The China Multi-Ethnic Cohort (CMEC) is a community population-based prospective observational study aiming to address the urgent need for understanding NCD prevalence, risk factors and associated conditions in resource-constrained settings for ethnic minorities in China. Cohort basics: A total of 99 556 participants aged 30 to 79 years (Tibetan populations include those aged 18 to 30 years) from the Tibetan, Yi, Miao, Bai, Bouyei, and Dong ethnic groups in Southwest China were recruited between May 2018 and September 2019. Follow-up and attrition: All surviving study participants will be invited for re-interviews every 3-5 years with concise questionnaires to review risk exposures and disease incidence. Furthermore, the vital status of study participants will be followed up through linkage with established electronic disease registries annually. Design and measures: The CMEC baseline survey collected data with an electronic questionnaire and face-to-face interviews, medical examinations and clinical laboratory tests. Furthermore, we collected biological specimens, including blood, saliva and stool, for long-term storage. In addition to the individual-level data, we also collected regional-level data for each investigation site. Collaboration and data access: Collaborations are welcome. Please send specific ideas to the corresponding author at: [email protected].

