Implementing Privacy-Preserving Genotype Analysis with Consideration for Population Stratification

Andre Ostrak; Jaak Randmets; Ville Sokk; Sven Laur; Liina Kamm

doi:10.3390/cryptography5030021

Cryptography ◽

10.3390/cryptography5030021 ◽

2021 ◽

Vol 5 (3) ◽

pp. 21

Author(s):

Andre Ostrak ◽

Jaak Randmets ◽

Ville Sokk ◽

Sven Laur ◽

Liina Kamm

Keyword(s):

Population Stratification ◽

Data Exchange ◽

Association Studies ◽

Principal Component ◽

Heterogeneous Data ◽

Privacy Preserving ◽

Phenotypic Traits ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

Secure Computing

In bioinformatics, genome-wide association studies (GWAS) are used to detect associations between single-nucleotide polymorphisms (SNPs) and phenotypic traits such as diseases. Significant differences in SNP counts between case and control groups can signal association between variants and phenotypic traits. Most traits are affected by multiple genetic locations. To detect these subtle associations, bioinformaticians need access to more heterogeneous data. Regulatory restrictions in cross-border health data exchange have created a surge in research on privacy-preserving solutions, including secure computing techniques. However, in studies of such scale, one must account for population stratification, as under- and over-representation of sub-populations can lead to spurious associations. We improve on the state of the art of privacy-preserving GWAS methods by showing how to adapt principal component analysis (PCA) with stratification control (EIGENSTRAT), FastPCA, EMMAX and the genomic control algorithm for secure computing. We implement these methods using secure computing techniques—secure multi-party computation (MPC) and trusted execution environments (TEE). Our algorithms are the most complex ones at this scale implemented with MPC. We present performance benchmarks and a security and feasibility trade-off discussion for both techniques.
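
Of the methods named above, genomic control is the simplest to illustrate. The following is a minimal plaintext sketch, not the authors' secure implementation: it assumes a 1-df allelic chi-square test per SNP and the common convention of dividing every statistic by an inflation factor estimated from the median. The function names and the floor of the inflation factor at 1 are assumptions made for illustration.

```python
from statistics import median

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_case = case_alt + case_ref
    row_ctrl = ctrl_alt + ctrl_ref
    col_alt = case_alt + ctrl_alt
    col_ref = case_ref + ctrl_ref
    denom = row_case * row_ctrl * col_alt * col_ref
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / denom


def genomic_control(stats):
    """Return (inflation factor, corrected statistics).

    The inflation factor is the observed median statistic divided by the
    expected median under the null; it is floored at 1 (an assumption here)
    so that statistics are never inflated by the correction.
    """
    lam = max(1.0, median(stats) / CHI2_1DF_MEDIAN)
    return lam, [s / lam for s in stats]
```

In a secure-computing setting the same arithmetic would have to be expressed over secret-shared or enclave-protected values, which this sketch does not attempt.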

Powerful Tukey's One Degree-of-Freedom Test for Detecting Gene-Gene and Gene-Environment Interactions

Cancer Informatics ◽

10.4137/cin.s17305 ◽

2015 ◽

Vol 14s2 ◽

pp. CIN.S17305 ◽

Cited By ~ 1

Author(s):

Yaping Wang ◽

Donghui Li ◽

Peng Wei

Keyword(s):

Statistical Power ◽

Association Studies ◽

Score Test ◽

Principal Component ◽

Case Control ◽

Degree Of Freedom ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

Missing Heritability ◽

Gene Environment

Genome-wide association studies (GWASs) have identified thousands of single nucleotide polymorphisms (SNPs) robustly associated with hundreds of complex human diseases including cancers. However, the large number of GWAS-identified genetic loci explain only a small proportion of the disease heritability. This “missing heritability” problem has been partly attributed to the yet-to-be-identified gene-gene (G × G) and gene-environment (G × E) interactions. In spite of the important roles of G × G and G × E interactions in understanding disease mechanisms and filling in the missing heritability, straightforward GWAS scanning for such interactions has very limited statistical power, leading to few successes. Here we propose a two-step statistical approach to test G × G/G × E interactions: the first step is to perform principal component analysis (PCA) on the multiple SNPs within a gene region, and the second step is to perform Tukey's one degree-of-freedom (1-df) test on the leading PCs. We derive a score test that is computationally fast and numerically stable for the proposed Tukey's 1-df interaction test. Using extensive simulations we show that the proposed approach, which combines the two parsimonious models, namely, the PCA and Tukey's 1-df form of interaction, outperforms other state-of-the-art methods. We also demonstrate the utility and efficiency gains of the proposed method with applications to testing G × G interactions for Crohn's disease using the Wellcome Trust Case Control Consortium (WTCCC) GWAS data and testing G × E interaction using data from a case-control study of pancreatic cancer.
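
The two-step approach can be written compactly. The following is a sketch of the general Tukey 1-df form under assumed notation, not necessarily the authors' exact parameterization: $P^{(1)}_j$ and $P^{(2)}_k$ denote the leading PCs of the two gene regions (or, for G × E, the second sum is replaced by the environmental term), and a logit link is assumed because the applications are case-control studies.

```latex
\operatorname{logit}\Pr(Y = 1)
  = \beta_0
  + \sum_{j} \alpha_j P^{(1)}_j
  + \sum_{k} \gamma_k P^{(2)}_k
  + \theta \Bigl(\sum_{j} \alpha_j P^{(1)}_j\Bigr)
           \Bigl(\sum_{k} \gamma_k P^{(2)}_k\Bigr),
\qquad H_0 : \theta = 0 .
```

Because the interaction is constrained to be the product of the two main-effect terms, a single parameter $\theta$ carries the whole interaction, which is why the test has one degree of freedom regardless of how many PCs are retained.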

Population stratification in GWAS meta-analysis should be standardized to the best available reference datasets

10.1101/2020.09.03.281568 ◽

2020 ◽

Author(s):

Aliya Sarmanova ◽

Tim Morris ◽

Daniel John Lawson

Keyword(s):

Population Stratification ◽

Association Studies ◽

Meta Analysis ◽

Principal Component ◽

Underlying Structure ◽

Genome Wide Association Studies ◽

UK Biobank ◽

External Reference ◽

Major Disadvantage ◽

The UK

Population stratification has recently been demonstrated to bias genetic studies even in relatively homogeneous populations such as within the British Isles. A key component to correcting for stratification in genome-wide association studies (GWAS) is accurately identifying and controlling for the underlying structure present in the sample. Meta-analysis across cohorts is increasingly important for achieving very large sample sizes, but comes with the major disadvantage that each individual cohort corrects for different population stratification. Here we demonstrate that correcting for structure against an external reference adds significant value to meta-analysis. We treat the UK Biobank as a collection of smaller studies, each of which is geographically localised. We provide software to standardize an external dataset against a reference, provide the UK Biobank principal component loadings for this purpose, and demonstrate the value of this with an analysis of the geographically sampled ALSPAC cohort.
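
Standardizing an external dataset against a reference amounts to projecting each individual onto fixed reference loadings instead of recomputing PCs per cohort. A minimal sketch follows; it is not the authors' software. It assumes genotypes coded as 0/1/2 allele counts, scaling by the binomial standard deviation sqrt(2p(1-p)) computed from reference allele frequencies (a common convention that the published tool may not share), and hypothetical function names.

```python
import math


def standardize(genotype, ref_freqs):
    """Centre and scale 0/1/2 allele counts using reference allele frequencies.

    Each SNP is centred at 2p and divided by sqrt(2p(1-p)); monomorphic
    reference SNPs (p equal to 0 or 1) contribute zero.
    """
    out = []
    for g, p in zip(genotype, ref_freqs):
        sd = math.sqrt(2.0 * p * (1.0 - p))
        out.append((g - 2.0 * p) / sd if sd > 0 else 0.0)
    return out


def project(genotype, ref_freqs, loadings):
    """Return one score per reference PC for a single individual.

    `loadings` holds one per-SNP loading vector for each PC.
    """
    z = standardize(genotype, ref_freqs)
    return [sum(zi * wi for zi, wi in zip(z, pc)) for pc in loadings]
```

Because the frequencies and loadings are taken from the reference rather than from the cohort being analysed, scores from different cohorts land on the same axes and can be used as comparable covariates.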

PCA-Based Multiple-Trait GWAS Analysis: A Powerful Model for Exploring Pleiotropy

Animals ◽

10.3390/ani8120239 ◽

2018 ◽

Vol 8 (12) ◽

pp. 239 ◽

Cited By ~ 4

Author(s):

Wengang Zhang ◽

Xue Gao ◽

Xinping Shi ◽

Bo Zhu ◽

Zezhao Wang ◽

...

Keyword(s):

Quantitative Trait ◽

Statistical Power ◽

Muscle Development ◽

Association Studies ◽

Simulated Data ◽

Principal Component ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

Multiple Trait ◽

GWAS Analysis

Principal component analysis (PCA) is a potential approach that can be applied in multiple-trait genome-wide association studies (GWAS) to explore pleiotropy, as well as increase the power of quantitative trait loci (QTL) detection. In this study, the relationship of test single nucleotide polymorphisms (SNPs) was determined between single-trait GWAS and PCA-based GWAS. We found that the estimated pleiotropic quantitative trait nucleotides (QTNs) $\hat{\beta}^{*}$ were in most cases larger than the single-trait model estimations ($\hat{\beta}_1$ and $\hat{\beta}_2$). Analysis using the simulated data showed that PCA-based multiple-trait GWAS has improved statistical power for detecting QTL compared to single-trait GWAS. For the minor allele frequency (MAF), when the MAF of QTNs was greater than 0.2, the PCA-based model had a significant advantage in detecting the pleiotropic QTNs, but when its MAF was reduced from 0.2 to 0, the advantage began to disappear. In addition, as the linkage disequilibrium (LD) of the pleiotropic QTNs decreased, its detection ability declined in the co-localization effect model. Furthermore, on the real data of 1141 Simmental cattle, we applied the PCA model to the multiple-trait GWAS analysis and identified a QTL that was consistent with a candidate gene, MCHR2, which was associated with presoma muscle development in cattle. In summary, PCA-based multiple-trait GWAS is an efficient model for exploring pleiotropic QTNs in quantitative traits.
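
To make the PCA-based multiple-trait idea concrete, here is a minimal sketch for the two-trait case only; it is an illustration under stated assumptions, not the authors' model. For two standardized traits the principal components of the correlation matrix are always the scaled sum and difference of the traits, so no eigen-solver is needed. The function names are hypothetical, the SNP effect is estimated by ordinary least squares on 0/1/2 genotypes, and both traits are assumed to vary.

```python
import math
from statistics import mean, pstdev


def zscore(values):
    """Standardize a trait to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def trait_pcs(trait1, trait2):
    """Return the two PCs of a standardized trait pair.

    For a 2x2 correlation matrix the eigenvectors are (1, 1)/sqrt(2) and
    (1, -1)/sqrt(2), whatever the correlation between the traits.
    """
    z1, z2 = zscore(trait1), zscore(trait2)
    pc_sum = [(a + b) / math.sqrt(2) for a, b in zip(z1, z2)]
    pc_diff = [(a - b) / math.sqrt(2) for a, b in zip(z1, z2)]
    return pc_sum, pc_diff


def slope(genotypes, phenotype):
    """Least-squares effect of a SNP (0/1/2 coding) on a phenotype."""
    gm, pm = mean(genotypes), mean(phenotype)
    sxy = sum((g - gm) * (p - pm) for g, p in zip(genotypes, phenotype))
    sxx = sum((g - gm) ** 2 for g in genotypes)
    return sxy / sxx
```

In the degenerate case where both traits are identical, the slope on the sum component is exactly sqrt(2) times the single-trait slope on the standardized scale, which gives some intuition for why a shared (pleiotropic) effect can appear larger on the leading PC.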

Genomic Analysis of Vavilov’s Historic Chickpea Landraces Reveals Footprints of Environmental and Human Selection

International Journal of Molecular Sciences ◽

10.3390/ijms21113952 ◽

2020 ◽

Vol 21 (11) ◽

pp. 3952 ◽

Cited By ~ 1

Author(s):

Alena Sokolkova ◽

Sergey V. Bulyntsev ◽

Peter L. Chang ◽

Noelia Carrasquilla-Garcia ◽

Anna A. Igolkina ◽

...

Keyword(s):

Association Studies ◽

Genomic Analysis ◽

Phenotypic Traits ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

Multiple Traits ◽

Phenotypic Data ◽

Genome Wide ◽

Bioclimatic Variables ◽

Modern Breeding

A defining challenge of the 21st century is meeting the nutritional demands of the growing human population, under a scenario of limited land and water resources and under the specter of climate change. The Vavilov seed bank contains numerous landraces collected nearly a hundred years ago, and thus may contain ‘genetic gems’ with the potential to enhance modern breeding efforts. Here, we analyze 407 landraces, sampled from major historic centers of chickpea cultivation and secondary diversification. Genome-Wide Association Studies (GWAS) conducted on both phenotypic traits and bioclimatic variables at landraces sampling sites as extended phenotypes resulted in 84 GWAS hits associated to various regions. The novel haploblock-based test identified haploblocks enriched for single nucleotide polymorphisms (SNPs) associated with phenotypes and bioclimatic variables. Subsequent bi-clustering of traits sharing enriched haploblocks underscored both non-random distribution of SNPs among several haploblocks and their association with multiple traits. We hypothesize that these clusters of pleiotropic SNPs represent co-adapted genetic complexes to a range of environmental conditions that chickpea experienced during domestication and subsequent geographic radiation. Linking genetic variation to phenotypic data and a wealth of historic information preserved in historic seed banks are the keys for genome-based and environment-informed breeding intensification.

An explainable model of host genetic interactions linked to COVID-19 severity

10.21203/rs.3.rs-1062190/v2 ◽

2022 ◽

Author(s):

Anthony Onoja ◽

Nicola Picchiotti ◽

Chiara Fallerini ◽

Margherita Baldassarri ◽

Francesca Fava ◽

...

Keyword(s):

Molecular Mechanisms ◽

Genetic Factors ◽

Association Studies ◽

Principal Component ◽

Phenotypic Traits ◽

Genome Wide Association Studies ◽

Lectin Receptor ◽

Increased Risk ◽

Importance Analysis ◽

Host Genetic

We employed a multifaceted computational strategy to identify the genetic factors contributing to increased risk of severe COVID-19 infection from a Whole Exome Sequencing (WES) dataset of a cohort of 2000 Italian patients. We coupled a stratified k-fold screening, to rank variants more associated with severity, with training of multiple supervised classifiers, to predict severity on the basis of screened features. Feature importance analysis from tree-based models allowed us to identify the 16 variants with highest support which, together with age and gender covariates, were found to be most predictive of COVID-19 severity. When tested on a follow-up cohort, our ensemble of models predicted severity with good accuracy (ACC=81.88%; ROC_AUC=96%; MCC=61.55%). Principal Component Analysis (PCA) and clustering of patients on important variants orthogonally identified two groups of individuals with a higher fraction of severe cases. Our model recapitulated a vast literature of emerging molecular mechanisms and genetic factors linked to COVID-19 response and extends previous landmark Genome Wide Association Studies (GWAS). It revealed a network of interplaying genetic signatures converging on established immune system and inflammatory processes linked to viral infection response, such as JAK-STAT, Cytokine, Interleukin, and C-type lectin receptor signaling. It also identified additional processes cross-talking with immune pathways, such as GPCR signalling, which might offer additional opportunities for therapeutic intervention and patient stratification. Publicly available PheWAS datasets revealed that several variants were significantly associated with phenotypic traits such as “Respiratory or thoracic disease”, confirming their link with COVID-19 severity outcome. Taken together, our analysis suggests that curated genetic information can be effectively integrated along with other patient clinical covariates to forecast COVID-19 disease severity and dissect the underlying host genetic mechanisms for personalized medicine treatments.

Evaluation of methods for adjusting population stratification in genome‐wide association studies: Standard versus categorical principal component analysis

Annals of Human Genetics ◽

10.1111/ahg.12339 ◽

2019 ◽

Vol 83 (6) ◽

pp. 454-464

Author(s):

Asuman S. Turkmen ◽

Yuan Yuan ◽

Nedret Billor

Keyword(s):

Principal Component Analysis ◽

Population Stratification ◽

Association Studies ◽

Principal Component ◽

Component Analysis ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Categorical Principal Component Analysis ◽

Evaluation Of Methods

CluStrat: a structure informed clustering strategy for population stratification

10.1101/2020.01.15.908228 ◽

2020 ◽

Cited By ~ 1

Author(s):

Aritra Bose ◽

Myson C. Burch ◽

Agniva Chowdhury ◽

Peristera Paschou ◽

Petros Drineas

Keyword(s):

Mahalanobis Distance ◽

Population Stratification ◽

Association Studies ◽

Principal Component ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Standard Population ◽

Leverage Scores ◽

Complex Population ◽

Causal Variants

Genome-wide association studies (GWAS) have been extensively used to estimate the signed effects of trait-associated alleles. Recent independent studies failed to replicate the strong evidence of selection for height across Europe, implying shortcomings in standard population stratification correction approaches. Here, we present CluStrat, a stratification correction algorithm for complex population structure that leverages the linkage disequilibrium (LD)-induced distances between individuals. CluStrat performs agglomerative hierarchical clustering using the Mahalanobis distance and then applies sketching-based randomized ridge regression on the genotype data to obtain the association statistics. With the growing size of data, computing and storing the genome-wide covariance matrix is a non-trivial task. We get around this overhead by computing the genetic relationship matrix (GRM) directly using a connection between statistical leverage scores and the Mahalanobis distance. We test CluStrat on a large simulation study of discrete and admixed, arbitrarily-structured sub-populations, identifying two- to three-fold more true causal variants than principal component (PC) based stratification correction methods, at the cost of a slightly higher number of spurious associations. Applying CluStrat to WTCCC2 Parkinson’s disease (PD) data, we identified loci mapped to a host of genes associated with PD, such as BACH2, MAP2, NR4A2, SLC11A1 and UNC5C, to name a few. Availability and Implementation: CluStrat source code and user manual are available at: https://github.com/aritra90/CluStrat

Effect of population stratification on the identification of significant single-nucleotide polymorphisms in genome-wide association studies

BMC Proceedings ◽

10.1186/1753-6561-3-s7-s13 ◽

2009 ◽

Vol 3 (Suppl 7) ◽

pp. S13 ◽

Cited By ~ 10

Author(s):

Sara M Sarasua ◽

Julianne S Collins ◽

Dhelia M Williamson ◽

Glen A Satten ◽

Andrew S Allen

Keyword(s):

Single Nucleotide Polymorphisms ◽

Population Stratification ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Genome Wide

Candidate genes for productivity identified by genome-wide association study with indicators of class in the Russian meat merino sheep breed

Vavilov Journal of Genetics and Breeding ◽

10.18699/vj20.681 ◽

2020 ◽

Vol 24 (8) ◽

pp. 836-843

Author(s):

A. Y. Krivoruchko ◽

O. A. Yatsyk ◽

E. Y. Safaryan

Keyword(s):

Candidate Genes ◽

Genome Wide Association Study ◽

Association Studies ◽

Genome Wide Association ◽

Phenotypic Traits ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

High Association ◽

Genome Wide ◽

Individual SNPs

Genome-wide association studies allow identification of loci and polymorphisms associated with the formation of relevant phenotypes. When conducting a full genome analysis of sheep, particularly promising is the study of individuals with outstanding productivity indicators: exhibition animals, representatives of the super-elite class. The aim of this study was to identify new candidate genes for economically valuable traits based on the search for single nucleotide polymorphisms (SNPs) associated with belonging to different evaluation classes in rams of the Russian meat merino breed. Animal genotyping was performed using Ovine Infinium HD BeadChip 600K DNA, and the association search was performed using PLINK v. 1.07 software. Highly reliable associations were found between animals belonging to different evaluation classes and the frequency of occurrence of individual SNPs on chromosomes 2, 6, 10, 13, and 20. Most of the substitutions with high association reliability are concentrated on chromosome 10 in the region 10: 30859297–31873769. To search for candidate genes, 15 polymorphisms with the highest association reliability were selected (-log10(p) > 9). Determining the location of the analyzed SNPs relative to the latest annotation, Oar_rambouillet_v1.0, allowed us to identify 11 candidate genes presumably associated with the formation of a complex of phenotypic traits of animals in the exhibition group: RXFP2, ALOX5AP, MEDAG, OPN5, PRDM5, PTPRT, TRNAS-GGA, EEF1A1, FRY, ZBTB21-like, and B3GLCT-like. The listed genes encode proteins involved in the control of the cell cycle and DNA replication, regulation of cell proliferation and apoptosis, lipid and carbohydrate metabolism, the development of the inflammatory process and the functioning of circadian rhythms. Thus, the candidate genes under consideration can influence the formation of exterior features and productive qualities of sheep. However, further research is needed to confirm the influence of these genes and to determine the exact mechanisms by which they affect the phenotype.

Molecular Characterization of Global Finger Millet (Eleusine coracana, L. Gaertn) germplasm Reaction to Striga in Kenya

Asian Journal of Biochemistry, Genetics and Molecular Biology ◽

10.9734/ajbgmb/2018/v1i2493 ◽

2018 ◽

pp. 1-14

Author(s):

Sirengo Peter Nyongesa ◽

Wamalwa Dennis Simiyu ◽

Oduor Chrispus ◽

Odeny Damaris Achieng ◽

Dangasuk Otto George

Keyword(s):

Finger Millet ◽

Association Studies ◽

Block Design ◽

Principal Component ◽

Genotyping By Sequencing ◽

Eleusine Coracana ◽

Striga Hermonthica ◽

Genome Wide Association Studies ◽

Nucleotide Polymorphisms ◽

Striga Resistance

Finger millet (Eleusine coracana, L. Gaertn) is an important food crop in Africa and Asia. The parasitic weed Striga hermonthica (Del.) Benth limits finger millet production through reduced yield in the agro-ecologies where it occurs. The damage of Striga to cereal crops is more severe under drought and low soil fertility. This study aims to determine the genetic basis for reaction to Striga hermonthica among selected finger millet germplasm through genotyping by sequencing (GBS). One hundred finger millet genotypes were evaluated for reaction to Striga hermonthica infestation under field conditions at Alupe and Kibos in Western Kenya. The experiment was laid out in a randomized complete block design (RCBD) consisting of a 10 × 10 square (triple lattice) under Striga (inoculated) and no-Striga conditions, and plant growth was monitored to maturity after 110 days. All genotypes were genotyped by GBS and the data were analyzed using the non-reference-based Universal Network Enabled Analysis Kit (UNEAK) pipeline. Genome-wide association studies (GWAS) were done to establish the association of detected single nucleotide polymorphisms (SNPs) with Striga reaction based on field results. In the molecular analysis, 117,542 SNPs from raw GBS data used in GWAS revealed that markers TP 85424 and TP 88244 were associated with Striga resistance in the 95 genotypes. Principal Component Analysis revealed that the first and third component axes accounted for 2.5 and 8% of total variance respectively and the genotypes were distributed according to their reaction to Striga weed. Genetic diversity analysis grouped the 95 accessions into three major clusters containing 32 (A), 56 (B), and 7 (C) genotypes. All finger millet genotypes that showed high resistance to Striga in the field were from cluster B while the most susceptible genotypes were from clusters A and C. Results revealed genetic variation for Striga resistance in cultivated finger millet genotypes and hence the possibility of marker-assisted breeding for resistance to Striga.
