Deviations from Hardy Weinberg Equilibrium at CCR5-Δ32 in Large Sequencing Data Sets

2019
Author(s): Xinzhu Wei, Rasmus Nielsen

Abstract Previous analyses of the UK Biobank (UKB) genotyping array data at the CCR5-Δ32 locus show evidence for deviations from Hardy-Weinberg Equilibrium (HWE) and an increased mortality rate of homozygous individuals, consistent with a recessive deleterious effect of the deletion mutation. Here we examine whether similar deviations from HWE can be observed in the newly released UKB Whole Exome Sequencing (WES) data and in the sequencing data of the Genome Aggregation Database (gnomAD). We also examine the reliability of the genotype calls in the UKB array data. The UKB genotyping array probe targeting CCR5-Δ32 (rs62625034) and the WES of Δ32 are strongly correlated (r² = 0.97). This contrasts with tag SNPs of CCR5-Δ32 in the UKB, which have high missing data rates and imputation error rates. We also show that, while different data sets are subject to different biases, both the UKB-WES and the gnomAD data have a deficiency of homozygous CCR5-Δ32 individuals compared to the HWE expectation (combined P-value < 0.01), consistent with an increased mortality rate in homozygotes. Finally, we perform a survival analysis on data from parents of UKB volunteers, which, while underpowered, is also consistent with the original report of a deleterious effect of CCR5-Δ32 in the homozygous state.
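To illustrate what a "deficiency of homozygotes compared to the HWE expectation" means in practice, here is a minimal sketch of the standard one-degree-of-freedom χ² test for a biallelic locus. The function name and the genotype counts are hypothetical and are not taken from the UKB or gnomAD data; the abstract does not say which test the authors used.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions at a biallelic locus.

    Returns (expected_counts, chi2, p_value). Illustrative helper only.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p                        # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return expected, chi2, p_value

# Hypothetical counts: 90 minor-allele homozygotes where HWE predicts about 100.
expected, chi2, p_value = hwe_chi_square(8090, 1820, 90)
print(expected, chi2, p_value)
```

With very large samples, an exact test or a test that accounts for population structure may be preferable; this sketch only shows the basic comparison of observed and expected genotype counts.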

2019
Author(s): Robert Maier, Ali Akbari, Xinzhu Wei, Nick Patterson, Rasmus Nielsen, ...

Abstract A recent study reported that a 32-base-pair deletion in the CCR5 gene (CCR5-∆32) is deleterious in the homozygous state in humans. Evidence for this came from a survival analysis in the UK Biobank cohort, and from deviations from Hardy-Weinberg equilibrium at a polymorphism tagging the deletion (rs62625034). Here, we carry out a joint analysis of whole-genome genotyping data and whole-exome sequencing data from the UK Biobank, which reveals that technical artifacts are a more plausible cause for deviations from Hardy-Weinberg equilibrium at this polymorphism. Specifically, we find that individuals homozygous for the deletion in the sequencing data are underrepresented in the genotyping data due to an elevated rate of missing data at rs62625034, possibly because the probe for this SNP overlaps with the ∆32 deletion. Another variant which has a higher concordance with the deletion in the sequencing data shows no associations with mortality. A phenome-wide scan for effects of variants tagging this deletion shows an overall inflation of association p-values, but identifies only one trait at p < 5×10⁻⁸, and no mediators for an effect on mortality. These analyses show that the original reports of a recessive deleterious effect of CCR5-∆32 are affected by a technical artifact, and that a closer investigation of the same data provides no positive evidence for an effect on lifespan.
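The mechanism described here, genotype-dependent missingness producing an apparent homozygote deficit, can be sketched numerically. The example below assumes a population that is exactly in Hardy-Weinberg proportions and applies a lower call rate to one homozygous class. The counts and call rates are invented for illustration and are not estimates from the UK Biobank.

```python
def observed_counts(true_counts, call_rates):
    """Expected number of called genotypes per class, given per-class call rates."""
    return tuple(n * r for n, r in zip(true_counts, call_rates))

def expected_minor_homozygotes(counts):
    """HWE-expected count of minor-allele homozygotes, using the allele
    frequency estimated from the supplied (possibly biased) genotype counts."""
    n_aa, n_ab, n_bb = counts
    n = n_aa + n_ab + n_bb
    q = (2 * n_bb + n_ab) / (2 * n)   # minor allele frequency
    return n * q * q

# Hypothetical truth in exact HWE (q = 0.1); deletion homozygotes are
# assumed to be called only 80% of the time, other genotypes 99%.
true_counts = (8100, 1800, 100)
called = observed_counts(true_counts, (0.99, 0.99, 0.80))

print("called minor homozygotes:", called[2])
print("HWE expectation from called data:", expected_minor_homozygotes(called))
```

In this toy setting the called data show fewer homozygotes than the HWE expectation computed from the same data, even though the underlying population has no deficit at all.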


2019
Author(s): Daniel Gudbjartsson, Patric Sulem, Kári Stefansson, Nina Mars, Juha Karjalainen, ...

Recently, Wei and Nielsen [1] reported an analysis of UK Biobank data which suggested that the well-known HIV-protective variant CCR5-del32 is associated with a 21% increase in all-cause mortality. We demonstrate, using two well-powered population samples in Iceland and Finland with extensive health data and death information, neither an effect on mortality nor an increase in risk of any disease. Further reexamination of the UK Biobank (UKBB) data suggests that the very modest association was with a SNP of poor genotyping quality: at a nearby proxy SNP, neither a statistically significant impact on mortality nor a deviation from Hardy-Weinberg equilibrium exists in the UKBB sample. We thus find no evidence of any meaningful risk of increased mortality from homozygosity of CCR5-del32.


Stats, 2020, Vol 3 (1), pp. 34-39
Author(s): Vladimir Ostrovski

We consider testing equivalence to Hardy–Weinberg Equilibrium in case of multiple alleles. Two different test statistics are proposed for this test problem. The asymptotic distribution of the test statistics is derived. The corresponding tests can be carried out using asymptotic approximation. Alternatively, the variance of the test statistics can be estimated by the bootstrap method. The proposed tests are applied to three real data sets. The finite sample performance of the tests is studied by simulations, which are inspired by the real data sets.


Genetics, 2021
Author(s): Alan M Kwong, Thomas W Blackwell, Jonathon LeFaive, Mariza de Andrade, John Barnard, ...

Abstract Traditional Hardy–Weinberg equilibrium (HWE) tests (the χ² test and the exact test) have long been used as a metric for evaluating genotype quality, as technical artifacts leading to incorrect genotype calls often can be identified as deviations from HWE. However, in data sets composed of individuals from diverse ancestries, HWE can be violated even without genotyping error, complicating the use of HWE testing to assess genotype data quality. In this manuscript, we present the Robust Unified Test for HWE (RUTH) to test for HWE while accounting for population structure and genotype uncertainty, and to evaluate the impact of population heterogeneity and genotype uncertainty on the standard HWE tests and alternative methods using simulated and real sequence data sets. Our results demonstrate that ignoring population structure or genotype uncertainty in HWE tests can inflate false-positive rates by many orders of magnitude. Our evaluations demonstrate different tradeoffs between false positives and statistical power across the methods, with RUTH consistently among the best across all evaluations. RUTH is implemented as a practical and scalable software tool to rapidly perform HWE tests across millions of markers and hundreds of thousands of individuals while supporting standard VCF/BCF formats. RUTH is publicly available at https://www.github.com/statgen/ruth.


2021
Author(s): Orna Mizrahi Man, Marcos H Woehrmann, Teresa A Webster, Jeremy Gollub, Adrian Bivol, ...

Objective: To significantly improve the positive predictive value (PPV) and sensitivity of Applied Biosystems™ Axiom™ array variant calling, by means of novel improvement to genotyping algorithms and careful quality control of array probesets. The improvement makes array genotyping more suitable for very rare variants. Design: Retrospective evaluation of UK Biobank array data re-genotyped with improved algorithms for rare variants. Participant: 488,359 people recruited to the UK Biobank with Axiom array genotyping data including 200,630 with exome sequencing data. Main Outcome Measures: A comparison of genotyping calls from array data to genotyping calls on a subset of variants with exome sequencing data. Results: Axiom genotyping [18] performed well, based on comparison to sequencing data, for over 100,000 common variants directly genotyped on the Axiom UK Biobank array and also exome sequenced by the UK Biobank Exome Sequencing Consortium. However, in a comparison to the initial exome sequencing results of the first 50K individuals, Weedon et al. [1] observed that when grouping these variants by the minor allele frequency (MAF) observed in UK Biobank, the concordance with sequencing and resulting positive predictive value (PPV) decreased with the number of heterozygous (Het) array calls per variant. An improved genotyping algorithm, Rare Heterozygous Adjustment (RHA) [16], released mid-2020 for genotyping on Axiom arrays, significantly improves PPV in all MAF ranges for the 50K data as well as when compared to the exome sequencing of 200K individuals, released after Weedon et al. [1] performed their comparison. The RHA algorithm improved PPVs in the 200K data in the lowest three frequency groups [0, 0.001%), [0.001%, 0.005%) and [0.005%, 0.01%) to 83%, 82% and 88%; respectively. PPV was above 95% for higher MAF ranges without algorithm improvement. 
PPVs are somewhat higher in the 200K dataset, due to a different "truth set" from exome sequencing and because monomorphic exome loci are not included in the joint genotyping calls for the 200K data set, as explained in the methods section. Sensitivity was higher in the 200K data set than in the original 50K data as well, especially for low MAF ranges. This increase is in part due to the larger data set over which sensitivity could be computed and in part due to the different WES algorithms used for the 200K data [7]. Filtering of a relatively small number of non-performing probesets (determined without reference to the exome sequencing data) significantly improved sensitivities for all MAF ranges, resulting in 70%, 88% and 94% respectively in the three lowest MAF ranges and greater than 98% and 99.9% for the two higher MAF ranges ([0.01%, 1%), [1%, 50%]). Conclusions: Improved algorithms for genotyping along with enhanced quality control of array probesets, significantly improve the positive predictive value and the sensitivity of array data, making it suitable for the detection of very rare variants. The probeset filtering methods developed have resulted in better probe designs for arrays and the new genotyping algorithm is part of the standard algorithm for all Axiom arrays since early 2020.


2021
Author(s): William Zhu, Xiaoping Huang, Esther Yoon, Sara P Bandres Ciga, Cornelis Blauwendraat, ...

PRKN mutations are the most common recessive cause of Parkinson's disease (PD) and are a promising target for gene and cell replacement therapies. Identification of biallelic PRKN patients (PRKN-PD) at the population scale, however, remains a challenge, as roughly half are copy number variants (CNVs) and many single nucleotide polymorphisms (SNPs) are of unclear significance. Additionally, the true prevalence and disease risk associated with heterozygous PRKN mutations is unclear, as a comprehensive assessment of PRKN SNPs and CNVs has not been performed at a population scale. To address these challenges, we evaluated PRKN mutations in 2 cohorts analyzed with both a genotyping array and exome or genome sequencing: the NIH PD cohort, a deeply phenotyped cohort of PD patients, and the UK Biobank, a population scale cohort with nearly half a million participants. Genotyping array identified the majority of PRKN mutations and at least 1 mutation in most biallelic PRKN mutation carriers in both cohorts. Additionally, in the NIH PD cohort, functional assays of patient fibroblasts resolved variants of unclear significance in biallelic carriers and ruled out cryptic loss of function variants in monoallelic carriers. In the UK Biobank, we identified 2,692 PRKN CNVs from genotyping array data from nearly half a million participants (the largest collection to date). Deletions or duplications involving exon 2 accounted for roughly half of all CNVs and the vast majority (88%) involved exons 2, 3, or 4. Combining estimates from whole exome sequencing (from ~200,000 participants) and genotyping array data, we found a pathogenic PRKN mutation in 1.8% of participants and 2 mutations in ~1/7,800 participants. Those with 1 PRKN pathogenic variant were as likely as non-carriers to have PD (OR = 0.91, CI = 0.58 – 1.38, p-value = 0.76) or a parent with PD (OR = 1.12, CI = 0.94 – 1.31, p-value = 0.19). 
Together our results demonstrate that heterozygous pathogenic PRKN mutations are common in the population but do not increase the risk of PD. Additionally, they suggest a cost-effective framework to screen for biallelic PRKN patients at the population scale for targeted studies.
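For readers unfamiliar with how an odds ratio and its confidence interval are read, the following is a minimal sketch of an odds ratio with a Wald 95% interval computed from a 2×2 table. The abstract does not state how its ORs were estimated (a regression model with covariates is plausible), so this is only a textbook illustration; the function name and the counts are hypothetical.

```python
import math

def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: carriers with disease      b: carriers without disease
    c: non-carriers with disease  d: non-carriers without disease
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, not taken from the study.
print(odds_ratio_wald_ci(20, 3580, 1100, 195300))
```

An interval that contains 1, as both intervals quoted in the abstract do, is compatible with no difference in odds between carriers and non-carriers.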


2018
Author(s): Jonas Meisner, Anders Albrechtsen

Abstract Testing for Hardy-Weinberg Equilibrium (HWE) is a common practice for quality control in genetic studies. Variable sites violating HWE may be identified as technical errors in the sequencing or genotyping process, or they may be of special evolutionary interest. Large-scale genetic studies based on next-generation sequencing (NGS) methods have become more prevalent as cost is decreasing but these methods are still associated with statistical uncertainty. The large-scale studies usually consist of samples from diverse ancestries that make the existence of some degree of population structure almost inevitable. Precautions are therefore needed when analyzing these datasets, as population structure causes deviations from HWE. Here we propose a method that takes population structure into account in the testing for HWE, such that other factors causing deviations from HWE can be detected. We show the effectiveness of our method in NGS data, as well as in genotype data, for both simulated and real datasets, where the use of genotype likelihoods enables us to model the uncertainty for low-depth sequencing data.
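The statement that population structure causes deviations from HWE can be seen in a small worked example, often called the Wahlund effect: pooling subpopulations that are each in HWE but differ in allele frequency yields fewer heterozygotes than the pooled allele frequency predicts. The sketch below is not the authors' method; the function name, allele frequencies, and mixing weights are hypothetical.

```python
def pooled_heterozygosity(allele_freqs, weights):
    """Compare heterozygosity in a pooled sample with its HWE expectation.

    allele_freqs: frequency p of one allele in each subpopulation,
                  each subpopulation assumed to be in HWE on its own.
    weights:      fraction of the pooled sample from each subpopulation.
    Returns (observed_het, expected_het).
    """
    # Heterozygote frequency actually present in the pooled sample.
    observed_het = sum(w * 2 * p * (1 - p) for p, w in zip(allele_freqs, weights))
    # Heterozygote frequency HWE would predict from the pooled allele frequency.
    p_bar = sum(w * p for p, w in zip(allele_freqs, weights))
    expected_het = 2 * p_bar * (1 - p_bar)
    return observed_het, expected_het

obs_het, exp_het = pooled_heterozygosity([0.1, 0.5], [0.5, 0.5])
print(obs_het, exp_het, exp_het - obs_het)
```

The deficit equals twice the variance of the allele frequency across subpopulations, which is why a test that ignores structure can flag well-genotyped variants as HWE violations.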


BMJ, 2021, pp. n214
Author(s): Weedon MN, Jackson L, Harrison JW, Ruth KS, Tyrrell J, ...

Abstract Objective To determine whether the sensitivity and specificity of SNP chips are adequate for detecting rare pathogenic variants in a clinically unselected population. Design Retrospective, population based diagnostic evaluation. Participants 49 908 people recruited to the UK Biobank with SNP chip and next generation sequencing data, and an additional 21 people who purchased consumer genetic tests and shared their data online via the Personal Genome Project. Main outcome measures Genotyping (that is, identification of the correct DNA base at a specific genomic location) using SNP chips versus sequencing, with results split by frequency of that genotype in the population. Rare pathogenic variants in the BRCA1 and BRCA2 genes were selected as an exemplar for detailed analysis of clinically actionable variants in the UK Biobank, and BRCA related cancers (breast, ovarian, prostate, and pancreatic) were assessed in participants through use of cancer registry data. Results Overall, genotyping using SNP chips performed well compared with sequencing; sensitivity, specificity, positive predictive value, and negative predictive value were all above 99% for 108 574 common variants directly genotyped on the SNP chips and sequenced in the UK Biobank. However, the likelihood of a true positive result decreased dramatically with decreasing variant frequency; for variants that are very rare in the population, with a frequency below 0.001% in UK Biobank, the positive predictive value was very low and only 16% of 4757 heterozygous genotypes from the SNP chips were confirmed with sequencing data. Results were similar for SNP chip data from the Personal Genome Project, and 20/21 individuals analysed had at least one false positive rare pathogenic variant that had been incorrectly genotyped. 
For pathogenic variants in the BRCA1 and BRCA2 genes, which are individually very rare, the overall performance metrics for the SNP chips versus sequencing in the UK Biobank were: sensitivity 34.6%, specificity 98.3%, positive predictive value 4.2%, and negative predictive value 99.9%. Rates of BRCA related cancers in UK Biobank participants with a positive SNP chip result were similar to those for age matched controls (odds ratio 1.31, 95% confidence interval 0.99 to 1.71) because the vast majority of variants were false positives, whereas sequence positive participants had a significantly increased risk (odds ratio 4.05, 2.72 to 6.03). Conclusions SNP chips are extremely unreliable for genotyping very rare pathogenic variants and should not be used to guide health decisions without validation.
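The four performance metrics reported above all derive from a 2×2 comparison of chip calls against sequencing calls. The sketch below shows the standard definitions, treating sequencing as the reference. The function name and counts are hypothetical, chosen only to show how a rare variant can combine high specificity with low positive predictive value; they do not reproduce the study's data.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic accuracy metrics from confusion-matrix counts.

    tp: chip positive, sequencing positive   fp: chip positive, sequencing negative
    fn: chip negative, sequencing positive   tn: chip negative, sequencing negative
    """
    return {
        "sensitivity": tp / (tp + fn),   # true positives among sequence-positives
        "specificity": tn / (tn + fp),   # true negatives among sequence-negatives
        "ppv": tp / (tp + fp),           # confirmed calls among chip-positives
        "npv": tn / (tn + fn),           # confirmed calls among chip-negatives
    }

# Hypothetical rare-variant scenario: few true carriers, many non-carriers.
metrics = diagnostic_metrics(tp=30, fp=70, fn=10, tn=9890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because non-carriers vastly outnumber carriers for a rare variant, even a small false-positive rate produces more false than true positive calls, which is why specificity can stay high while positive predictive value drops.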


2021, Vol 12 (2), pp. 317-334
Author(s): Omar Alaqeeli, Li Xing, Xuekui Zhang

The classification tree is a widely used machine learning method. It has multiple implementations as R packages: rpart, ctree, evtree, tree and C5.0. The details of these implementations are not the same, and hence their performances differ from one application to another. We are interested in their performance in the classification of cells using single-cell RNA-sequencing data. In this paper, we conducted a benchmark study using 22 single-cell RNA-sequencing data sets. Using cross-validation, we compare the packages' prediction performances based on their Precision, Recall, F1-score, and Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score and AUC; C5.0 prefers more complex trees; tree is consistently much faster than others, although its complexity is often higher than others.

