Power and type I error results for a bias-correction approach recently shown to provide accurate odds ratios of genetic variants for the secondary phenotypes associated with primary diseases

Jian Wang; Sanjay Shete

doi:10.1002/gepi.20611

Statistical inference of genetic pathway analysis in high dimensions

Biometrika ◽

10.1093/biomet/asz033 ◽

2019 ◽

Vol 106 (3) ◽

pp. 651-651

Author(s):

Yang Liu ◽

Wei Sun ◽

Alexander P Reiner ◽

Charles Kooperberg ◽

Qianchuan He

Keyword(s):

Statistical Inference ◽

Pathway Analysis ◽

Genetic Variants ◽

Error Control ◽

Genome Wide Association Study ◽

Type I Error ◽

High Density Lipoproteins ◽

Type I ◽

Genetic Pathway ◽

A Genome

Summary Genetic pathway analysis has become an important tool for investigating the association between a group of genetic variants and traits. With dense genotyping and extensive imputation, the number of genetic variants in biological pathways has increased considerably and sometimes exceeds the sample size $n$. Conducting genetic pathway analysis and statistical inference in such settings is challenging. We introduce an approach that can handle pathways whose dimension $p$ could be greater than $n$. Our method can be used to detect pathways that have nonsparse weak signals, as well as pathways that have sparse but stronger signals. We establish the asymptotic distribution for the proposed statistic and conduct theoretical analysis on its power. Simulation studies show that our test has correct Type I error control and is more powerful than existing approaches. An application to a genome-wide association study of high-density lipoproteins demonstrates the proposed approach.

Download Full-text

Optimal selection of genetic variants for adjustment of population stratification in European association studies

Briefings in Bioinformatics ◽

10.1093/bib/bbz023 ◽

2019 ◽

Vol 21 (3) ◽

pp. 753-761 ◽

Cited By ~ 2

Author(s):

Regina Brinster ◽

Dominique Scherer ◽

Justo Lorenzo Bermejo

Keyword(s):

Genetic Variants ◽

Population Stratification ◽

Statistical Power ◽

Type I Error ◽

Association Studies ◽

Reference Sample ◽

Error Rates ◽

The Cancer Genome Atlas ◽

Type I ◽

Genotype Data

Abstract Population stratification is usually corrected relying on principal component analysis (PCA) of genome-wide genotype data, even in populations considered genetically homogeneous, such as Europeans. The need to genotype only a small number of genetic variants that show large differences in allele frequency among subpopulations—so-called ancestry-informative markers (AIMs)—instead of the whole genome for stratification adjustment could represent an advantage for replication studies and candidate gene/pathway studies. Here we compare the correction performance of classical and robust principal components (PCs) with the use of AIMs selected according to four different methods: the informativeness for assignment measure ($IN$-AIMs), the combination of PCA and F-statistics, PCA-correlated measurement and the PCA weighted loadings for each genetic variant. We used real genotype data from the Population Reference Sample and The Cancer Genome Atlas to simulate European genetic association studies and to quantify type I error rate and statistical power in different case–control settings. In studies with the same numbers of cases and controls per country and control-to-case ratios reflecting actual rates of disease prevalence, no adjustment for population stratification was required. The unnecessary inclusion of the country of origin, PCs or AIMs as covariates in the regression models translated into increasing type I error rates. In studies with cases and controls from separate countries, no investigated method was able to adequately correct for population stratification. The first classical and the first two robust PCs achieved the lowest (although inflated) type I error, followed at some distance by the first eight $IN$-AIMs.

Download Full-text

Super-delta2: An Enhanced Differential Expression Analysis Procedure for Multi-Group Comparisons of RNA-seq Data

10.1101/2021.01.30.428977 ◽

2021 ◽

Author(s):

Zihan Cui ◽

Yuhang Liu ◽

Jinfeng Zhang ◽

Xing Qiu

Keyword(s):

Breast Cancer ◽

Differential Expression ◽

Expression Analysis ◽

Bias Correction ◽

Type I Error ◽

Breast Cancer Dataset ◽

Type I ◽

Rna Seq ◽

Differential Expression Pattern ◽

Group Comparisons

AbstractBackgroundWe developed super-delta2, a differential gene expression analysis pipeline designed for multi-group comparisons for RNA-seq data. It includes a customized one-way ANOVA F-test and a post-hoc test for pairwise group comparisons; both are designed to work with a multivariate normalization procedure to reduce technical noise. It also includes a trimming procedure with bias-correction to obtain robust and approximately unbiased summary statistics used in these tests. We demonstrated the asymptotic applicability of super-delta2 to log-transformed read counts in RNA-seq data by large sample theory based on Negative Binomial Poisson (NBP) distribution.ResultsWe compared super-delta2 with three commonly used RNA-seq data analysis methods: limma/voom, edgeR, and DESeq2 using both simulated and real datasets. In all three simulation settings, super-delta2 not only achieved the best overall statistical power, but also was the only method that controlled type I error at the nominal level. When applied to a breast cancer dataset to identify differential expression pattern associated with multiple pathologic stages, super-delta2 selected more enriched pathways than other methods, which are directly linked to the underlying biological condition (breast cancer).ConclusionsBy incorporating trimming and bias-correction in the normalization step, super-delta2 was able to achieve tight control of type I error. Because the hypothesis tests are based on asymptotic normal approximation of the NBP distribution, super-delta2 does not require computationally expensive iterative optimization procedures used by methods such as edgeR and DESeq2, which occasionally have convergence issues.

Download Full-text

Transformasi Terhadap Min bagi Menguji Taburan Terpencong

Jurnal Teknologi ◽

10.11113/jt.v37.528 ◽

2012 ◽

Author(s):

Nor Haniza Sarmin ◽

Md Hanafiah Md Zin ◽

Rasidah Hussin

Keyword(s):

Key Words ◽

Simulation Study ◽

Bias Correction ◽

Type I Error ◽

Small Sample Size ◽

Small Sample ◽

Skewed Distribution ◽

Type I ◽

Skewed Distributions ◽

Chi Square

Suatu transformasi terhadap min dilakukan menggunakan penganggar pembetulan kepincangan bagi mendapatkan statistik untuk menguji min hipotesis taburan terpencong. Penghasilan statistik ini melibatkan pengubahsuaian pemboleh ubah . Kajian simulasi yang dijalankan terhadap taburan yang terpencong iaitu taburan eksponen, kuasa dua khi dan Weibull ke atas Kebarangkalian Ralat Jenis I menunjukkan bahawa statistik t3 sesuai untuk ujian satu hujung sebelah kiri dan saiz sampel yang kecil (n=5). Kata kunci: Min; statistik; taburan terpencong; penganggar pembetulan kepincangan; kebarangkalian Ralat Jenis I A transformation of mean has been done using a bias correction estimator to produce a statistic for mean hypothesis of skewed distributions. The statistic found involves a modification of the variable . A simulation study that has been done on some skewed distributions i.e. esponential, chi-square and Weibull on the Type I Error shows that t3 is suitable for the left-tailed test and a small sample size (n=5). Key words: Mean; statistic; skewed distribution; bias correction estimator; Type I Error

Download Full-text

Estimation of odds ratios of genetic variants for the secondary phenotypes associated with primary diseases

Genetic Epidemiology ◽

10.1002/gepi.20568 ◽

2011 ◽

Vol 35 (3) ◽

pp. 190-200 ◽

Cited By ~ 24

Author(s):

Jian Wang ◽

Sanjay Shete

Keyword(s):

Genetic Variants ◽

Odds Ratios ◽

Secondary Phenotypes

Download Full-text

Further investigations of the W-test for pairwise epistasis testing

Wellcome Open Research ◽

10.12688/wellcomeopenres.11926.1 ◽

2017 ◽

Vol 2 ◽

pp. 54 ◽

Cited By ~ 1

Author(s):

Richard Howey ◽

Heather J. Cordell

Keyword(s):

Logistic Regression ◽

Computer Simulations ◽

Genetic Variants ◽

Degrees Of Freedom ◽

Type I Error ◽

Alternative Hypothesis ◽

Practical Implementation ◽

Type I ◽

Common Variants ◽

Higher Power

Background: In a recent paper, a novel W-test for pairwise epistasis testing was proposed that appeared, in computer simulations, to have higher power than competing alternatives. Application to genome-wide bipolar data detected significant epistasis between SNPs in genes of relevant biological function. Network analysis indicated that the implicated genes formed two separate interaction networks, each containing genes highly related to autism and neurodegenerative disorders. Methods: Here we investigate further the properties and performance of the W-test via theoretical evaluation, computer simulations and application to real data. Results: We demonstrate that, for common variants, the W-test is closely related to several existing tests of association allowing for interaction, including logistic regression on 8 degrees of freedom, although logistic regression can show inflated type I error for low minor allele frequencies, whereas the W-test shows good/conservative type I error control. Although in some situations the W-test can show higher power, logistic regression is not limited to tests on 8 degrees of freedom but can instead be tailored to impose greater structure on the assumed alternative hypothesis, offering a power advantage when the imposed structure matches the true structure. Conclusions: The W-test is a potentially useful method for testing for association - without necessarily implying interaction - between genetic variants disease, particularly when one or more of the genetic variants are rare. For common variants, the advantages of the W-test are less clear, and, indeed, there are situations where existing methods perform better. In our investigations, we further uncover a number of problems with the practical implementation and application of the W-test (to bipolar disorder) previously described, apparently due to inadequate use of standard data quality-control procedures. This observation leads us to urge caution in interpretation of the previously-presented results, most of which we consider are highly likely to be artefacts.

Download Full-text

Mendelian Randomization With Refined Instrumental Variables From Genetic Score Improves Accuracy and Reduces Bias

Frontiers in Genetics ◽

10.3389/fgene.2021.618829 ◽

2021 ◽

Vol 12 ◽

Author(s):

Lijuan Lin ◽

Ruyang Zhang ◽

Hui Huang ◽

Ying Zhu ◽

Yi Li ◽

...

Keyword(s):

Genetic Variants ◽

Statistical Power ◽

Complex Disease ◽

Type I Error ◽

Mendelian Randomization ◽

Causal Effect ◽

Type I ◽

Individual Data ◽

Genetic Score ◽

Mediation Effects

Mendelian randomization (MR) can estimate the causal effect for a risk factor on a complex disease using genetic variants as instrument variables (IVs). A variety of generalized MR methods have been proposed to integrate results arising from multiple IVs in order to increase power. One of the methods constructs the genetic score (GS) by a linear combination of the multiple IVs using the multiple regression model, which was applied in medical researches broadly. However, GS-based MR requires individual-level data, which greatly limit its application in clinical research. We propose an alternative method called Mendelian Randomization with Refined Instrumental Variable from Genetic Score (MR-RIVER) to construct a genetic IV by integrating multiple genetic variants based on summarized results, rather than individual data. Compared with inverse-variance weighted (IVW) and generalized summary-data-based Mendelian randomization (GSMR), MR-RIVER maintained the type I error, while possessing more statistical power than the competing methods. MR-RIVER also presented smaller biases and mean squared errors, compared to the IVW and GSMR. We further applied the proposed method to estimate the effects of blood metabolites on educational attainment, by integrating results from several publicly available resources. MR-RIVER provided robust results under different LD prune criteria and identified three metabolites associated with years of schooling and additional 15 metabolites with indirect mediation effects through butyrylcarnitine. MR-RIVER, which extends score-based MR to summarized results in lieu of individual data and incorporates multiple correlated IVs, provided a more accurate and powerful means for the discovery of novel risk factors.

Download Full-text

An optimal kernel-based method for gene set association analysis

10.1101/304055 ◽

2018 ◽

Author(s):

Tao He ◽

Shaoyu Li ◽

Ping-Shou Zhong ◽

Yuehua Cui

Keyword(s):

Genetic Variants ◽

Complex Traits ◽

Type I Error ◽

Association Studies ◽

P Value ◽

Systematic Effect ◽

Type I ◽

Genome Wide Association Studies ◽

Optimal Kernel ◽

The One

ABSTRACTSingle-variant based genome-wide association studies have successfully detected many genetic variants that are associated with many complex traits. However, their power is limited due to weak marginal signals and ignoring potential complex interactions among genetic variants. Set-based strategy was proposed to provide a remedy where multiple genetic variants in a given set (e.g., gene or pathway) are jointly evaluated, so that the systematic effect of the set is considered. Among many, the kernel-based testing (KBT) framework is one of the most popular and powerful methods in set-based association studies. Given a set of candidate kernels, method has been proposed to choose the one with the smallest p-value. Such a method, however, can yield inflated type I error, especially when the number of variants in a set is large. Alternatively one can get p-values by permutations which, however, could be very time consuming. In this work, we proposed an efficient testing procedure that can not only control type I error rate but also generate power close to the one obtained under the optimal kernel. Our method is built upon the KBT framework and is based on asymptotic results under a high-dimensional setting. Hence it can efficiently deal with the case where the number of variants in a set is much larger than the sample size. Both simulation and real data analysis demonstrate the advantages of the method compared with its counterparts.

Download Full-text

Improving the accuracy of two-sample summary data Mendelian randomization: moving beyond the NOME assumption

10.1101/159442 ◽

2017 ◽

Cited By ~ 17

Author(s):

Jack Bowden ◽

Fabiola Del Greco M ◽

Cosetta Minelli ◽

Qingyuan Zhao ◽

Debbie A Lawlor ◽

...

Keyword(s):

Genetic Variants ◽

Type I Error ◽

Mendelian Randomization ◽

Disease Risk ◽

Causal Effect ◽

Meta Analysis ◽

Type I ◽

Weak Instruments ◽

Using Data ◽

Summary Data

AbstractBackgroundTwo-sample summary data Mendelian randomization (MR) incorporating multiple genetic variants within a meta-analysis framework is a popular technique for assessing causality in epidemiology. If all genetic variants satisfy the instrumental variable (IV) and necessary modelling assumptions, then their individual ratio estimates of causal effect should be homogeneous. Observed heterogeneity signals that one or more of these assumptions could have been violated.MethodsCausal estimation and heterogeneity assessment in MR requires an approximation for the variance, or equivalently the inverse-variance weight, of each ratio estimate. We show that the most popular ‘1st order’ weights can lead to an inflation in the chances of detecting heterogeneity when in fact it is not present. Conversely, ostensibly more accurate ‘2nd order’ weights can dramatically increase the chances of failing to detect heterogeneity, when it is truly present. We derive modified weights to mitigate both of these adverse effects.ResultsUsing Monte Carlo simulations, we show that the modified weights outperform 1st and 2nd order weights in terms of heterogeneity quantification. Modified weights are also shown to remove the phenomenon of regression dilution bias in MR estimates obtained from weak instruments, unlike those obtained using 1st and 2nd order weights. However, with small numbers of weak instruments, this comes at the cost of a reduction in estimate precision and power to detect a causal effect compared to 1st order weighting. Moreover, 1st order weights always furnish unbiased estimates and preserve the type I error rate under the causal null. We illustrate the utility of the new method using data from a recent two-sample summary data MR analysis to assess the causal role of systolic blood pressure on coronary heart disease risk.ConclusionsWe propose the use of modified weights within two-sample summary data MR studies for accurately quantifying heterogeneity and detecting outliers in the presence of weak instruments. Modified weights also have an important role to play in terms of causal estimation (in tandem with 1st order weights) but further research is required to understand their strengths and weaknesses in specific settings.

Download Full-text

Associating Multivariate Traits with Genetic Variants Using Collapsing and Kernel Methods with Pedigree- or Population-Based Studies

Computational and Mathematical Methods in Medicine ◽

10.1155/2021/8812282 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Li-Chu Chien

Keyword(s):

Kernel Methods ◽

Genetic Variants ◽

Statistical Power ◽

Type I Error ◽

Population Based ◽

Error Rates ◽

Type I ◽

Omnibus Test ◽

Multifactorial Diseases ◽

Multivariate Traits

In genetic association analysis, several relevant phenotypes or multivariate traits with different types of components are usually collected to study complex or multifactorial diseases. Over the past few years, jointly testing for association between multivariate traits and multiple genetic variants has become more popular because it can increase statistical power to identify causal genes in pedigree- or population-based studies. However, most of the existing methods mainly focus on testing genetic variants associated with multiple continuous phenotypes. In this investigation, we develop a framework for identifying the pleiotropic effects of genetic variants on multivariate traits by using collapsing and kernel methods with pedigree- or population-structured data. The proposed framework is applicable to the burden test, the kernel test, and the omnibus test for autosomes and the X chromosome. The proposed multivariate trait association methods can accommodate continuous phenotypes or binary phenotypes and further can adjust for covariates. Simulation studies show that the performance of our methods is satisfactory with respect to the empirical type I error rates and power rates in comparison with the existing methods.

Download Full-text