Robust Linear Trend Test for Low-Coverage Next-Generation Sequence Data Controlling for Covariates

Mathematics ◽  
2020 ◽  
Vol 8 (2) ◽  
pp. 217
Author(s):  
Jung Yeon Lee ◽  
Myeong-Kyu Kim ◽  
Wonkuk Kim

Low-coverage next-generation sequencing experiments assisted by statistical methods are popular in genetic association studies. Next-generation sequencing experiments produce genotype data that include allele read counts and read depths. At low sequencing depths, the genotypes tend to be highly uncertain; therefore, uncertain genotypes are usually removed or imputed before a statistical analysis is performed, which may inflate the type I error rate and reduce statistical power. In this paper, we propose a mixture-based penalized score association test that adjusts for non-genetic covariates. The proposed score test statistic is based on a sandwich variance estimator, so it is robust to misspecification of the model relating the covariates to the latent genotypes. The proposed method has the advantage of requiring neither external imputation nor elimination of uncertain genotypes. The results of our simulation study show that the type I error rates are well controlled and that the proposed association test has reasonable statistical power. As an illustration, we apply our statistic to pharmacogenomics data on drug responsiveness among 400 epilepsy patients.
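As a rough illustration of the robust score test described above, the sketch below implements a covariate-adjusted score statistic with a sandwich variance estimator for a linear working model. The function name, the linear working model and the use of an expected genotype dosage are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of a covariate-adjusted score test with a sandwich
# (robust) variance estimator. Assumed, illustrative setup: a linear
# working model and a genotype dosage g (possibly an expected dosage
# under genotype uncertainty).
import numpy as np
from scipy.stats import chi2

def robust_score_test(y, g, X):
    """Score test for H0: no genotype effect, adjusting for covariates.

    y : (n,) phenotype; g : (n,) genotype dosage;
    X : (n, p) covariates including an intercept column.
    """
    # Fit the null model y ~ X by least squares.
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta0
    # Project the genotype onto the orthogonal complement of X.
    gamma, *_ = np.linalg.lstsq(X, g, rcond=None)
    g_tilde = g - X @ gamma
    # Score for the genotype effect under H0.
    U = g_tilde @ resid
    # Sandwich variance: robust to misspecification of Var(y | X, g).
    V = np.sum((g_tilde * resid) ** 2)
    z2 = U ** 2 / V
    return z2, chi2.sf(z2, df=1)  # statistic and p-value, ~chi2(1) under H0
```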

2012 ◽  
Vol 2012 ◽  
pp. 1-19 ◽  
Author(s):  
Tao Wang ◽  
Chang-Yun Lin ◽  
Yuanhao Zhang ◽  
Ruofeng Wen ◽  
Kenny Ye

Next-generation sequencing (NGS) is a revolutionary technology for biomedical research. One highly cost-efficient application of NGS is to detect disease association based on pooled DNA samples. However, several key issues need to be addressed for pooled NGS. One of them is the high sequencing error rate and its high variability across genomic positions and experimental runs, which, if not properly accounted for in the experimental design and analysis, could lead to either inflated false-positive rates or a loss of statistical power. Another important issue is how to test the association of a group of rare variants. To address the first issue, we propose a new blocked pooling design in which multiple pools of DNA samples from cases and controls are sequenced together on the same NGS functional units. To address the second issue, we propose a testing procedure that does not require individual genotypes but instead takes advantage of multiple DNA pools. Through a simulation study, we demonstrate that our approach provides good control of the type I error rate and yields satisfactory power compared with the test based on individual genotypes. Our results also provide guidelines for designing an efficient pooled NGS study.
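To make the role of replicate pools concrete, here is a minimal sketch of a variant-level test that compares per-pool allele-frequency estimates between case and control pools. The Welch t-test across pools is an illustrative stand-in for the paper's procedure; it shows how multiple pools let pool-to-pool variability, including sequencing error, be estimated empirically rather than from a read-level binomial model alone.

```python
# Illustrative sketch (not the paper's exact procedure) of testing a
# case-control allele-frequency difference from multiple DNA pools.
import numpy as np
from scipy import stats

def pooled_freq(alt_reads, depths):
    """Per-pool allele-frequency estimates from read counts."""
    return np.asarray(alt_reads) / np.asarray(depths)

def pool_t_test(case_alt, case_depth, ctrl_alt, ctrl_depth):
    f_case = pooled_freq(case_alt, case_depth)
    f_ctrl = pooled_freq(ctrl_alt, ctrl_depth)
    # Welch's t-test across pools; replicate pools absorb the extra
    # variability that a read-level binomial test would ignore.
    return stats.ttest_ind(f_case, f_ctrl, equal_var=False)

# Example: four case pools and four control pools at one variant site.
res = pool_t_test([52, 61, 48, 57], [500, 520, 480, 510],
                  [40, 38, 45, 41], [505, 495, 500, 515])
print(res.statistic, res.pvalue)
```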


Biostatistics ◽  
2018 ◽  
Vol 21 (3) ◽  
pp. 577-593
Author(s):  
Sixing Chen ◽  
Xihong Lin

Summary With the advent of next-generation sequencing, investigators have access to higher-quality sequencing data. However, sequencing all samples in a study with next-generation sequencing can still be prohibitively expensive. One potential remedy is to combine next-generation sequencing data from cases with publicly available sequencing data for controls, but there may be systematic differences in the quality of the sequenced data, such as sequencing depths, between the study cases and the publicly available controls. We propose a regression calibration (RC)-based method and a maximum-likelihood method for conducting an association study with such a combined sample, accounting for differential sequencing errors between cases and controls. The methods allow adjustment for covariates, such as population stratification, as confounders. Both methods control the type I error and have power comparable to an analysis conducted using the true genotypes with sufficiently high but different sequencing depths. We show that under certain circumstances the RC method permits analysis with a naive variance estimate (which closely approximates the true variance in practice) and standard software. We evaluate the performance of the proposed methods using simulation studies and apply them to a combined data set of exome-sequenced acute lung injury cases and healthy controls from the 1000 Genomes Project.
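The regression-calibration idea can be sketched as follows: replace the unobserved genotype with its posterior expectation given the reads, then analyze that dosage with standard software. The binomial read model with a per-base error rate eps and the Hardy-Weinberg prior below are assumptions made for illustration, not the paper's exact likelihood.

```python
# Minimal sketch of the calibration step: compute E[G | reads] for a
# biallelic site and use it as the genotype dosage in a standard
# regression. The read model (binomial with per-base error eps) and
# HWE prior are illustrative assumptions.
import numpy as np
from scipy.stats import binom

def expected_dosage(alt_reads, depth, maf, eps=0.01):
    """Posterior mean genotype given read counts at one site."""
    g = np.array([0, 1, 2])
    prior = np.array([(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2])
    # P(alt base call | G): error-perturbed allele fractions.
    p_alt = np.array([eps, 0.5, 1 - eps])
    lik = binom.pmf(alt_reads, depth, p_alt)
    post = prior * lik
    post /= post.sum()
    return float(g @ post)

# A low-depth heterozygote-like read pattern yields a dosage near 1.
print(expected_dosage(alt_reads=2, depth=4, maf=0.3))
```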


2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Shunqiao Feng ◽  
Lin Han ◽  
Mei Yue ◽  
Dixiao Zhong ◽  
Jing Cao ◽  
...  

Abstract Background Langerhans cell histiocytosis (LCH) is a rare neoplastic disease that occurs in both children and adults, and BRAF V600E is detected in up to 64% of patients. Several studies have discussed associations between the BRAF V600E mutation and clinicopathological manifestations, but no clear conclusions have been drawn regarding the clinical significance of the mutation in pediatric patients. Results We retrieved the clinical information for 148 pediatric LCH patients and investigated the BRAF V600E mutation using next-generation sequencing alone or combined with droplet digital PCR. The overall positive rate of BRAF V600E was 60/148 (41%). The type of sample (peripheral blood or formalin-fixed paraffin-embedded tissue) used for testing was significantly associated with BRAF V600E mutation status (both p &lt; 0.001). The risk of recurrence declined in patients who received targeted therapy (p-value = 0.006; hazard ratio 0.164, 95% CI: 0.046 to 0.583). However, no correlation was found between BRAF V600E status and gender, age, stage, specific organ involvement, TP53 mutation status, masses close to the lesion or recurrence. Conclusions This is the largest pediatric LCH study conducted in a Chinese population to date. BRAF V600E in LCH may occur less often in East Asian populations than in other ethnic groups, regardless of age. Biopsy tissue is the more sensitive sample type for BRAF mutation screening because not all circulating DNA is tumor-derived. Approaches with a low limit of detection or high sensitivity are recommended for mutation screening to avoid type I and type II errors.


2021 ◽  
Author(s):  
Michael Schneider ◽  
Asis Shrestha ◽  
Agim Ballvora ◽  
Jens Leon

Abstract Background The identification of environment-specific alleles and the observation of evolutionary processes are goals of conservation genomics. Generational changes in the allele frequencies of populations allow questions regarding effective population size, gene flow, drift and selection to be addressed. Observing such effects is often a trade-off between cost and resolution, since a sizeable sample of genotypes must be genotyped at many loci. Pool genotyping approaches can achieve high resolution and precision in allele frequency estimation when high-coverage sequencing is used, but high-coverage pool sequencing of large genomes comes at a high cost. Results Here we present a reliable method for estimating a barley population's allele frequencies from low-coverage sequencing. Three hundred genotypes were sampled from a barley backcross population to estimate the entire population's allele frequencies. The accuracy and yield of allele frequency estimation were compared for three next-generation sequencing methods. To obtain accurate allele frequency estimates at low sequencing coverage, a haplotyping approach was applied: low-coverage allele frequencies of positionally connected single polymorphisms were aggregated into a single haplotype allele frequency, resulting in 2 to 271 times higher depth and increased precision. We compared different haplotyping tactics, showing that gene- and chip-marker-based haplotypes perform on par with or better than simple contig haplotype windows. Comparison of multiple pool samples and referencing against an individual sequencing approach showed that whole-genome pool resequencing had the highest correlation with individual genotyping (up to 0.97), while transcriptomics and genotyping-by-sequencing showed higher error rates and lower correlations. Conclusion The proposed method makes it possible to identify population allele frequencies with high accuracy at low cost. This is particularly interesting for conservation genomics in species with large genomes, such as barley or wheat. Whole-genome low-coverage resequencing at 10x coverage can deliver a highly accurate estimate of allele frequencies when a loci-based haplotyping approach is applied. Using annotated haplotypes makes it possible to capitalize on biological background knowledge and statistical robustness.
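A minimal sketch of the haplotype aggregation step described above: in a biparental backcross, physically linked SNPs within a gene- or chip-marker-defined haplotype carry the same parental allele, so their low-coverage read counts can be summed into one deeper, more precise frequency estimate. The input layout and names are illustrative assumptions.

```python
# Sketch of haplotype aggregation for pooled low-coverage data:
# sum per-SNP read counts within one haplotype block (all SNPs
# oriented so 'alt' is the same parental allele) into a single,
# deeper allele-frequency estimate.
import numpy as np

def haplotype_allele_freq(alt_reads, depths):
    """alt_reads, depths: per-SNP pool counts for one haplotype block."""
    alt = np.sum(alt_reads)
    dp = np.sum(depths)
    freq = alt / dp
    # Binomial standard error shrinks with the aggregated depth.
    se = np.sqrt(freq * (1 - freq) / dp)
    return freq, dp, se

# Five SNPs at ~2x-6x each aggregate to 19x effective depth.
print(haplotype_allele_freq([1, 2, 0, 3, 1], [3, 5, 2, 6, 3]))
```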


2016 ◽  
Author(s):  
Peizhou Liao ◽  
Glen A. Satten ◽  
Yi-juan Hu

Abstract A fundamental challenge in analyzing next-generation sequencing data is to determine an individual’s genotype correctly as the accuracy of the inferred genotype is essential to downstream analyses. Some genotype callers, such as GATK and SAMtools, directly calculate the base-calling error rates from phred scores or recalibrated base quality scores. Others, such as SeqEM, estimate error rates from the read data without using any quality scores. It is also a common quality control procedure to filter out reads with low phred scores. However, choosing an appropriate phred score threshold is problematic as a too-high threshold may lose data while a too-low threshold may introduce errors. We propose a new likelihood-based genotype-calling approach that exploits all reads and estimates the per-base error rates by incorporating phred scores through a logistic regression model. The algorithm, which we call PhredEM, uses the Expectation-Maximization (EM) algorithm to obtain consistent estimates of genotype frequencies and logistic regression parameters. We also develop a simple, computationally efficient screening algorithm to identify loci that are estimated to be monomorphic, so that only loci estimated to be non-monomorphic require application of the EM algorithm. We evaluate the performance of PhredEM using both simulated data and real sequencing data from the UK10K project. The results demonstrate that PhredEM is an improved, robust and widely applicable genotype-calling approach for next-generation sequencing studies. The relevant software is freely available.
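A stripped-down sketch of the EM idea behind PhredEM, at a single biallelic locus: per-base error rates come directly from phred scores via err = 10^(-Q/10), and the EM algorithm alternates between posterior genotype probabilities and genotype-frequency updates. The logistic-regression recalibration and the monomorphic-locus screening of the actual method are omitted here for brevity.

```python
# Simplified EM for genotype frequencies at one biallelic locus,
# with per-base error rates fixed at err = 10**(-Q/10). This is an
# illustration of the EM core only, not the full PhredEM algorithm.
import numpy as np

def read_lik(bases, quals, g):
    """P(reads | genotype g) for one individual; bases: 1=alt, 0=ref."""
    err = 10.0 ** (-np.asarray(quals) / 10.0)
    p_alt = {0: err, 1: np.full_like(err, 0.5), 2: 1.0 - err}[g]
    b = np.asarray(bases)
    return np.prod(np.where(b == 1, p_alt, 1.0 - p_alt))

def em_genotype_freqs(reads, n_iter=50):
    """reads: list of (bases, quals) per individual."""
    freqs = np.array([1 / 3, 1 / 3, 1 / 3])       # genotype frequencies
    lik = np.array([[read_lik(b, q, g) for g in (0, 1, 2)]
                    for b, q in reads])            # n x 3, fixed over EM
    for _ in range(n_iter):
        post = freqs * lik                         # E-step: posteriors
        post /= post.sum(axis=1, keepdims=True)
        freqs = post.mean(axis=0)                  # M-step: frequencies
    return freqs, post

# Two individuals with a few reads each (1 = alt base call).
reads = [([1, 1, 0], [30, 20, 25]), ([0, 0, 0], [35, 30, 28])]
print(em_genotype_freqs(reads)[0])
```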


2019 ◽  
Vol 21 (3) ◽  
pp. 753-761 ◽  
Author(s):  
Regina Brinster ◽  
Dominique Scherer ◽  
Justo Lorenzo Bermejo

Abstract Population stratification is usually corrected for by relying on principal component analysis (PCA) of genome-wide genotype data, even in populations considered genetically homogeneous, such as Europeans. The need to genotype only a small number of genetic variants that show large differences in allele frequency among subpopulations (so-called ancestry-informative markers, AIMs) instead of the whole genome for stratification adjustment could represent an advantage for replication studies and candidate gene/pathway studies. Here we compare the correction performance of classical and robust principal components (PCs) with the use of AIMs selected according to four different methods: the informativeness for assignment measure (In-AIMs), the combination of PCA and F-statistics, the PCA-correlated measure and the PCA weighted loadings for each genetic variant. We used real genotype data from the Population Reference Sample and The Cancer Genome Atlas to simulate European genetic association studies and to quantify the type I error rate and statistical power in different case-control settings. In studies with the same numbers of cases and controls per country and control-to-case ratios reflecting actual rates of disease prevalence, no adjustment for population stratification was required. The unnecessary inclusion of the country of origin, PCs or AIMs as covariates in the regression models translated into increased type I error rates. In studies with cases and controls from separate countries, no investigated method was able to adequately correct for population stratification. The first classical and the first two robust PCs achieved the lowest (although inflated) type I error, followed at some distance by the first eight In-AIMs.
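For reference, the standard PCA correction that the comparison builds on can be sketched as follows: compute the leading PCs of a centered and scaled genotype matrix via SVD and include them as covariates in a logistic regression of case-control status. statsmodels is an illustrative software choice here, not the study's software.

```python
# Minimal sketch of PC-adjusted association testing. Assumed inputs:
# G is an n_individuals x n_variants genotype dosage matrix, g the
# variant being tested, y the case-control indicator.
import numpy as np
import statsmodels.api as sm

def top_pcs(G, k=2):
    """Leading PC scores of a centered, scaled genotype matrix."""
    Z = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-12)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k] * s[:k]

def pc_adjusted_test(y, g, G, k=2):
    """Test one variant with the top k PCs as stratification covariates."""
    X = sm.add_constant(np.column_stack([g, top_pcs(G, k)]))
    fit = sm.Logit(y, X).fit(disp=0)
    return fit.params[1], fit.pvalues[1]   # effect and p for the variant
```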


2018 ◽  
Vol 20 (6) ◽  
pp. 2055-2065 ◽  
Author(s):  
Johannes Brägelmann ◽  
Justo Lorenzo Bermejo

Abstract Technological advances and reduced costs of high-density methylation arrays have led to an increasing number of association studies on the possible relationship between human disease and epigenetic variability. DNA samples from peripheral blood or other tissue types are analyzed in epigenome-wide association studies (EWAS) to detect methylation differences related to a particular phenotype. Since information on the cell-type composition of the sample is generally not available and methylation profiles are cell-type-specific, statistical methods have been developed to adjust for cell-type heterogeneity in EWAS. In this study we systematically compared five popular adjustment methods: the factored spectrally transformed linear mixed model (FaST-LMM-EWASher), the sparse principal component analysis algorithm ReFACTor, surrogate variable analysis (SVA), independent SVA (ISVA) and an optimized version of SVA (SmartSVA). We used real data and applied a multilayered simulation framework to assess the type I error rate, the statistical power and the quality of estimated methylation differences according to major study characteristics. While all five adjustment methods improved false-positive rates compared with unadjusted analyses, FaST-LMM-EWASher yielded the lowest type I error rate at the expense of low statistical power. SVA efficiently corrected for cell-type heterogeneity in EWAS with up to 200 cases and 200 controls, but did not control type I error rates in larger studies. Results based on real data sets confirmed the simulation findings, with the strongest control of type I error rates by FaST-LMM-EWASher and SmartSVA. Overall, ReFACTor, ISVA and SmartSVA showed comparably good statistical power, quality of estimated methylation differences and runtime.
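The shared idea behind these adjustment methods can be caricatured in a few lines: infer latent axes of cell-type composition from the methylation matrix itself and include them as covariates in the EWAS model. The plain SVD of phenotype-residualized data below is a deliberate simplification for illustration and does not reproduce any one of the five compared algorithms.

```python
# Schematic latent-covariate adjustment for cell-type heterogeneity:
# residualize methylation on the phenotype, then take leading
# left-singular vectors of the residuals as surrogate covariates.
# A simplification in the spirit of SVA, not any method's algorithm.
import numpy as np

def latent_covariates(M, y, k=5):
    """M: n_samples x n_CpGs methylation matrix; y: phenotype vector."""
    X = np.column_stack([np.ones_like(y, dtype=float), y])
    # Remove the phenotype signal from every CpG simultaneously.
    beta, *_ = np.linalg.lstsq(X, M, rcond=None)
    R = M - X @ beta
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    return U[:, :k]   # include these as covariates in the EWAS model
```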


2017 ◽  
Vol 20 (1) ◽  
pp. 13-20 ◽  
Author(s):  
SD Ulusal ◽  
H Gürkan ◽  
E Atlı ◽  
SA Özal ◽  
M Çiftdemir ◽  
...  

Abstract Neurofibromatosis type 1 (NF1) is a multisystemic autosomal dominant neurocutaneous disorder predisposing patients to benign and/or malignant lesions, predominantly of the skin, nervous system and bone. Loss-of-function mutations or deletions of the NF1 gene are responsible for NF1 disease. The variety of pathogenic variants, the size of the gene and the presence of pseudogenes make it difficult to analyze. We aimed to report the results of 2 years of multiplex ligation-dependent probe amplification (MLPA) and next-generation sequencing (NGS) for the genetic diagnosis of NF1 at our genetic diagnosis center. MLPA, semiconductor sequencing and Sanger sequencing were performed on genomic DNA samples from 24 unrelated patients, and their affected family members, referred to our center with suspected NF1. In total, three novel and 12 known pathogenic variants and one whole-gene deletion were identified. We suggest that next-generation sequencing is a practical tool for the genetic analysis of NF1. Deletion/duplication analysis with MLPA may also be helpful for patients clinically diagnosed with NF1 who do not have a detectable mutation by NGS.


2017 ◽  
Vol 284 (1851) ◽  
pp. 20161850 ◽  
Author(s):  
Nick Colegrave ◽  
Graeme D. Ruxton

A common approach to the analysis of experimental data across much of the biological sciences is test-qualified pooling, in which non-significant terms are dropped from a statistical model, effectively pooling the variation associated with each removed term with the error term used to test hypotheses (or estimate effect sizes). This pooling is carried out only if statistical testing of a previous, more complicated model fitted to the same data provides motivation for the model simplification; hence the pooling is test-qualified. In pooling, the researcher increases the degrees of freedom of the error term with the aim of increasing the statistical power to test the hypotheses of interest. Although this approach is widely adopted and explicitly recommended by some of the most widely cited statistical textbooks aimed at biologists, here we argue that (except in highly specialized circumstances that we identify) the hoped-for improvement in statistical power will be small or non-existent, and the reliability of the statistical procedures is likely to be much reduced through deviation of type I error rates from nominal levels. We therefore call for greatly reduced use of test-qualified pooling across experimental biology, more careful justification of any continued use, and a different philosophy for the initial selection of statistical models in the light of this change in procedure.
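A small simulation can make the concern tangible. Under a true null for the treatment and a moderate nuisance block effect, the sketch below applies a test-qualified pooling rule (drop the block term when it tests non-significant) and tracks the treatment's empirical type I error against the nominal 5%. The design, effect sizes and pooling rule are illustrative choices, not the paper's examples.

```python
# Simulate test-qualified pooling under a true treatment null and
# report the empirical rejection rate for the treatment term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
rejections = 0
n_sim = 2000
for _ in range(n_sim):
    d = pd.DataFrame({
        "treat": np.repeat(["a", "b"], 20),
        "block": np.tile(np.repeat(["x", "y"], 10), 2),
    })
    # Moderate block effect, no treatment effect (null is true).
    d["y"] = rng.normal(size=40) + 0.5 * (d["block"] == "y")
    fit = smf.ols("y ~ treat + block", data=d).fit()
    if anova_lm(fit).loc["block", "PR(>F)"] > 0.05:
        # Test-qualified pooling: drop block, pooling it into error.
        fit = smf.ols("y ~ treat", data=d).fit()
    rejections += anova_lm(fit).loc["treat", "PR(>F)"] < 0.05
print(rejections / n_sim)   # compare with the nominal 0.05
```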

