EnsembleCNV: An ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data

2018 ◽  
Author(s):  
Zhongyang Zhang ◽  
Haoxiang Cheng ◽  
Xiumei Hong ◽  
Antonio F. Di Narzo ◽  
Oscar Franzen ◽  
...  

ABSTRACT The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV a) identifies and eliminates batch effects at the raw data level; b) assembles individual CNV calls from multiple existing callers with complementary strengths into CNV regions (CNVRs) using a heuristic algorithm; c) re-genotypes each CNVR with a local likelihood model adjusted by global information across multiple CNVRs; d) refines CNVR boundaries using the local correlation structure in copy number intensities; e) provides direct CNV genotyping accompanied by a confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while concurrently achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases.
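
The abstract does not describe the heuristic used to assemble calls into CNVRs, so the following is only a minimal sketch of one simple stand-in: pooling calls from several callers and merging any that overlap on the same chromosome. The `(chromosome, start, end)` call format and the function name `merge_calls_into_regions` are hypothetical and are not taken from ensembleCNV.

```python
from typing import List, Tuple

# Hypothetical call format: (chromosome, start, end), coordinates inclusive.
Call = Tuple[str, int, int]


def merge_calls_into_regions(calls: List[Call]) -> List[Call]:
    """Merge overlapping calls on the same chromosome into regions.

    This is a plain interval merge, used here as an assumed stand-in
    for a CNVR assembly heuristic; it is not the published algorithm.
    """
    regions: List[Call] = []
    # Sorting by (chromosome, start, end) puts overlapping calls next to each other.
    for chrom, start, end in sorted(calls):
        if regions and regions[-1][0] == chrom and start <= regions[-1][2]:
            prev_chrom, prev_start, prev_end = regions[-1]
            # Extend the current region to cover the overlapping call.
            regions[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            regions.append((chrom, start, end))
    return regions


# Calls pooled from several hypothetical callers.
pooled_calls = [
    ("chr1", 150, 300),
    ("chr1", 100, 200),
    ("chr2", 100, 200),
    ("chr1", 400, 500),
]
regions = merge_calls_into_regions(pooled_calls)
```

A real pipeline would likely also weigh reciprocal overlap and per-caller agreement before merging; this sketch ignores both.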

2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 243-244
Author(s):  
Brittany N Diehl ◽  
Andres A Pech-Cervantes ◽  
Thomas H Terrill ◽  
Ibukun M Ogunade ◽  
Owen Rae ◽  
...  

Abstract Florida Native sheep is an indigenous breed from Florida and expresses superior parasite resistance. Previous candidate and genome-wide association studies with Florida Native sheep have identified single nucleotide polymorphisms with additive and non-additive effects associated with parasite resistance. However, the role of other potential DNA variants, such as copy number variants (CNVs), in controlling this complex trait has not been evaluated. The objective of the present study was to investigate the importance of CNVs for resistance to natural Haemonchus contortus infections in Florida Native sheep. A total of 200 sheep were evaluated in the present study. Phenotypic records included fecal egg count (FEC, eggs/gram), FAMACHA score, and packed cell volume (PCV, %). Sheep were genotyped using the GGP Ovine 50K SNP chip. Copy number analysis with the univariate method was used to identify CNVs. A total of 170 animals with CNVs and phenotypic data were used for the association testing. Association tests were carried out using single linear regression with Principal Component Analysis (PCA) correction to identify CNVs associated with FEC, FAMACHA, and PCV. To confirm our results, a second round of association testing was performed using the correlation-trend test with PCA correction. CNVs were considered significant when their adjusted p-value was < 0.05 after FDR correction. A deletion CNV on chromosome 21 was associated with FEC. This DNA variant was located in intron 2 of the RAB3IL gene and overlapped a QTL associated with changes in eosinophil number. Our study demonstrated for the first time that CNVs could potentially be involved in parasite resistance in this heritage sheep breed.
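
The abstract states that p-values were FDR-corrected but does not name the procedure. As a minimal sketch, assuming the common Benjamini-Hochberg step-up method, adjusted p-values can be computed as below; the function name and the example p-values are illustrative only.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the study's "FDR correction" is the BH procedure;
    the abstract does not confirm this.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four CNVs.
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
```

With these made-up inputs only the first CNV stays below the 0.05 threshold after adjustment, which shows how the correction can remove nominally significant hits.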


2017 ◽  
Author(s):  
Zilu Zhou ◽  
Weixin Wang ◽  
Li-San Wang ◽  
Nancy Ruonan Zhang

Abstract Motivation: Copy number variations (CNVs) are gains and losses of DNA segments and have been associated with disease. Many large-scale genetic association studies are performing CNV analysis using whole exome sequencing (WES) and whole genome sequencing (WGS). In many of these studies, previous SNP-array data are available. An integrated cross-platform analysis is expected to improve resolution and accuracy, yet there is no tool for effectively combining data from sequencing and array platforms. The detection of CNVs using sequencing data alone can also be further improved by the utilization of allele-specific reads. Results: We propose a statistical framework, integrated Copy Number Variation detection algorithm (iCNV), which can be applied to multiple study designs: WES only, WGS only, SNP array only, or any combination of SNP and sequencing data. iCNV applies platform-specific normalization, utilizes allele-specific reads from sequencing, and integrates matched NGS and SNP-array data by a Hidden Markov Model (HMM). We compare integrated two-platform CNV detection using iCNV to naive intersection or union of platforms and show that iCNV increases sensitivity and robustness. We also assess the accuracy of iCNV on WGS data only, and show that the utilization of allele-specific reads improves CNV detection accuracy compared to existing methods. Availability: https://github.com/zhouzilu/[email protected], [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.


Circulation ◽  
2012 ◽  
Vol 125 (suppl_10) ◽  
Author(s):  
Nora Franceschini ◽  
Ching-Ti Liu ◽  
W Linda Kao ◽  
Leslie Lange ◽  
Kari E North ◽  
...  

Smoking is a known risk factor for progression of chronic kidney disease (CKD), but little is known of the role of smoking exposure on genetic effects of variants influencing kidney traits in the general population. We examined the evidence for effect modification of current smoking on the association of single nucleotide polymorphisms (SNPs) with estimated glomerular filtration rate (eGFR) and urine albumin to creatinine ratio (UACR), two well-established markers of kidney disease, in 23,767 white and 8,110 African American individuals from five studies genotyped using the custom SNP array ITMAT-Broad-CARe (IBC array) in the CARe consortium. We obtained study- and race-specific residuals from linear regression models of natural log-transformed eGFR or UACR regressed on age, sex and study site. We then stratified residuals by current smoking exposure and performed genome-wide association analyses using additive genetic models adjusted for 10 principal components, and accounting for family structure using mixed models, if needed. Meta-analyses across smoking-specific strata within each self-reported race were performed using inverse variance weighted fixed effect models. We assessed smoking interaction using a heterogeneity test (P<0.10) and the I² metric. Among SNPs reaching the array-wide significance threshold (2.0×10⁻⁶) for association with eGFR or UACR, there was significant between-smoking-strata heterogeneity for rs7422339 (CPS1, P=0.03, I²=77.7%) and rs13333226 (UMOD, P=0.06, I²=71.1%) for eGFR in whites, with larger decreases in eGFR among current smokers compared to past/never smokers. For UACR, the T allele of rs1801239 (missense variant of CUBN, between-smoking-strata heterogeneity P=0.09, I²=64.8%) showed less protective effect among current smokers than non-smokers in whites only. These loci have been previously identified in genome-wide association studies.
Our findings, if replicated, suggest possible important interactions of smoking exposure on the genetic effects of known loci associated with kidney traits. Funding (This research has received full or partial funding support from the American Heart Association, National Center)
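
The inverse variance weighted fixed effect meta-analysis and the heterogeneity assessment named in the abstract can be sketched as follows. This is an illustrative implementation using the standard textbook formulas (Cochran's Q and I²), not the consortium's code, and the per-stratum effect sizes and standard errors are hypothetical.

```python
import math
from typing import List, Tuple


def fixed_effect_meta(betas: List[float], ses: List[float]) -> Tuple[float, float, float, float]:
    """Inverse variance weighted fixed effect meta-analysis.

    Returns (pooled beta, pooled standard error, Cochran's Q, I2 in percent).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Cochran's Q: weighted squared deviations of each stratum from the pooled effect.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I2 is the share of variation attributed to heterogeneity, floored at zero.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, pooled_se, q, i2


def chi2_sf_one_df(q: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    Valid only for two strata (df = 1), e.g. current vs. past/never smokers.
    """
    return math.erfc(math.sqrt(q / 2.0))


# Hypothetical per-stratum estimates: (current smokers, past/never smokers).
pooled, pooled_se, q, i2 = fixed_effect_meta([0.0, 3.0], [1.0, 1.0])
p_het = chi2_sf_one_df(q)
```

Assuming two smoking strata, so that Q has one degree of freedom, an I² of 77.7% corresponds to Q of about 4.48 and a heterogeneity p-value of about 0.03, which agrees with the values reported for rs7422339.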


Genes ◽  
2019 ◽  
Vol 10 (5) ◽  
pp. 323 ◽  
Author(s):  
Whitaker ◽  
Ostrander

Each domestic dog breed is characterized by a strict set of physical and behavioral characteristics by which breed members are judged and rewarded in conformation shows. One defining feature of particular interest is the coat, which is composed of either a double or single layer of hair. The top coat contains coarse guard hairs and a softer undercoat, similar to that observed in wolves and assumed to be the ancestral state. The undercoat is absent in single-coated breeds, which is assumed to be the derived state. We leveraged single nucleotide polymorphism (SNP) array and whole genome sequence (WGS) data to perform genome-wide association studies (GWAS), identifying a locus on chromosome (CFA) 28 which is strongly associated with coat number. Using WGS data, we identified a locus of 18.4 kilobases containing 62 significant variants within the intron of a long noncoding ribonucleic acid (lncRNA) upstream of ADRB1. Multiple lines of evidence highlight the locus as a potential cis-regulatory module. Specifically, two variants are found at high frequency in single-coated dogs and are rare in wolves, and both are predicted to affect transcription factor (TF) binding. This report is among the first to exploit WGS data for both GWAS and variant mapping to identify a breed-defining trait.


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
J. M. Belanger ◽  
T. R. Famula ◽  
L. C. Gershony ◽  
M. K. Palij ◽  
A. M. Oberbauer

Abstract Background: Idiopathic epilepsy (IE) is a common neurological disorder in the domestic dog, and is defined as repeated seizure activity having no identifiable underlying cause. Some breeds, such as the Belgian shepherd dog, have a greater prevalence of the disorder. Previous studies in this and other breeds have identified ADAM23 as a gene that confers risk of IE, although additional loci are known to exist. The present study sought to identify additional loci that influence IE in the Belgian shepherd dog. Results: Genome-wide association studies (GWAS) revealed a significant association between IE and CFA 14 (p < 1.03E−08) and a suggestive association on CFA 37 (p < 2.91E−06) in a region in linkage disequilibrium with ADAM23. Logistic regression identified a two-locus model that demonstrated interaction between the two chromosomal regions that, when combined, predicted IE risk with high sensitivity. Conclusions: Two interacting loci, one each on CFAs 14 and 37, predictive of IE in the Belgian shepherd were identified. The loci are adjacent to potential candidate genes associated with neurological function. Further exploration of the region is warranted to identify causal variants underlying the association. Additionally, although the two loci were very good at predicting IE, they failed to capture all the risk, indicating additional loci or incomplete penetrance are also likely contributing to IE expression in the Belgian shepherd dog.


2021 ◽  
Author(s):  
Tomas W Fitzgerald ◽  
Ewan Birney

Copy number variation (CNV) has long been known to influence human traits and has a rich history of research into common and rare genetic disease. Although CNV is accepted as an important class of genomic variation, progress on copy number (CN) phenotype associations from next-generation sequencing (NGS) data has been limited, in part due to the relative difficulty of CNV detection and an enrichment for large numbers of false positives. To date, most successful CN genome-wide association studies (CN-GWAS) have focused on using predictive measures of dosage intolerance or gene burden tests to gain sufficient power for detecting CN effects. Here we present a novel method for large-scale CN analysis from NGS data that generates robust CN estimates and allows CN-GWAS to be performed genome-wide in discovery mode. We provide a detailed analysis in the large-scale UK Biobank resource and a specifically designed software package for deriving CN estimates from NGS data that are robust enough to be used for CN-GWAS. We use these methods to perform genome-wide CN-GWAS analysis across 78 human traits, discovering 862 genetic associations that are likely to contribute strongly to trait distributions based solely on their CN or by acting in concert with other genetic variation. Finally, we undertake an analysis comparing CNV and SNP association signals across the same traits and samples, defining specific CNV association classes based on whether they could be detected using standard SNP-GWAS in the UK Biobank.


2020 ◽  
Author(s):  
Mika Sakurai-Yageta ◽  
Kazuki Kumada ◽  
Chinatsu Gocho ◽  
Satoshi Makino ◽  
Akira Uruno ◽  
...  

Abstract Background: Increasing the power of genome-wide association studies in diverse populations is important for understanding the genetic determinants of disease risks, and large-scale genotype data are collected by genome cohort and biobank projects all over the world. In particular, ethnic-specific SNP arrays are becoming more important because the use of universal SNP arrays has some limitations in terms of cost-effectiveness and throughput. As part of the Tohoku Medical Megabank Project, which integrates prospective genome cohorts into a biobank, we have been developing a series of Japonica Arrays for genotyping participants based on reference panels constructed from whole-genome sequence data of the Japanese population. Results: We designed a novel version of the SNP Array for the Japanese population, called Japonica Array NEO, comprising a total of 666,883 SNPs, including tag SNPs of autosomes and X chromosome with pseudoautosomal regions, SNPs of Y chromosome and mitochondria, and known disease risk SNPs. Among them, 654,246 tag SNPs were selected from an expanded reference panel of 3,552 Japanese using pairwise r² of linkage disequilibrium measures. Moreover, 28,298 SNPs were included for the evaluation of previously identified disease risk SNPs from the literature and databases, and those present in the Japanese population were extracted using the reference panel. The imputation performance of Japonica Array NEO was assessed by genotyping 286 Japanese samples. We found that the imputation quality r² and INFO score in the minor allele frequency bin >2.5%–5% were >0.9 and >0.8, respectively, and >12 million markers were imputed with an INFO score >0.8. After verification, Japonica Arrays were used to efficiently genotype cohort participants from the sample selection to perform a quality assessment of the raw data; approximately 130,000 genotyping data of >150,000 participants has already been obtained.
Conclusions: Japonica Array NEO is a promising tool for genotyping the Japanese population with genome-wide coverage, contributing to the development of genetic risk scores for this population and further identifying disease risk alleles among individuals of East Asian ancestry.
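
The pairwise r² linkage disequilibrium measure used for tag SNP selection can be illustrated with a small sketch. It assumes phased haplotypes coded as 0/1 alleles at two biallelic SNPs, which is a simplification; the abstract does not describe how r² was computed from the reference panel, and the function name and example data are hypothetical.

```python
from typing import List, Tuple


def pairwise_r2(haplotypes: List[Tuple[int, int]]) -> float:
    """Compute LD r^2 between two biallelic SNPs from phased haplotypes.

    Each haplotype is (allele at SNP A, allele at SNP B), coded 0 or 1.
    Assumption: phase is known; real panels may need phasing first.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom


# Perfectly correlated SNPs: either could tag the other.
r2_linked = pairwise_r2([(1, 1), (1, 1), (0, 0), (0, 0)])
# Independent SNPs: neither carries information about the other.
r2_unlinked = pairwise_r2([(1, 1), (1, 0), (0, 1), (0, 0)])
```

A tag SNP selection procedure would typically keep one SNP per group of markers whose pairwise r² exceeds a chosen threshold; the threshold used for Japonica Array NEO is not given in the abstract.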

