“Gap hunting” to characterize clustered probe signals in Illumina methylation array data

2016 ◽  
Author(s):  
Shan V. Andrews ◽  
Christine Ladd-Acosta ◽  
Andrew P. Feinberg ◽  
Kasper D. Hansen ◽  
M. Daniele Fallin

Abstract. Background: The Illumina 450K array has been widely used in epigenetic association studies. Current quality-control (QC) pipelines typically remove certain sets of probes, such as those containing a SNP or with multiple mapping locations. An additional set of potentially problematic probes comprises those with DNA methylation (DNAm) distributions characterized by two or more distinct clusters separated by gaps. Data-driven identification of such probes may offer additional insights for downstream analyses. Results: We developed a procedure, termed “gap hunting”, to identify probes showing clustered distributions. Among 590 peripheral blood samples from the Study to Explore Early Development, we identified 11,007 “gap probes”. The vast majority (9,199) are likely attributable to an underlying SNP(s) or other variant in the probe, although some SNP-affected probes do not produce a gap signal. Specific factors predict which SNPs lead to gap signals, including type of nucleotide change, probe type, DNA strand, and overall methylation state. These expected effects are demonstrated in paired genotype and 450K data on the same samples. Gap probes can also serve as a surrogate for the local genetic sequence on a haplotype scale and can be used to adjust for population stratification. Conclusions: The characteristics of gap probes reflect potentially informative biology. QC pipelines may benefit from an efficient data-driven approach that “flags” gap probes, rather than filtering such probes, followed by careful interpretation of downstream association analyses. Our results should translate directly to the recently released Illumina 850K EPIC array given the similar chemistry and content design.
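The idea of flagging probes whose methylation values fall into clusters separated by gaps can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names and the `threshold` and `out_cutoff` defaults are illustrative assumptions, since the abstract does not state the gap size or how outlier-driven signals are handled.

```python
from typing import List, Sequence


def find_gap_groups(betas: Sequence[float], threshold: float = 0.05) -> List[List[float]]:
    """Split one probe's methylation values into groups separated by gaps.

    `threshold` (assumed value) is the minimum difference between adjacent
    sorted values that counts as a gap.
    """
    ordered = sorted(betas)
    if not ordered:
        return []
    groups = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > threshold:
            groups.append([cur])      # a gap starts a new cluster
        else:
            groups[-1].append(cur)    # same cluster as the previous value
    return groups


def is_gap_probe(betas: Sequence[float],
                 threshold: float = 0.05,
                 out_cutoff: float = 0.01) -> bool:
    """Flag a probe as a gap probe under the assumptions above.

    `out_cutoff` (assumed value) is the fraction of samples that must lie
    outside the largest cluster; below it, the gap is treated as driven
    by a few outliers and the probe is not flagged.
    """
    groups = find_gap_groups(betas, threshold)
    if len(groups) < 2:
        return False
    largest = max(len(g) for g in groups)
    return (len(betas) - largest) / len(betas) > out_cutoff
```

A probe with values clustered near 0.1, 0.5, and 0.9 (the pattern one might expect from an underlying SNP) would be flagged, whereas a probe with a single tight cluster would not. Flagging rather than dropping keeps the probe available for later interpretation, in line with the abstract's recommendation.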

PLoS Genetics ◽  
2021 ◽  
Vol 17 (1) ◽  
pp. e1009315
Author(s):  
Ardalan Naseri ◽  
Junjie Shi ◽  
Xihong Lin ◽  
Shaojie Zhang ◽  
Degui Zhi

Inference of relationships from whole-genome genetic data of a cohort is a crucial prerequisite for genome-wide association studies. Typically, relationships are inferred by computing the kinship coefficient (ϕ) and the genome-wide probability of zero IBD sharing (π0) for all pairs of individuals. Current leading methods are based on pairwise comparisons, which may not scale up to very large cohorts (e.g., sample size >1 million). Here, we propose an efficient relationship inference method, RAFFI. RAFFI leverages the efficient RaPID method to call IBD segments first, then estimates ϕ and π0 from the detected IBD segments. This inference is achieved by a data-driven approach that adjusts the estimation based on phasing quality and genotyping quality. Using simulations, we showed that RAFFI is robust against phasing/genotyping errors, admix events, and varying marker densities, and achieves higher accuracy than KING, the current leading method, especially for more distant relatives. When applied to the phased UK Biobank data with ~500K individuals, RAFFI is approximately 18 times faster than KING. We expect RAFFI will offer fast and accurate relatedness inference for even larger cohorts.
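To make the two quantities concrete, the following Python sketch computes ϕ and π0 from the total lengths of IBD1 and IBD2 sharing using the standard relation ϕ = k1/4 + k2/2, where k1 and k2 are the genome fractions shared IBD1 and IBD2. It is only an illustration under stated assumptions: the function name and inputs are hypothetical, and it omits the data-driven adjustment for phasing and genotyping quality that the abstract attributes to RAFFI.

```python
from typing import Tuple


def relatedness_from_ibd(ibd1_length_cm: float,
                         ibd2_length_cm: float,
                         genome_length_cm: float) -> Tuple[float, float]:
    """Return (phi, pi0) for one pair of individuals.

    Inputs are assumed to be total genetic lengths (in cM) shared IBD1
    and IBD2, and the total genome length covered by the markers.
    """
    if genome_length_cm <= 0:
        raise ValueError("genome_length_cm must be positive")
    k1 = ibd1_length_cm / genome_length_cm   # fraction shared IBD1
    k2 = ibd2_length_cm / genome_length_cm   # fraction shared IBD2
    pi0 = 1.0 - k1 - k2                      # fraction with zero IBD sharing
    phi = k1 / 4.0 + k2 / 2.0                # kinship coefficient
    return phi, pi0
```

Under these definitions, a parent-child pair (IBD1 across the whole genome) gives ϕ = 0.25 and π0 = 0, while full siblings with the expected sharing (half IBD1, a quarter IBD2) give ϕ = 0.25 and π0 = 0.25, which is why both quantities are needed to tell the two relationships apart.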


BMC Genomics ◽  
2013 ◽  
Vol 14 (1) ◽  
pp. 293 ◽  
Author(s):  
Ruth Pidsley ◽  
Chloe C Y Wong ◽  
Manuela Volta ◽  
Katie Lunnon ◽  
Jonathan Mill ◽  
...  

2021 ◽  
Author(s):  
Matthew Hotradat

Ventricular arrhythmias (VA) are dangerous pathophysiological conditions affecting the heart that evolve over time, resulting in different manifestations such as ventricular tachycardia (VT), organized ventricular fibrillation (OVF), and disorganized ventricular fibrillation (DVF). Success of resuscitation is greatly affected by the type of VA and the swift administration of appropriate therapy. This thesis attempts to arrive at computationally efficient, data-driven approaches for classifying and tracking VAs over time for two purposes: (1) ‘in-hospital’ scenarios for planning long-term therapy options, and (2) ‘out-of-hospital’ scenarios for tracking progression/segregation of VAs in near real-time. Using a database of 61 60-s ECG VA segments, maximum classification accuracies of 96.7% (AUC=0.993) and 87% (AUC=0.968) were achieved for VT vs. VF and OVF vs. DVF classification, respectively, for ‘in-hospital’/offline analysis. Two near real-time approaches were also developed for ‘out-of-hospital’ VA incidents, with results demonstrating high potential to track VA progression and segregation over time.


2021 ◽  
Author(s):  
Cecilia E Thomas ◽  
Leo Dahl ◽  
Sanna Byström ◽  
Yan Chen ◽  
Mathias Uhlén ◽  
...  

Background: Risk prediction is crucial for early detection and prognosis of breast cancer. Circulating plasma proteins could provide a valuable source to increase the validity of risk prediction models; however, no such markers have yet been identified for clinical use. Methods: EDTA plasma samples from 183 breast cancer cases and 366 age-matched controls were collected prior to diagnosis from the Swedish breast cancer cohort KARMA. The samples were profiled on 700 circulating proteins using an exploratory affinity proteomics approach. Linear association analyses were performed on case-control status, and a data-driven analysis strategy was applied to cluster the women on their plasma proteome profiles in an unsupervised manner. The resulting clusters were subsequently annotated for differences in phenotypic characteristics, clinical parameters, and genetic risk. Results: Using the data-driven approach, we identified five clusters with distinct proteomic plasma profiles. Women in a particular sub-group (cluster 1) were significantly more likely to have used menopausal hormonal therapy (MHT), more likely to get a breast cancer diagnosis, and older than women in the remaining clusters. The levels of circulating proteins in cluster 1 were decreased for proteins related to DNA repair and cell replication and increased for proteins related to mammographic density and female tissues. In contrast, classical dichotomous case-control analyses did not reveal any proteins significantly associated with future breast cancer. Conclusion: Using a data-driven approach, we identified a subset of women with circulating proteins associated with previous use of MHT and risk of breast cancer. Our findings point to the potential long-lasting effects of MHT on the circulating proteome even after ending the treatment, and hence provide valuable insights concerning risk prediction of breast cancer.


2012 ◽  
Author(s):  
Michael Ghil ◽  
Mickael D. Chekroun ◽  
Dmitri Kondrashov ◽  
Michael K. Tippett ◽  
Andrew Robertson ◽  
...  
