“Gap hunting” to characterize clustered probe signals in Illumina methylation array data

Mapping Intimacies ◽

10.1101/059659 ◽

2016 ◽

Author(s):

Shan V. Andrews ◽

Christine Ladd-Acosta ◽

Andrew P. Feinberg ◽

Kasper D. Hansen ◽

M. Daniele Fallin

Keyword(s):

Association Studies ◽

Data Driven ◽

Probe Type ◽

450K Array ◽

Methylation Array ◽

Association Analyses ◽

Data Driven Approach ◽

Efficient Data ◽

DNA Strand ◽

Clustered Distributions

Abstract. Background: The Illumina 450K array has been widely used in epigenetic association studies. Current quality-control (QC) pipelines typically remove certain sets of probes, such as those containing a SNP or with multiple mapping locations. An additional set of potentially problematic probes consists of those with DNA methylation (DNAm) distributions characterized by two or more distinct clusters separated by gaps. Data-driven identification of such probes may offer additional insights for downstream analyses. Results: We developed a procedure, termed “gap hunting”, to identify probes showing clustered distributions. Among 590 peripheral blood samples from the Study to Explore Early Development, we identified 11,007 “gap probes”. The vast majority (9,199) are likely attributable to one or more underlying SNPs or other variants in the probe, although some SNP-affected probes do not produce a gap signal. Specific factors predict which SNPs lead to gap signals, including type of nucleotide change, probe type, DNA strand, and overall methylation state. These expected effects are demonstrated in paired genotype and 450K data on the same samples. Gap probes can also serve as a surrogate for the local genetic sequence on a haplotype scale and can be used to adjust for population stratification. Conclusions: The characteristics of gap probes reflect potentially informative biology. QC pipelines may benefit from an efficient data-driven approach that “flags” gap probes, rather than filtering such probes, followed by careful interpretation of downstream association analyses. Our results should translate directly to the recently released Illumina 850K EPIC array given the similar chemistry and content design.
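The idea of a probe whose methylation values fall into distinct clusters separated by gaps can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the gap threshold of 0.05, and the 1% minimum share of samples outside the largest cluster are all hypothetical choices made here for illustration, since the abstract does not state any parameters.

```python
from typing import List, Sequence


def find_clusters(betas: Sequence[float], gap_threshold: float = 0.05) -> List[List[float]]:
    """Split one probe's beta values into clusters.

    Values are sorted, and a new cluster starts wherever two consecutive
    sorted values differ by more than gap_threshold (an assumed value).
    """
    ordered = sorted(betas)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > gap_threshold:
            clusters.append([curr])      # gap found: open a new cluster
        else:
            clusters[-1].append(curr)    # no gap: extend the current cluster
    return clusters


def is_gap_probe(
    betas: Sequence[float],
    gap_threshold: float = 0.05,
    min_minor_fraction: float = 0.01,
) -> bool:
    """Flag a probe as a candidate gap probe.

    A probe is flagged when it has at least two clusters and the samples
    outside the largest cluster make up at least min_minor_fraction of all
    samples (an assumed guard against gaps caused by a few outliers).
    """
    clusters = find_clusters(betas, gap_threshold)
    if len(clusters) < 2:
        return False
    largest = max(len(cluster) for cluster in clusters)
    minor_fraction = (len(betas) - largest) / len(betas)
    return minor_fraction >= min_minor_fraction


if __name__ == "__main__":
    # Toy values resembling three genotype-like groups at one probe.
    three_groups = [0.10, 0.11, 0.12, 0.48, 0.50, 0.52, 0.90, 0.91]
    print(len(find_clusters(three_groups)), is_gap_probe(three_groups))
```

In line with the abstract's recommendation, the boolean result is meant as a flag for later interpretation rather than a filter that drops the probe.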

RAFFI: Accurate and fast familial relationship inference in large scale biobank studies using RaPID

PLoS Genetics ◽

10.1371/journal.pgen.1009315 ◽

2021 ◽

Vol 17 (1) ◽

pp. e1009315

Author(s):

Ardalan Naseri ◽

Junjie Shi ◽

Xihong Lin ◽

Shaojie Zhang ◽

Degui Zhi

Keyword(s):

Large Scale ◽

Association Studies ◽

Scale Up ◽

Data Driven ◽

Genome Wide Association Studies ◽

Inference Method ◽

Genome Wide ◽

Familial Relationship ◽

Kinship Coefficients ◽

Data Driven Approach

Inference of relationships from whole-genome genetic data of a cohort is a crucial prerequisite for genome-wide association studies. Typically, relationships are inferred by computing the kinship coefficients (ϕ) and the genome-wide probability of zero IBD sharing (π0) among all pairs of individuals. Current leading methods are based on pairwise comparisons, which may not scale up to very large cohorts (e.g., sample size >1 million). Here, we propose an efficient relationship inference method, RAFFI. RAFFI first leverages the efficient RaPID method to call IBD segments, then estimates ϕ and π0 from the detected IBD segments. This inference is achieved by a data-driven approach that adjusts the estimation based on phasing quality and genotyping quality. Using simulations, we showed that RAFFI is robust against phasing/genotyping errors, admix events, and varying marker densities, and achieves higher accuracy compared to KING, the current leading method, especially for more distant relatives. When applied to the phased UK Biobank data with ~500K individuals, RAFFI is approximately 18 times faster than KING. We expect RAFFI will offer fast and accurate relatedness inference for even larger cohorts.
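As a rough illustration of how ϕ and π0 relate to detected IBD segments, the sketch below applies the textbook identities ϕ = k1/4 + k2/2 and π0 = 1 − k1 − k2, where k1 and k2 are the genome fractions shared IBD1 and IBD2. It is not RAFFI's estimator, which according to the abstract also adjusts for phasing and genotyping quality. The function name, the use of centimorgan lengths, and the assumption that the IBD1 and IBD2 segment lists do not overlap are all hypothetical.

```python
from typing import Sequence, Tuple


def kinship_from_ibd(
    ibd1_lengths_cm: Sequence[float],
    ibd2_lengths_cm: Sequence[float],
    genome_length_cm: float,
) -> Tuple[float, float]:
    """Return (phi, pi0) for one pair of individuals.

    ibd1_lengths_cm: lengths of segments where exactly one haplotype is shared.
    ibd2_lengths_cm: lengths of segments where both haplotypes are shared.
    The two lists are assumed to cover disjoint parts of the genome.
    """
    if genome_length_cm <= 0:
        raise ValueError("genome_length_cm must be positive")
    k1 = sum(ibd1_lengths_cm) / genome_length_cm  # fraction shared IBD1
    k2 = sum(ibd2_lengths_cm) / genome_length_cm  # fraction shared IBD2
    pi0 = 1.0 - k1 - k2                           # fraction with zero IBD sharing
    phi = k1 / 4.0 + k2 / 2.0                     # kinship coefficient
    return phi, pi0


if __name__ == "__main__":
    # Toy pair on a 100 cM "genome": half IBD1, a quarter IBD2.
    print(kinship_from_ibd([30.0, 20.0], [25.0], 100.0))
```

Summing per-pair segment lengths is linear in the number of detected segments, which is one reason a segment-first approach can avoid comparing every pair of individuals directly.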

A data-driven approach to preprocessing Illumina 450K methylation array data

BMC Genomics ◽

10.1186/1471-2164-14-293 ◽

2013 ◽

Vol 14 (1) ◽

pp. 293 ◽

Cited By ~ 490

Author(s):

Ruth Pidsley ◽

Chloe C Y Wong ◽

Manuela Volta ◽

Katie Lunnon ◽

Jonathan Mill ◽

...

Keyword(s):

Data Driven ◽

Methylation Array ◽

Array Data ◽

Illumina 450K ◽

Data Driven Approach ◽

450K Methylation

Exposome-Wide Association Studies: A Data-Driven Approach for Searching for Exposures Associated with Phenotype

Unraveling the Exposome ◽

10.1007/978-3-319-89321-1_12 ◽

2018 ◽

pp. 315-336

Author(s):

Chirag J. Patel

Keyword(s):

Association Studies ◽

Data Driven ◽

Data Driven Approach

Classification of Ventricular Arrhythmias Based on a Data Driven Approach

10.32920/ryerson.14657901 ◽

2021 ◽

Author(s):

Matthew Hotradat

Keyword(s):

Real Time ◽

Ventricular Arrhythmias ◽

Data Driven ◽

Term Therapy ◽

Computationally Efficient ◽

Therapy Options ◽

Data Driven Approach ◽

Efficient Data ◽

Long Term Therapy ◽

Over Time

Ventricular arrhythmias (VA) are dangerous pathophysiological conditions affecting the heart that evolve over time, resulting in different manifestations such as ventricular tachycardia (VT), organized VF (OVF), and disorganized VF (DVF). The success of resuscitation depends greatly on the type of VA and on swift administration of appropriate therapy options. This thesis attempts to arrive at computationally efficient, data-driven approaches for classifying and tracking VAs over time for two purposes: (1) ‘in-hospital’ scenarios for planning long-term therapy options, and (2) ‘out-of-hospital’ scenarios for tracking progression/segregation of VAs in near real-time. Using a database of 61 60-s ECG VA segments, maximum classification accuracies of 96.7% (AUC=0.993) and 87% (AUC=0.968) were achieved for VT vs. VF and OVF vs. DVF classification, respectively, for ‘in-hospital’/offline analysis. Two near real-time approaches were also developed for ‘out-of-hospital’ VA incidents, with results demonstrating high potential to track VA progression and segregation over time.

Classification of Ventricular Arrhythmias Based on a Data Driven Approach

10.32920/ryerson.14657901.v1 ◽

2021 ◽

Author(s):

Matthew Hotradat

Keyword(s):

Real Time ◽

Ventricular Arrhythmias ◽

Data Driven ◽

Term Therapy ◽

Computationally Efficient ◽

Therapy Options ◽

Data Driven Approach ◽

Efficient Data ◽

Long Term Therapy ◽

Over Time

Circulating proteins reveal prior use of menopausal hormonal therapy and increased risk of breast cancer

10.1101/2021.05.20.444934 ◽

2021 ◽

Author(s):

Cecilia E Thomas ◽

Leo Dahl ◽

Sanna Byström ◽

Yan Chen ◽

Mathias Uhlén ◽

...

Keyword(s):

Breast Cancer ◽

Risk Prediction ◽

Hormonal Therapy ◽

Prediction Models ◽

Breast Cancer Diagnosis ◽

Case Control ◽

Data Driven ◽

Association Analyses ◽

Increased Risk ◽

Data Driven Approach

Background: Risk prediction is crucial for early detection and prognosis of breast cancer. Circulating plasma proteins could provide a valuable source to increase the validity of risk prediction models; however, no such markers have yet been identified for clinical use. Methods: EDTA plasma samples from 183 breast cancer cases and 366 age-matched controls were collected prior to diagnosis from the Swedish breast cancer cohort KARMA. The samples were profiled on 700 circulating proteins using an exploratory affinity proteomics approach. Linear association analyses were performed on case-control status, and a data-driven analysis strategy was applied to cluster the women on their plasma proteome profiles in an unsupervised manner. The resulting clusters were subsequently annotated for differences in phenotypic characteristics, clinical parameters, and genetic risk. Results: Using the data-driven approach, we identified five clusters with distinct proteomic plasma profiles. Women in a particular sub-group (cluster 1) were significantly more likely to have used menopausal hormonal therapy (MHT), more likely to get a breast cancer diagnosis, and were older compared to the remaining clusters. The levels of circulating proteins in cluster 1 were decreased for proteins related to DNA repair and cell replication and increased for proteins related to mammographic density and female tissues. In contrast, classical dichotomous case-control analyses did not reveal any proteins significantly associated with future breast cancer. Conclusion: Using a data-driven approach, we identified a subset of women with circulating proteins associated with previous use of MHT and risk of breast cancer. Our findings point to the potential long-lasting effects of MHT on the circulating proteome even after ending the treatment, and hence provide valuable insights concerning risk prediction of breast cancer.

Quality control in the polypropylene manufacturing process: An efficient, data-driven approach

Journal of Applied Polymer Science ◽

10.1002/app.41312 ◽

2014 ◽

Vol 132 (3) ◽

Cited By ~ 1

Author(s):

Zhong Cheng ◽

Xinggao Liu

Keyword(s):

Quality Control ◽

Manufacturing Process ◽

Data Driven ◽

Data Driven Approach ◽

Efficient Data

Data Driven Approach to Forecast Building Occupant Complaints

Construction Research Congress 2020 ◽

10.1061/9780784482865.019 ◽

2020 ◽

Author(s):

Sena Assaf ◽

Mohamad Awada ◽

Issam Srour

Keyword(s):

Data Driven ◽

Data Driven Approach

Analysis and Interpretation of Multi-Source Data at the Hydraulic Fracturing Test Site: A Data-Driven Approach to Improve Well Performance Evaluation in Heterogeneous Formations

Proceedings of the 2020 Latin America Unconventional Resources Technology Conference ◽

10.15530/urtec-2020-1560 ◽

2020 ◽

Author(s):

Shadi Salahshoor

Keyword(s):

Performance Evaluation ◽

Hydraulic Fracturing ◽

Test Site ◽

Data Driven ◽

Well Performance ◽

Source Data ◽

Data Driven Approach ◽

Heterogeneous Formations

Extended-Range Prediction with Low-Dimensional, Stochastic-Dynamic Models: A Data-driven Approach

10.21236/ada572180 ◽

2012 ◽

Author(s):

Michael Ghil ◽

Mickael D. Chekroun ◽

Dmitri Kondrashov ◽

Michael K. Tippett ◽

Andrew Robertson ◽

...

Keyword(s):

Dynamic Models ◽

Data Driven ◽

Stochastic Dynamic ◽

Extended Range ◽

Data Driven Approach ◽

Low Dimensional
