Are methylation beta-values simplex distributed?

10.1101/753459 ◽

2019 ◽

Author(s):

Lara Nonell ◽

Juan R González

Keyword(s):

Dna Methylation ◽

Sample Size ◽

Microarray Data ◽

Regression Models ◽

Real Data ◽

Small Sample ◽

Beta Regression ◽

Data Sets ◽

Methylation Data ◽

Simplex Distribution

Abstract: DNA methylation plays an important role in the development and progression of disease. Beta-values are the standard methylation measures. Different statistical methods have been proposed to assess differences in methylation between conditions. However, most of them do not completely account for the distribution of beta-values. The simplex distribution can accommodate beta-values data. We hypothesize that simplex is a quite flexible distribution which is able to model methylation data. To test our hypothesis, we conducted several analyses using four real data sets obtained from microarrays and sequencing technologies. Standard data distributions were studied and modelled in comparison to the simplex. Besides, some simulations were conducted in different scenarios encompassing several distribution assumptions, regression models and sample sizes. Finally, we compared DNA methylation between females and males in order to benchmark the assessed methodologies under different scenarios. According to the results obtained by the simulations and real data analyses, DNA methylation data are concordant with the simplex distribution in many situations. Simplex regression models work well in small sample size data sets. However, when sample size increases, other models such as the beta regression or even the linear regression can be employed to assess group comparisons and obtain unbiased results. Based on these results, we can provide some practical recommendations when analyzing methylation data: 1) use data sets of at least 10 samples per studied condition for microarray data sets or 30 in NGS data sets, 2) apply a simplex or beta regression model for microarray data, 3) apply a linear model in any other case.
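
For readers unfamiliar with the simplex distribution named in this abstract, the following is a minimal sketch of its density on (0, 1), using the standard parameterisation with mean `mu` and dispersion `sigma2`. The function names `simplex_pdf` and `integrate_pdf` are illustrative assumptions and are not taken from the paper; the numerical check only confirms that the density integrates to roughly one.

```python
import math


def simplex_pdf(y: float, mu: float, sigma2: float) -> float:
    """Density of the simplex distribution with mean mu and dispersion sigma2."""
    if not 0.0 < y < 1.0:
        return 0.0
    # Unit deviance d(y; mu) of the simplex distribution.
    deviance = (y - mu) ** 2 / (y * (1.0 - y) * mu ** 2 * (1.0 - mu) ** 2)
    # Normalising term [2 * pi * sigma2 * {y (1 - y)}^3]^(1/2).
    norm = math.sqrt(2.0 * math.pi * sigma2 * (y * (1.0 - y)) ** 3)
    return math.exp(-deviance / (2.0 * sigma2)) / norm


def integrate_pdf(mu: float, sigma2: float, n: int = 20000) -> float:
    """Midpoint-rule integral of the density over (0, 1); should be close to 1."""
    h = 1.0 / n
    return sum(simplex_pdf((i + 0.5) * h, mu, sigma2) for i in range(n)) * h


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    print(simplex_pdf(0.3, mu=0.3, sigma2=1.0))
    print(integrate_pdf(mu=0.3, sigma2=1.0))
```

Because the density is defined directly on the open unit interval, beta-values can be modelled without a prior transformation, which is the property the abstract relies on.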

Beta regression improves the detection of differential DNA methylation for epigenetic epidemiology

10.1101/054643 ◽

2016 ◽

Cited By ~ 5

Author(s):

Timothy J. Triche ◽

Peter W. Laird ◽

Kimberly D. Siegmund

Keyword(s):

Dna Methylation ◽

Regression Models ◽

Statistical Power ◽

Linear Models ◽

Marginal Likelihood ◽

Tissue Type ◽

Beta Regression ◽

Epigenetic Mark ◽

Data Sets ◽

Control Procedures

Abstract Background: DNA methylation is the most readily assayed epigenetic mark, possessing confirmed relationships with gene expression, imprinting, and chromatin accessibility. Given the increasingly widespread use of DNA methylation microarrays in population-scale epidemiological applications, we sought to determine which methods provided the greatest statistical power to reproducibly detect differences in DNA methylation across various conditions, using publicly available data sets on tissue type and aging. Results: Beta regression, as proposed originally by Ferrari and Cribari-Neto, yielded more validated hits in each of our comparisons than any other method under consideration, both in a regression setting and in comparisons to two-group tests such as the Wilcoxon-Mann-Whitney, Student t, and Welch t tests. In large cohorts of whole blood samples, we corrected for compositional differences and batch effects, and found that marginal likelihood ratio tests from beta regression models uniformly dominate popular alternatives based on linear models. The superior sensitivity and specificity exhibited by beta regression in epidemiologically relevant cohort sizes corresponded to approximately a 2% increase in sensitivity at the same specificity when compared to linear models fitted on raw beta values (proportion of signal intensity due to the methylated allele), M-values, or rank-quantile normalized values. Conclusions: Investigators should consider beta regression to maximize statistical power in studies of DNA methylation using microarrays. At epidemiologically relevant sample sizes, with typical quality control procedures (compositional and batch effect correction), cross-cohort agreement uniformly favors beta regression over popular alternatives.
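
The abstract contrasts raw beta values with M-values. As a brief illustration, the sketch below uses the commonly cited logit-base-2 relationship between the two scales. The clipping constant `eps` and the function names are assumptions added here so that values of exactly 0 or 1 do not produce infinities; they do not come from the paper.

```python
import math


def beta_to_m(beta: float, eps: float = 1e-6) -> float:
    """Convert a beta value in [0, 1] to an M-value, M = log2(beta / (1 - beta))."""
    # Clip away from 0 and 1 (assumed safeguard) so the logarithm stays finite.
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m: float) -> float:
    """Inverse transform: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


if __name__ == "__main__":
    for beta in (0.1, 0.5, 0.8):
        m = beta_to_m(beta)
        print(beta, round(m, 4), round(m_to_beta(m), 4))
```

Linear models are often fitted on the M-value scale because it is unbounded, whereas beta regression works on the bounded beta scale directly.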

Reliability estimation in multicomponent stress-strength based on Erlang-truncated exponential distribution

International Journal of Quality & Reliability Management ◽

10.1108/ijqrm-11-2012-0147 ◽

2017 ◽

Vol 34 (3) ◽

pp. 438-445 ◽

Cited By ~ 2

Author(s):

Srinivasa Rao Gadde

Keyword(s):

Sample Size ◽

Research Work ◽

Real Data ◽

Small Sample ◽

Reliability Estimation ◽

Data Sets ◽

Content Type ◽

Reliability Estimates ◽

Average Mean Square Error ◽

Estimate Reliability

Purpose: The purpose of this paper is to consider the estimation of multicomponent stress-strength reliability. The system is regarded as alive only if at least s out of k (s<k) strengths exceed the stress. The reliability of such a system is obtained when the strength and stress variates are from the Erlang-truncated exponential (ETE) distribution with different shape parameters. The reliability is estimated using the maximum likelihood (ML) method of estimation when samples are drawn from strength and stress distributions. The reliability estimators are compared asymptotically. The small sample comparison of the reliability estimates is made through Monte Carlo simulation. Using real data sets the authors illustrate the procedure. Design/methodology/approach: The authors have developed multicomponent stress-strength reliability based on the ETE distribution. To estimate reliability, the parameters are estimated by using the ML method. Findings: The simulation results indicate that the average bias and average mean square error decrease as sample size increases for both methods of estimation in reliability. The length of the confidence interval also decreases as the sample size increases, and the simulated actual coverage probability is close to the nominal value in all sets of parameters considered here. Using real data, the authors illustrate the estimation process. Originality/value: This research work has been conducted independently and the results of the author’s research work are very useful for fresh researchers.
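
The s-out-of-k system described above can be illustrated with a small Monte Carlo sketch. Note the assumption: plain exponential distributions stand in for the strength and stress variates, since the abstract does not give the ETE density; the function name and rate parameters are hypothetical. The sketch estimates the probability that at least `s` of `k` independent strengths exceed a common stress.

```python
import random


def simulate_reliability(s: int, k: int, strength_rate: float,
                         stress_rate: float, n_trials: int = 100_000,
                         seed: int = 0) -> float:
    """Monte Carlo estimate of P(at least s of k strengths exceed the stress)."""
    rng = random.Random(seed)
    alive = 0
    for _ in range(n_trials):
        # One common stress acts on all k components (exponential stand-in).
        stress = rng.expovariate(stress_rate)
        # Count the components whose strength exceeds that stress.
        exceed = sum(rng.expovariate(strength_rate) > stress for _ in range(k))
        if exceed >= s:
            alive += 1
    return alive / n_trials


if __name__ == "__main__":
    # Hypothetical rates, chosen only for illustration.
    print(simulate_reliability(s=1, k=3, strength_rate=1.0, stress_rate=2.0))
    print(simulate_reliability(s=3, k=3, strength_rate=1.0, stress_rate=2.0))
```

For a single component with exponential strength rate a and stress rate b, the exact reliability is b / (a + b), which gives a simple check on the simulation.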

Complete deconvolution of DNA methylation signals from complex tissues: a geometric approach

Bioinformatics ◽

10.1093/bioinformatics/btaa930 ◽

2020 ◽

Author(s):

Weiwei Zhang ◽

Hao Wu ◽

Ziyi Li

Keyword(s):

Dna Methylation ◽

Geometric Approach ◽

Real Data ◽

Cell Types ◽

Supplementary Information ◽

Data Sets ◽

Methylation Data ◽

Cell Type ◽

Tissue Samples ◽

Different Cell Types

Abstract Motivation: It is a common practice in epigenetics research to profile DNA methylation on tissue samples, which are usually mixtures of different cell types. To properly account for the mixture, estimating cell compositions has been recognized as an important first step. Many methods were developed for quantifying cell compositions from DNA methylation data, but they mostly have limited applications due to lack of reference or prior information. Results: We develop Tsisal, a novel complete deconvolution method which accurately estimates cell compositions from DNA methylation data without any prior knowledge of cell types or their proportions. Tsisal is a full pipeline to estimate the number of cell types, estimate cell compositions, and identify cell-type-specific CpG sites. It can also assign cell type labels when a full or partial reference panel is available. Extensive simulation studies and analyses of seven real data sets demonstrate the favorable performance of our proposed method compared with existing deconvolution methods serving a similar purpose. Availability: The proposed method Tsisal is implemented as part of the R/Bioconductor package TOAST at https://bioconductor.org/packages/TOAST. Contact: [email protected] and [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.

The importance of input data quality and quantity in climate field reconstructions – results from the assimilation of various tree-ring collections

Climate of the Past ◽

10.5194/cp-16-1061-2020 ◽

2020 ◽

Vol 16 (3) ◽

pp. 1061-1074 ◽

Cited By ~ 2

Author(s):

Jörg Franke ◽

Veronika Valler ◽

Stefan Brönnimann ◽

Raphael Neukom ◽

Fernando Jaume-Santero

Keyword(s):

Sample Size ◽

Tree Ring ◽

Input Data ◽

Small Sample ◽

Screening Methods ◽

Temperature Sensitive ◽

Data Sets ◽

Large Sample Size ◽

Spatial Coverage ◽

Reconstruction Quality

Abstract. Differences between paleoclimatic reconstructions are caused by two factors: the method and the input data. While many studies compare methods, we will focus in this study on the consequences of the input data choice in a state-of-the-art Kalman-filter paleoclimate data assimilation approach. We evaluate reconstruction quality in the 20th century based on three collections of tree-ring records: (1) 54 of the best temperature-sensitive tree-ring chronologies chosen by experts; (2) 415 temperature-sensitive tree-ring records chosen less strictly by regional working groups and statistical screening; (3) 2287 tree-ring series that are not screened for climate sensitivity. The three data sets cover the range from small sample size, small spatial coverage and strict screening for temperature sensitivity to large sample size and spatial coverage but no screening. Additionally, we explore a combination of these data sets plus screening methods to improve the reconstruction quality. A large, unscreened collection generally leads to a poor reconstruction skill. A small expert selection of extratropical Northern Hemisphere records allows for a skillful high-latitude temperature reconstruction but cannot be expected to provide information for other regions and other variables. We achieve the best reconstruction skill across all variables and regions by combining all available input data but rejecting records with insignificant climatic information (p value of regression model >0.05) and removing duplicate records. It is important to use a tree-ring proxy system model that includes both major growth limitations, temperature and moisture.

Filtering high-dimensional methylation marks with extremely small sample size: an application to gastric cancer data

10.21203/rs.3.rs-284773/v1 ◽

2021 ◽

Author(s):

Xin Chen ◽

Qingrun Zhang ◽

Thierry Chekouo

Keyword(s):

Gastric Cancer ◽

Dna Methylation ◽

Sample Size ◽

Small Sample Size ◽

Small Sample ◽

Differential Methylation ◽

High Dimensional ◽

Cancer Data ◽

Cancer Pathogenesis ◽

A Genome

Abstract Background: DNA methylations in critical regions are highly involved in cancer pathogenesis and drug response. However, identifying causal methylations out of a large number of potential polymorphic DNA methylation sites is challenging. This high-dimensional data brings two obstacles: first, many established statistical models are not scalable to so many features; second, multiple testing and overfitting become serious. To this end, a method to quickly filter candidate sites to narrow down targets for downstream analyses is urgently needed. Methods: BACkPAy is a pre-screening Bayesian approach to detect biologically meaningful clusters of potential differential methylation levels with small sample size. BACkPAy prioritizes potentially important biomarkers by the Bayesian false discovery rate (FDR) approach. It filters non-informative sites (i.e. non-differential) with flat methylation pattern levels across experimental conditions. In this work, we applied BACkPAy to a genome-wide methylation dataset with 3 tissue types, each containing 3 gastric cancer samples. We also applied LIMMA (Linear Models for Microarray and RNA-Seq Data) to compare its results with what we achieved by BACkPAy. Then, Cox proportional hazards regression models were utilized to visualize prognostically significant markers with The Cancer Genome Atlas (TCGA) data for survival analysis. Results: Using BACkPAy, we identified 8 biologically meaningful clusters/groups of differential probes from the DNA methylation dataset. Using TCGA data, we also identified five prognostic genes (i.e. predictive of the progression of gastric cancer) that contain some differential methylation probes, whereas no significant results were identified using the Benjamini-Hochberg FDR in LIMMA. Conclusions: We showed the importance of using BACkPAy for the analysis of DNA methylation data with extremely small sample size in gastric cancer. We revealed that RDH13, CLDN11, TMTC1, UCHL1 and FOXP2 can serve as predictive biomarkers for gastric cancer treatment and the promoter methylation level of these five genes in serum could have prognostic and diagnostic functions in gastric cancer patients.
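
The abstract compares BACkPAy against the Benjamini-Hochberg FDR correction. As background, here is a minimal sketch of the Benjamini-Hochberg step-up adjustment for a list of p-values. It is a generic illustration and not the LIMMA or BACkPAy implementation; the function name and the example p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values for four CpG sites.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

With very few samples per group, raw p-values are rarely small enough to survive this adjustment across hundreds of thousands of sites, which is consistent with the null LIMMA result reported above.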

Characteristics and Determinants of New Start-ups in Gujarat, India

Entrepreneurship Review ◽

10.38157/entrepreneurship-review.v1i2.154 ◽

2020 ◽

Vol 1 (2) ◽

pp. 1-25

Author(s):

Ajay Kumar ◽

Bhim Jyoti

Keyword(s):

Sample Size ◽

Regression Models ◽

New Products ◽

Small Sample ◽

The Body ◽

High Tech ◽

Annual Sale ◽

Factors Affecting ◽

Start Up ◽

Start Ups

Purpose: This study examines the relationship of socio-economic characteristics of start-ups with their size in Gujarat, India. It also assesses the determinants affecting the annual sale of start-ups. Methods: It includes primary information based on a survey of 120 founders of start-ups. Linear and semi-log linear regression models have been applied to assess the determinants of start-ups. Probit regression models have been considered to assess the factors affecting the annual sale of the start-ups. Results: Stage of start-up, the participation of founders in conferences, educational qualification, new products launched by start-ups, professional connections of founders, source of funding, and support from incubator/accelerator/supporting organizations are found to be crucial determinants of start-up size in Gujarat. The annual sales of the start-ups are positively associated with stage of start-up, support from a mentor, team members, founder's academic qualification, collaboration with national or international organizations, and unskilled workers. Implications: Technology transfer and commercialization, development of new products, government regulations, the requirement of customers, free rights for entrepreneurs, appropriate financial support for new entrepreneurs, transparency and clarity in government policies, the establishment of high-tech start-ups, development of digital infrastructure, increase in R&D spending in research academia, and association of research institutions with entrepreneurs would be conducive to creating an appropriate start-up ecosystem and to reducing regional development disparities across Indian states. Subsequently, it would be helpful to increase sustainable development in India. Originality: This study has used primary information from 120 founders of start-ups to assess the determinants and the factors affecting annual sales of start-ups using regression models in Gujarat, India. Thus, it makes an empirical contribution to the body of knowledge. Limitations: This study could not provide rational justifications for most factors that show an insignificant impact on start-ups due to the small sample size. Further research, therefore, may be considered to identify the association of start-up size with the variables using a large sample size in India.

Utilization of Epigenome-wide DNA Methylation for Longitudinal Comparison of White Blood Cell Proportions Across Preeclamptic and Normotensive Pregnancy by Self-Reported Race

10.1101/2020.09.18.20197491 ◽

2020 ◽

Cited By ~ 1

Author(s):

Mitali Ray ◽

Lacey W. Heinsberg ◽

Yvette P. Conley ◽

James M. Roberts ◽

Arun Jeyabalan ◽

...

Keyword(s):

Dna Methylation ◽

Blood Cell ◽

B Cell ◽

White Blood Cell ◽

Nk Cell ◽

Calibration Method ◽

Small Sample ◽

Methylation Data ◽

Longitudinal Comparison ◽

The Relationship

Objective: We utilized epigenome-wide DNA methylation data to estimate/compare white blood cell (WBC) proportions in plasma across preeclamptic (case) and uncomplicated, normotensive (control) pregnancy. Methods: We previously collected methylation data using Infinium MethylationEPIC Beadchips during the three trimesters in 28 cases and 28 controls (21 Black, 7 White participants/group). We employed the Houseman regression calibration method to estimate and compare neutrophil, monocyte, B cell, NK cell, CD4+ T and CD8+ T cell proportions across pregnancy and between cases and controls. Results: We observed changes in WBC proportions across pregnancy within cases and controls that varied by cell type and race. Neutrophils represented the largest WBC mean proportion in all three trimesters for cases (Mean+/-SD: 67.2+/-9.6% to 74.4+/-12%) and controls (64.2+/-11% to 74.0+/-7.9%). Mean B cell proportions were significantly lower in cases than controls in Trimester 1 (5.25+/-0.02% versus 6.30+/-0.02%, p=0.02). The remaining mean cell proportions did not significantly differ in the overall sample. Stratified analyses revealed race-specific differences. In White participants (n=14): (1) neutrophil proportions were significantly higher in cases in Trimester 1 (p=0.04), but significantly lower in Trimester 2 (p=0.02), (2) B cell proportions were significantly lower in cases in Trimester 1 (p=0.001). No significant differences were detected among Black participants (n=42). Conclusions: Although chronic inflammation characterizes preeclampsia, few studies have investigated WBCs across pregnancy. We report differences between cases and controls across pregnancy. Our findings in a small sample demonstrate the need for additional studies investigating the relationship between race and WBCs in pregnancy, which could provide insight into preeclampsia pathophysiology.

DNA Methylation Markers for Pan-Cancer Prediction by Deep Learning

Genes ◽

10.3390/genes10100778 ◽

2019 ◽

Vol 10 (10) ◽

pp. 778 ◽

Cited By ~ 6

Author(s):

Liu ◽

Pan ◽

Li ◽

Yang ◽

...

Keyword(s):

Dna Methylation ◽

Deep Learning ◽

Sensitivity And Specificity ◽

Test Data ◽

Data Sets ◽

Methylation Data ◽

Average Sensitivity ◽

Validation Data ◽

Data Set ◽

Cancer Types

For cancer diagnosis, many DNA methylation markers have been identified. However, few studies have tried to identify DNA methylation markers to diagnose diverse cancer types simultaneously, i.e., pan-cancers. In this study, we tried to identify DNA methylation markers to differentiate cancer samples from the respective normal samples in pan-cancers. We collected whole genome methylation data of 27 cancer types containing 10,140 cancer samples and 3386 normal samples, and divided all samples into five data sets, including one training data set, one validation data set and three test data sets. We applied machine learning to identify DNA methylation markers, and specifically, we constructed diagnostic prediction models by deep learning. We identified two categories of markers: 12 CpG markers and 13 promoter markers. Three of the 12 CpG markers and four of the 13 promoter markers are located in cancer-related genes. With the CpG markers, our model achieved an average sensitivity and specificity on the test data sets of 92.8% and 90.1%, respectively. For promoter markers, the average sensitivity and specificity on the test data sets were 89.8% and 81.1%, respectively. Furthermore, in cell-free DNA methylation data of 163 prostate cancer samples, the CpG markers achieved a sensitivity of 100%, and the promoter markers achieved 92%. For both marker types, the specificity on normal whole blood was 100%. To conclude, we identified methylation markers to diagnose pan-cancers, which might be applied to liquid biopsy of cancers.
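
Since the results above are reported as sensitivity and specificity, a short sketch of how these two quantities are computed from binary labels may help. The labels below are made up for illustration and are unrelated to the paper's data; the function name is an assumption.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for labels coded 1 = cancer, 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Sensitivity: fraction of true cancer samples that were called cancer.
    sensitivity = tp / (tp + fn)
    # Specificity: fraction of true normal samples that were called normal.
    specificity = tn / (tn + fp)
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical labels: 10 cancer samples followed by 10 normal samples.
    truth = [1] * 10 + [0] * 10
    calls = [1] * 9 + [0] + [0] * 8 + [1] * 2
    print(sensitivity_specificity(truth, calls))
```

A cohort containing only cancer samples, such as the cell-free DNA set mentioned above, can yield a sensitivity estimate but not a specificity estimate, which is why specificity is reported separately on normal whole blood.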

armDNA: A functional beta model for detecting age-related genomewide DNA methylation marks

Statistical Methods in Medical Research ◽

10.1177/0962280216683571 ◽

2016 ◽

Vol 27 (9) ◽

pp. 2627-2640

Author(s):

Chenyang Wang ◽

Qi Shen ◽

Li Du ◽

Jinfeng Xu ◽

Hong Zhang

Keyword(s):

Dna Methylation ◽

Association Studies ◽

Rapid Development ◽

Complex Diseases ◽

Real Data ◽

Wald Test ◽

Methylation Data ◽

Unknown Parameters ◽

Diagnosis And Prognosis ◽

Age Related

DNA methylation has been shown to play an important role in many complex diseases. The rapid development of high-throughput DNA methylation scan technologies provides great opportunities for genomewide DNA methylation-disease association studies. As methylation is a dynamic process involving time, it is quite plausible that age contributes to its variation to a large extent. Therefore, in analyzing genomewide DNA methylation data, it is important to identify age-related DNA methylation marks and delineate their functional relationship. This helps us to better understand the underlying biological mechanism and facilitate early diagnosis and prognosis analysis of complex diseases. We develop a functional beta model for analyzing DNA methylation data and detecting age-related DNA methylation marks on the whole genome by naturally taking sampling scheme into account and accommodating flexible age-methylation dynamics. We focus on DNA methylation data obtained through the widely used bisulfite conversion technique and propose to use a beta model to relate the DNA methylation level to the age. Adjusting for certain confounders, the functional age effect is left completely unspecified, offering great flexibility and allowing extra data dynamics. An efficient algorithm is developed for estimating unknown parameters, and the Wald test is used to detect age-related DNA methylation marks. Simulation studies and several real data applications were provided to demonstrate the performance of the proposed method.

Low variability in the underlying cellular landscape adversely affects the performance of interaction-based approaches for conducting cell-specific analyses of DNA methylation in bulk samples

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2021-0004 ◽

2021 ◽

Vol 0 (0) ◽

Author(s):

Richard Meier ◽

Emily Nissen ◽

Devin C. Koestler

Keyword(s):

Dna Methylation ◽

Whole Blood ◽

Statistical Power ◽

Tissue Sample ◽

Estimation Error ◽

Cell Types ◽

Data Sets ◽

Methylation Data ◽

Interaction Terms ◽

Bulk Tissue

Abstract: Statistical methods that allow for cell type specific DNA methylation (DNAm) analyses based on bulk-tissue methylation data have great potential to improve our understanding of human disease and have created unprecedented opportunities for new insights using the wealth of publicly available bulk-tissue methylation data. These methodologies involve incorporating interaction terms formed between the phenotypes/exposures of interest and proportions of the cell types underlying the bulk-tissue sample used for DNAm profiling. Despite growing interest in such “interaction-based” methods, there has been no comprehensive assessment of how variability in the cellular landscape across study samples affects their performance. To answer this question, we used numerous publicly available whole-blood DNAm data sets along with extensive simulation studies and evaluated the performance of interaction-based approaches in detecting cell-specific methylation effects. Our results show that low cell proportion variability results in large estimation error and low statistical power for detecting cell-specific effects of DNAm. Further, we identified that many studies targeting methylation profiling in whole blood may be at risk of being underpowered due to low variability in the cellular landscape across study samples. Finally, we discuss guidelines for researchers seeking to conduct studies utilizing interaction-based approaches to help ensure that their studies are adequately powered.
