Efficient coalescent simulation and genealogical analysis for large sample sizes

2015 ◽  
Author(s):  
Jerome Kelleher ◽  
Alison M Etheridge ◽  
Gil McVean

A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present-day coalescent simulations either do not scale well or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods to store genealogies consume a great deal of space, are slow to parse and do not take advantage of shared structure in correlated trees. We solve these problems by introducing sparse trees and coalescence records as the key units of genealogical analysis. Using these tools, exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods. We can also analyse the results orders of magnitude more quickly than with existing methods.
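
As a rough illustration of the record-based representation this abstract describes, the sketch below simulates a single Kingman coalescent tree without recombination and stores it as a list of coalescence records. It is a simplified sketch, not the authors' algorithm: the function name `simulate_kingman`, the record field names, and the single fixed interval `[0, length)` are all assumptions made for illustration.

```python
import random
from collections import namedtuple

# Hypothetical record layout: the genomic interval [left, right) over which
# `children` merge into the ancestor `node` at the given `time`.
CoalescenceRecord = namedtuple(
    "CoalescenceRecord", ["left", "right", "node", "children", "time"]
)


def simulate_kingman(n, length=1.0, seed=None):
    """Simulate one coalescent tree for n samples with no recombination.

    Time is measured in coalescent units, so k lineages coalesce at total
    rate k * (k - 1) / 2. Samples are nodes 0..n-1; ancestors are numbered
    from n upwards in the order they are created.
    """
    rng = random.Random(seed)
    lineages = list(range(n))
    next_node = n
    time = 0.0
    records = []
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time until the next pairwise coalescence.
        time += rng.expovariate(k * (k - 1) / 2)
        # Choose the two lineages that merge, uniformly at random.
        a, b = rng.sample(lineages, 2)
        lineages.remove(a)
        lineages.remove(b)
        records.append(
            CoalescenceRecord(0.0, length, next_node, (min(a, b), max(a, b)), time)
        )
        lineages.append(next_node)
        next_node += 1
    return records


if __name__ == "__main__":
    for record in simulate_kingman(5, seed=42):
        print(record)
```

A tree on n samples needs only n - 1 such records. With recombination, each record would cover only the interval over which that coalescence holds, so neighbouring trees that differ in a few branches could reuse most of their records.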

2016 ◽  
Vol 12 (5) ◽  
pp. e1004842 ◽  
Author(s):  
Jerome Kelleher ◽  
Alison M Etheridge ◽  
Gilean McVean

2021 ◽  
Vol 15 (1) ◽  
Author(s):  
Weitong Cui ◽  
Huaru Xue ◽  
Lei Wei ◽  
Jinghua Jin ◽  
Xuewen Tian ◽  
...  

Abstract Background: RNA sequencing (RNA-Seq) has been widely applied in oncology for monitoring transcriptome changes. However, the emerging problem that high variation of gene expression levels caused by tumor heterogeneity may affect the reproducibility of differential expression (DE) results has rarely been studied. Here, we investigated the reproducibility of DE results for any given number of biological replicates between 3 and 24 and explored why a great many differentially expressed genes (DEGs) were not reproducible. Results: Our findings demonstrate that poor reproducibility of DE results exists not only for small sample sizes, but also for relatively large sample sizes. Quite a few of the DEGs detected are specific to the samples in use, rather than genuinely differentially expressed under different conditions. Poor reproducibility of DE results is mainly caused by high variation of gene expression levels for the same gene in different samples. Even though biological variation may account for much of the high variation of gene expression levels, the effect of outlier count data also needs to be treated seriously, as outlier data severely interfere with DE analysis. Conclusions: High heterogeneity exists not only in tumor tissue samples of each cancer type studied, but also in normal samples. High heterogeneity leads to poor reproducibility of DEGs, undermining generalization of differential expression results. Therefore, it is necessary to use large sample sizes (at least 10 if possible) in RNA-Seq experimental designs to reduce the impact of biological variability, and DE results should be interpreted cautiously unless soundly validated.


BMJ ◽  
2009 ◽  
Vol 338 (feb25 2) ◽  
pp. b737-b737 ◽  
Author(s):  
J. Fletcher

2019 ◽  
Vol 9 (4) ◽  
pp. 813-850 ◽  
Author(s):  
Jay Mardia ◽  
Jiantao Jiao ◽  
Ervin Tánczos ◽  
Robert D Nowak ◽  
Tsachy Weissman

Abstract We study concentration inequalities for the Kullback–Leibler (KL) divergence between the empirical distribution and the true distribution. Applying a recursion technique, we improve over the method of types bound uniformly in all regimes of sample size $n$ and alphabet size $k$, and the improvement becomes more significant when $k$ is large. We discuss the applications of our results in obtaining tighter concentration inequalities for $L_1$ deviations of the empirical distribution from the true distribution, and the difference between concentration around the expectation or zero. We also obtain asymptotically tight bounds on the variance of the KL divergence between the empirical and true distribution, and demonstrate their quantitatively different behaviours between small and large sample sizes compared to the alphabet size.
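
To make the quantity in this abstract concrete, the sketch below estimates by Monte Carlo how often the KL divergence between the empirical and true distributions exceeds a threshold, and compares that with a classical method-of-types bound. It is a minimal sketch under stated assumptions: the bound is taken in the textbook form $(n+1)^k e^{-n\varepsilon}$ (with the divergence in nats), which may differ from the exact form the paper improves on, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def kl_divergence(q, p):
    """KL divergence D(q || p) in nats; terms with q_i == 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def empirical_distribution(p, n, rng):
    """Draw n i.i.d. symbols from p and return their relative frequencies."""
    counts = Counter(rng.choices(range(len(p)), weights=p, k=n))
    return [counts[i] / n for i in range(len(p))]


def method_of_types_bound(n, k, eps):
    """Textbook bound (n + 1)^k * exp(-n * eps), capped at 1.

    Computed in log space so that large k does not overflow.
    """
    return math.exp(min(0.0, k * math.log(n + 1) - n * eps))


def tail_estimate(p, n, eps, trials=2000, seed=0):
    """Monte Carlo estimate of P(D(empirical || p) >= eps)."""
    rng = random.Random(seed)
    hits = sum(
        kl_divergence(empirical_distribution(p, n, rng), p) >= eps
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # illustrative true distribution, k = 3
    n, eps = 200, 0.02
    print("estimated tail:", tail_estimate(p, n, eps))
    print("method-of-types bound:", method_of_types_bound(n, len(p), eps))
```

The polynomial factor $(n+1)^k$ grows quickly with the alphabet size $k$, which gives some intuition for why improvements over this style of bound matter most when $k$ is large.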


2016 ◽  
Vol 27 (5) ◽  
pp. 1559-1574 ◽  
Author(s):  
Andrew Carkeet ◽  
Yee Teng Goh

Bland and Altman described approximate methods in 1986 and 1999 for calculating confidence limits for their 95% limits of agreement, approximations which assume large subject numbers. In this paper, these approximations are compared with exact confidence intervals calculated using two-sided tolerance intervals for a normal distribution. The approximations are compared in terms of the tolerance factors themselves but also in terms of the exact confidence limits and the exact limits of agreement coverage corresponding to the approximate confidence interval methods. Using similar methods, the 50th percentile of the tolerance interval is compared with the k values of 1.96 and 2, which Bland and Altman used to define limits of agreement (i.e. [Formula: see text] ± 1.96Sd and [Formula: see text] ± 2Sd). For limits of agreement outer confidence intervals, Bland and Altman’s approximations are too permissive for sample sizes <40 (1999 approximation) and <76 (1986 approximation). For inner confidence limits the approximations are poorer, being permissive for sample sizes of <490 (1986 approximation) and all practical sample sizes (1999 approximation). Exact confidence intervals for 95% limits of agreement, based on two-sided tolerance factors, can be calculated easily based on tables and should be used in preference to the approximate methods, especially for small sample sizes.
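
For orientation, the sketch below computes limits of agreement from a list of paired differences, together with the two approximate standard errors of a limit. The abstract does not state these formulas, so they are assumptions: the forms used here, $\sqrt{3s^2/n}$ for 1986 and $\sqrt{(1/n + 1.96^2/(2(n-1)))\,s^2}$ for 1999, are the ones commonly attributed to Bland and Altman. The exact tolerance-factor method that the paper recommends is not implemented, and the function names are hypothetical.

```python
import math
import statistics


def limits_of_agreement(differences, k=1.96):
    """Return (lower, upper) limits: mean difference -/+ k * Sd."""
    mean_d = statistics.mean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation (n - 1)
    return mean_d - k * sd, mean_d + k * sd


def approx_standard_errors(differences):
    """Approximate standard errors of a limit of agreement.

    Both formulas are assumptions (commonly cited forms), not taken from
    the abstract. A confidence interval for a limit would be
    limit -/+ t * SE, with t a Student-t critical value on n - 1 degrees
    of freedom, which must be supplied separately.
    """
    n = len(differences)
    var = statistics.variance(differences)
    se_1986 = math.sqrt(3 * var / n)
    se_1999 = math.sqrt((1 / n + 1.96 ** 2 / (2 * (n - 1))) * var)
    return se_1986, se_1999


if __name__ == "__main__":
    diffs = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up paired differences
    print("limits of agreement:", limits_of_agreement(diffs))
    print("approximate SEs (1986, 1999):", approx_standard_errors(diffs))
```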


2000 ◽  
Vol 21 (1) ◽  
pp. 57-58
Author(s):  
David Birnbaum

Abstract If you have calculated a confidence interval for an infection rate and found the interval extending into meaningless negative numbers, chances are the error is due to use of approximation formulae. Many of us unknowingly were taught to use the Wald approximation, which does not always approximate the exact binomial distribution accurately. Poor approximation can occur in infection surveillance at both small and large sample sizes.
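
The problem described in this abstract can be reproduced with a short sketch that compares the Wald interval with an exact binomial (Clopper-Pearson) interval found by bisection. The example count of 1 infection in 50 patients is made up for illustration, and the function names are hypothetical; the abstract does not say which exact method the author recommends.

```python
import math


def wald_interval(x, n, z=1.96):
    """Wald (normal approximation) interval: p -/+ z * sqrt(p(1 - p) / n)."""
    p = x / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(g, lo=0.0, hi=1.0, iterations=100):
    """Bisection root of an increasing function g on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson_interval(x, n, alpha=0.05):
    """Exact binomial interval obtained by inverting two one-sided tests."""
    if x == 0:
        lower = 0.0
    else:
        # Smallest p with P(X >= x | p) = alpha / 2.
        lower = _solve_increasing(lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    if x == n:
        upper = 1.0
    else:
        # Largest p with P(X <= x | p) = alpha / 2.
        upper = _solve_increasing(lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper


if __name__ == "__main__":
    x, n = 1, 50  # made-up example: 1 infection among 50 patients
    print("Wald: ", wald_interval(x, n))             # lower limit is negative
    print("Exact:", clopper_pearson_interval(x, n))  # stays within [0, 1]
```

The exact interval is asymmetric around the observed rate, which is how it avoids the negative lower limit that the symmetric Wald formula produces for rare events.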

