Efficient coalescent simulation and genealogical analysis for large sample sizes

Mapping Intimacies ◽

10.1101/033118 ◽

2015 ◽

Cited By ~ 2

Author(s):

Jerome Kelleher ◽

Alison M Etheridge ◽

Gil McVean

Keyword(s):

Genetic Variation ◽

Long Range ◽

Coalescent Simulation ◽

Sample Sizes ◽

Approximate Methods ◽

Exact Simulation ◽

Large Sample ◽

Genealogical Analysis ◽

Coalescent Simulations ◽

Shared Structure

A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present day coalescent simulations do not scale well, or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods to store genealogies consume a great deal of space, are slow to parse and do not take advantage of shared structure in correlated trees. We solve these problems by introducing sparse trees and coalescence records as the key units of genealogical analysis. Using these tools, exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods. We can also analyse the results orders of magnitude more quickly than with existing methods.

Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes

PLoS Computational Biology ◽

10.1371/journal.pcbi.1004842 ◽

2016 ◽

Vol 12 (5) ◽

pp. e1004842 ◽

Cited By ~ 215

Author(s):

Jerome Kelleher ◽

Alison M Etheridge ◽

Gilean McVean

Keyword(s):

Coalescent Simulation ◽

Sample Sizes ◽

Large Sample ◽

Genealogical Analysis

Faculty Opinions recommendation of Efficient coalescent simulation and genealogical analysis for large sample sizes.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.726331054.793533575 ◽

2017 ◽

Author(s):

Yun S Song

Keyword(s):

Coalescent Simulation ◽

Sample Sizes ◽

Large Sample ◽

Genealogical Analysis

High heterogeneity undermines generalization of differential expression results in RNA-Seq analysis

Human Genomics ◽

10.1186/s40246-021-00308-5 ◽

2021 ◽

Vol 15 (1) ◽

Author(s):

Weitong Cui ◽

Huaru Xue ◽

Lei Wei ◽

Jinghua Jin ◽

Xuewen Tian ◽

...

Keyword(s):

Gene Expression ◽

Differential Expression ◽

Small Sample ◽

Differentially Expressed ◽

Cancer Type ◽

Rna Seq ◽

Sample Sizes ◽

Large Sample ◽

Expression Levels ◽

Gene Expression Levels

Abstract. Background: RNA sequencing (RNA-Seq) has been widely applied in oncology for monitoring transcriptome changes. However, the emerging problem that high variation of gene expression levels caused by tumor heterogeneity may affect the reproducibility of differential expression (DE) results has rarely been studied. Here, we investigated the reproducibility of DE results for any given number of biological replicates between 3 and 24 and explored why a great many differentially expressed genes (DEGs) were not reproducible. Results: Our findings demonstrate that poor reproducibility of DE results exists not only for small sample sizes, but also for relatively large sample sizes. Quite a few of the DEGs detected are specific to the samples in use, rather than genuinely differentially expressed under different conditions. Poor reproducibility of DE results is mainly caused by high variation of gene expression levels for the same gene in different samples. Even though biological variation may account for much of the high variation of gene expression levels, the effect of outlier count data also needs to be treated seriously, as outlier data severely interfere with DE analysis. Conclusions: High heterogeneity exists not only in tumor tissue samples of each cancer type studied, but also in normal samples. High heterogeneity leads to poor reproducibility of DEGs, undermining generalization of differential expression results. Therefore, it is necessary to use large sample sizes (at least 10 if possible) in RNA-Seq experimental designs to reduce the impact of biological variability, and DE results should be interpreted cautiously unless soundly validated.

Very large sample sizes

BMJ ◽

10.1136/bmj.b737 ◽

2009 ◽

Vol 338 (feb25 2) ◽

pp. b737-b737 ◽

Cited By ~ 1

Author(s):

J. Fletcher

Keyword(s):

Sample Sizes ◽

Large Sample

Concentration inequalities for the empirical distribution of discrete distributions: beyond the method of types

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iaz025 ◽

2019 ◽

Vol 9 (4) ◽

pp. 813-850 ◽

Cited By ~ 7

Author(s):

Jay Mardia ◽

Jiantao Jiao ◽

Ervin Tánczos ◽

Robert D Nowak ◽

Tsachy Weissman

Keyword(s):

Sample Size ◽

Empirical Distribution ◽

Discrete Distributions ◽

Concentration Inequalities ◽

Sample Sizes ◽

Alphabet Size ◽

Large Sample ◽

Kl Divergence ◽

The Difference ◽

True Distribution

Abstract. We study concentration inequalities for the Kullback–Leibler (KL) divergence between the empirical distribution and the true distribution. Applying a recursion technique, we improve over the method of types bound uniformly in all regimes of sample size $n$ and alphabet size $k$, and the improvement becomes more significant when $k$ is large. We discuss the applications of our results in obtaining tighter concentration inequalities for $L_1$ deviations of the empirical distribution from the true distribution, and the difference between concentration around the expectation or zero. We also obtain asymptotically tight bounds on the variance of the KL divergence between the empirical and true distribution, and demonstrate their quantitatively different behaviours between small and large sample sizes compared to the alphabet size.

Confidence and coverage for Bland–Altman limits of agreement and their approximate confidence intervals

Statistical Methods in Medical Research ◽

10.1177/0962280216665419 ◽

2016 ◽

Vol 27 (5) ◽

pp. 1559-1574 ◽

Cited By ~ 22

Author(s):

Andrew Carkeet ◽

Yee Teng Goh

Keyword(s):

Confidence Intervals ◽

Small Sample ◽

Interval Methods ◽

Tolerance Interval ◽

Confidence Limits ◽

Sample Sizes ◽

Limits Of Agreement ◽

Approximate Methods ◽

Exact Confidence Intervals ◽

Tolerance Factors

Bland and Altman described approximate methods in 1986 and 1999 for calculating confidence limits for their 95% limits of agreement, approximations which assume large subject numbers. In this paper, these approximations are compared with exact confidence intervals calculated using two-sided tolerance intervals for a normal distribution. The approximations are compared in terms of the tolerance factors themselves but also in terms of the exact confidence limits and the exact limits of agreement coverage corresponding to the approximate confidence interval methods. Using similar methods, the 50th percentile of the tolerance interval is compared with the k values of 1.96 and 2, which Bland and Altman used to define limits of agreement (i.e. mean difference ± 1.96 SD and mean difference ± 2 SD). For limits of agreement outer confidence intervals, Bland and Altman’s approximations are too permissive for sample sizes <40 (1999 approximation) and <76 (1986 approximation). For inner confidence limits the approximations are poorer, being permissive for sample sizes of <490 (1986 approximation) and all practical sample sizes (1999 approximation). Exact confidence intervals for 95% limits of agreement, based on two-sided tolerance factors, can be calculated easily based on tables and should be used in preference to the approximate methods, especially for small sample sizes.
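
As a minimal illustration of the quantity this abstract discusses, the sketch below computes limits of agreement as the mean of the paired differences plus or minus k sample standard deviations, for k = 1.96 and k = 2. The function name `limits_of_agreement` and the measurements are hypothetical and not taken from the paper. The sketch does not implement the 1986 or 1999 approximate confidence intervals, nor the exact tolerance-interval method.

```python
import statistics


def limits_of_agreement(method_a, method_b, k=1.96):
    """Return (lower, upper) limits of agreement for paired measurements.

    The limits are mean(d) - k * sd(d) and mean(d) + k * sd(d), where d is
    the list of paired differences and sd is the sample standard deviation.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD, n - 1 in the denominator
    return mean_diff - k * sd_diff, mean_diff + k * sd_diff


# Hypothetical paired readings from two measurement methods.
readings_a = [11, 9, 12, 10, 8]
readings_b = [10, 10, 10, 10, 10]

for k in (1.96, 2.0):
    lower, upper = limits_of_agreement(readings_a, readings_b, k)
    print(f"k = {k}: limits of agreement = ({lower:.3f}, {upper:.3f})")
```

With only five pairs, as here, these limits are themselves very uncertain, which is the small-sample situation in which the abstract recommends exact confidence intervals over the approximations.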

Attributes Control Charts with Large Sample Sizes

Journal of Quality Technology ◽

10.1080/00224065.1996.11979703 ◽

1996 ◽

Vol 28 (4) ◽

pp. 451-459 ◽

Cited By ~ 19

Author(s):

Peter A. Heimann

Keyword(s):

Control Charts ◽

Sample Sizes ◽

Large Sample

Bringing data to the surface: recovering data loggers for large sample sizes from marine vertebrates

Animal Biotelemetry ◽

10.1186/s40317-016-0105-8 ◽

2016 ◽

Vol 4 (1) ◽

Cited By ~ 9

Author(s):

Karissa O. Lear ◽

Nicholas M. Whitney

Keyword(s):

Sample Sizes ◽

Large Sample ◽

Data Loggers ◽

Marine Vertebrates

What's the Score?

Infection Control and Hospital Epidemiology ◽

10.1086/501701 ◽

2000 ◽

Vol 21 (1) ◽

pp. 57-58

Author(s):

David Birnbaum

Keyword(s):

Confidence Interval ◽

Infection Rate ◽

Binomial Distribution ◽

Sample Sizes ◽

Negative Numbers ◽

Large Sample ◽

Infection Surveillance

Abstract. If you have calculated a confidence interval for an infection rate and found the interval extending into meaningless negative numbers, chances are the error is due to use of approximation formulae. Many of us unknowingly were taught to use the Wald approximation, which does not always approximate the exact binomial distribution accurately. Poor approximation can occur in infection surveillance at both small and large sample sizes.
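
The negative lower bound described in this abstract is easy to reproduce. The sketch below computes the Wald interval for a hypothetical surveillance count (1 infection in 50 patients, a figure not taken from the article) and contrasts it with the Wilson score interval. The Wilson interval is used here only as a convenient closed-form alternative; it is not the exact binomial (Clopper-Pearson) interval the abstract refers to, and the function names are illustrative.

```python
import math


def wald_interval(events, n, z=1.96):
    """Wald (normal approximation) interval: p +/- z * sqrt(p(1-p)/n)."""
    p = events / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def wilson_interval(events, n, z=1.96):
    """Wilson score interval, which always stays within [0, 1]."""
    p = events / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical surveillance data: 1 infection observed in 50 patients.
events, patients = 1, 50
wald_low, wald_high = wald_interval(events, patients)
wilson_low, wilson_high = wilson_interval(events, patients)

print(f"Wald:   ({wald_low:.4f}, {wald_high:.4f})")    # lower bound is negative
print(f"Wilson: ({wilson_low:.4f}, {wilson_high:.4f})")  # lower bound stays positive
```

For this count the Wald lower bound falls below zero, which is meaningless for a rate, while the Wilson bounds remain between 0 and 1.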

Statistical process control charts for attribute data involving very large sample sizes: a review of problems and solutions

BMJ Quality & Safety ◽

10.1136/bmjqs-2012-001373 ◽

2013 ◽

Vol 22 (4) ◽

pp. 362-368 ◽

Cited By ~ 12

Author(s):

Mohammed A Mohammed ◽

Jagdeep S Panesar ◽

David B Laney ◽

Richard Wilson

Keyword(s):

Process Control ◽

Statistical Process Control ◽

Control Charts ◽

Sample Sizes ◽

Statistical Process ◽

Large Sample ◽

Attribute Data ◽

Process Control Charts ◽

Problems And Solutions ◽

Statistical Process Control Charts
