Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data

2014
Author(s):
Anand Bhaskar
Y.X. Rachel Wang
Yun S. Song

With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal which is difficult to pick up with small sample sizes. Lastly, we apply our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing dataset of tens of thousands of individuals assayed at a few hundred genic regions.
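As a point of reference for the "expected frequency spectrum under the coalescent" mentioned above, the sketch below covers only the simplest textbook case. It assumes a constant population size and the infinite-sites model, under which the expected number of sites carrying b derived alleles in a sample of n haploids is θ/b. It is not the authors' piecewise-exponential method, and the function names are hypothetical.

```python
def expected_sfs(n, theta=1.0):
    """Expected site frequency spectrum for n haploid samples.

    Assumption: constant population size, neutral infinite-sites model,
    so E[xi_b] = theta / b for b = 1, ..., n - 1.
    """
    if n < 2:
        raise ValueError("need at least two sampled haplotypes")
    return [theta / b for b in range(1, n)]


def normalized_sfs(n):
    """Expected SFS rescaled to sum to 1 (theta cancels out)."""
    sfs = expected_sfs(n)
    total = sum(sfs)
    return [x / total for x in sfs]


if __name__ == "__main__":
    # Singletons dominate; each higher frequency class is rarer.
    for b, share in enumerate(normalized_sfs(6), start=1):
        print(f"b={b}: {share:.3f}")
```

Departures of an observed spectrum from this 1/b shape, such as an excess of rare variants, are the kind of signal that frequency-spectrum methods use to infer population growth.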

2021
Vol 15 (1)
Author(s):
Weitong Cui
Huaru Xue
Lei Wei
Jinghua Jin
Xuewen Tian
...  

Background: RNA sequencing (RNA-Seq) has been widely applied in oncology for monitoring transcriptome changes. However, the emerging problem that high variation of gene expression levels caused by tumor heterogeneity may affect the reproducibility of differential expression (DE) results has rarely been studied. Here, we investigated the reproducibility of DE results for any given number of biological replicates between 3 and 24 and explored why a great many differentially expressed genes (DEGs) were not reproducible. Results: Our findings demonstrate that poor reproducibility of DE results exists not only for small sample sizes, but also for relatively large sample sizes. Quite a few of the DEGs detected are specific to the samples in use, rather than genuinely differentially expressed under different conditions. Poor reproducibility of DE results is mainly caused by high variation of gene expression levels for the same gene in different samples. Even though biological variation may account for much of the high variation of gene expression levels, the effect of outlier count data also needs to be treated seriously, as outlier data severely interfere with DE analysis. Conclusions: High heterogeneity exists not only in tumor tissue samples of each cancer type studied, but also in normal samples. High heterogeneity leads to poor reproducibility of DEGs, undermining generalization of differential expression results. Therefore, it is necessary to use large sample sizes (at least 10 if possible) in RNA-Seq experimental designs to reduce the impact of biological variability, and DE results should be interpreted cautiously unless soundly validated.


2021
Vol 19 (1)
pp. 2-25
Author(s):  
Seongah Im

This study examined the performance of the beta-binomial model in comparison with generalized estimating equations (GEE) using clustered binary responses resulting in non-normal outcomes. Monte Carlo simulations were performed under varying intracluster correlations and sample sizes. The results showed that the beta-binomial model performed better for small samples, while GEE performed well for large samples.
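To make the simulation setting concrete, here is a minimal sketch of how clustered binary responses with a chosen intracluster correlation can be generated from a beta-binomial model. It is an illustration under stated assumptions, not the study's actual design: each cluster draws its own success probability from a Beta(a, b) distribution, with a and b chosen so that the mean is `mean` and the intracluster correlation 1/(a + b + 1) equals `icc`. All function and parameter names are hypothetical.

```python
import random


def beta_params(mean, icc):
    """Beta(a, b) parameters giving the requested mean and intracluster correlation.

    For a beta-binomial model the intracluster correlation is 1 / (a + b + 1),
    so a + b = (1 - icc) / icc.
    """
    if not (0 < mean < 1 and 0 < icc < 1):
        raise ValueError("mean and icc must lie strictly between 0 and 1")
    scale = (1 - icc) / icc
    return mean * scale, (1 - mean) * scale


def simulate_clusters(n_clusters, cluster_size, mean, icc, seed=0):
    """Return the number of successes in each simulated cluster."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    a, b = beta_params(mean, icc)
    counts = []
    for _ in range(n_clusters):
        p = rng.betavariate(a, b)  # cluster-specific success probability
        counts.append(sum(rng.random() < p for _ in range(cluster_size)))
    return counts


if __name__ == "__main__":
    print(simulate_clusters(n_clusters=5, cluster_size=10, mean=0.3, icc=0.2))
```

Varying `n_clusters` and `icc` in such a generator is one way to set up the kind of Monte Carlo comparison the abstract describes; fitting the beta-binomial and GEE models to the simulated counts is left out here.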


Author(s):  
David V Glidden

With the scale-up of HIV pre-exposure prophylaxis (PrEP) with tenofovir (TDF) with or without emtricitabine (FTC), we have entered an era of highly effective HIV prevention with a growing pipeline of potential products to be studied. These studies are likely to be randomized trials with an oral TDF/FTC control arm. They require comparison of incident infections and can be time- and resource-intensive. Conventional approaches to the design and analysis of active-controlled trials can lead to very large sample sizes. We demonstrate the importance of assumptions about background infections for interpreting trial results and suggest alternative criteria for demonstrating the efficacy and effectiveness of potential PrEP agents.


2021
Author(s):
Gautam Upadhya
Matthias Steinruecken

Unraveling the complex demographic histories of natural populations is a central problem in population genetics. Understanding past demographic events is of general anthropological interest and is also an important step in establishing accurate null models when identifying adaptive or disease-associated genetic variation. An important class of tools for inferring past population size changes are Coalescent Hidden Markov Models (CHMMs). These models make efficient use of the linkage information in population genomic datasets by using the local genealogies relating sampled individuals as latent states that evolve across the chromosome in an HMM framework. Extending these models to large sample sizes is challenging, since the number of possible latent states increases rapidly. Here, we present our method CHIMP (CHMM History-Inference ML Procedure), a novel CHMM method for inferring the size history of a population. It can be applied to large samples (hundreds of haplotypes) and only requires unphased genomes as input. The two implementations of CHIMP that we present here use either the height of the genealogical tree (TMRCA) or the total branch length, respectively, as the latent variable at each position in the genome. The requisite transition and emission probabilities are obtained by numerically solving certain systems of differential equations derived from the ancestral process with recombination. The parameters of the population size history are subsequently inferred using an Expectation-Maximization algorithm. In addition, we implement a composite likelihood scheme to allow the method to scale to large sample sizes. We demonstrate the efficiency and accuracy of our method in a variety of benchmark tests using simulated data and present comparisons to other state-of-the-art methods. 
Specifically, our implementation using TMRCA as the latent variable shows comparable performance and provides accurate estimates of effective population sizes in intermediate and ancient times. Our method is agnostic to the phasing of the data, which makes it a promising alternative in scenarios where high quality data is not available, and has potential applications for pseudo-haploid data.


2019
Author(s):
Götz Kersting
Arno Siri-Jégousse
Alejandro H. Wences

We derive explicit formulas for the first two moments of the site frequency spectrum $(\mathrm{SFS}_{n,b})_{1 \le b \le n-1}$ of the Bolthausen-Sznitman coalescent along with some precise and efficient approximations, even for small sample sizes $n$. These results provide new $L^2$-asymptotics for some values of $b = o(n)$. We also study the length of internal branches carrying $b > n/2$ individuals. In this case we obtain the distribution function and a convergence in law. Our results rely on the random recursive tree construction of the Bolthausen-Sznitman coalescent.


2018
Author(s):
Andrew Melfi
Divakar Viswanath

The diversity in genomes is due to the accumulation of mutations, and the site frequency spectrum (SFS) is a popular statistic for summarizing genomic data. The current coalescent algorithm for calculating the SFS for a given demography assumes the μ → 0 limit, where μ is the mutation probability (or rate) per base pair per generation. The algorithm is applicable when μN, N being the haploid population size, is negligible. We derive a coalescent-based algorithm for calculating the SFS that allows the mutation rate μ(t) as well as the population size N(t) to vary arbitrarily as a function of time. That algorithm shows that the probability of two mutations in the genealogy becomes noticeable already for $\mu = 10^{-8}$ for samples of $n = 10^5$ haploid human genomes and increases rapidly with μ. Our algorithm calculates the SFS under the assumption of a single mutation in the genealogy, and the part of the SFS due to a single mutation depends only mildly on the finiteness of μ. However, the dependence of the SFS on variation in μ can be substantial for even n = 100 samples. In addition, increasing and decreasing mutation rates alter the SFS in different ways and to different extents.


2018
Vol 13 (4)
pp. 403-408
Author(s):
Jeff Bodington
Manuel Malfeito-Ferreira

Much research shows that women and men have different taste acuities and preferences. If female and male judges tend to assign different ratings to the same wines, then the gender balances of the judge panels will bias awards. Existing research supports the null hypothesis; however, that finding is based on small sample sizes. This article presents the results for a large sample: 260 wines and 1,736 wine-score observations. Subject to the strong qualification that non-gender-related variation is material, the results affirm that female and male judges do assign about the same ratings to the same wines. The expected value of the difference in their mean ratings is zero. (JEL Classifications: A10, C00, C10, C12, D12)


Author(s):  
Jessica F. McLaughlin
Kevin Winker

Sample size is a critical aspect of study design in population genomics research, yet few empirical studies have examined the impacts of small sample sizes. We used datasets from eight diverging bird lineages to make pairwise comparisons at different levels of taxonomic divergence (populations, subspecies, and species). Our data are from loci linked to ultraconserved elements (UCEs) and our analyses used one SNP per locus. All individuals were genotyped at all loci (McLaughlin et al. 2020). We estimated population demographic parameters (effective population size, migration rate, and time since divergence) in a coalescent framework using Diffusion Approximation for Demographic Inference (δaδi; Gutenkunst et al. 2009), an allele frequency spectrum (AFS) method. Using divergence-with-gene-flow models optimized with full datasets, we subsampled at sequentially smaller sample sizes from full datasets of 6–8 diploid individuals per population (with both alleles called) down to 1:1, and then we compared estimates and their changes in accuracy. Accuracy was strongly affected by sample size, with considerable differences among estimated parameters and among lineages. Effective population size parameters (ν) tended to be underestimated at low sample sizes (fewer than 3 diploid individuals per population, or 6:6 haplotypes in coalescent terms). Migration (m) was fairly consistently estimated until ≤ 2 individuals per population, and no consistent trend of over- or underestimation was found in either time since divergence (T) or Θ (4N_ref μ). Lineages that were taxonomically recognized above the population level (subspecies and species pairs; i.e., deeper divergences) tended to have lower variation in scaled root mean square error (SRMSE) of parameter estimation at smaller sample sizes than population-level divergences, and many parameters were estimated accurately down to 3 diploid individuals per population.
Shallower divergence levels (i.e., populations) often required at least 5 individuals per population for reliable demographic inferences using this approach. Although divergence levels might be unknown at the outset of study design, our results provide a framework for planning appropriate sampling and for interpreting results if smaller sample sizes must be used.


2019
Vol 3 (Supplement_1)
pp. S556-S556
Author(s):
Judy Poey
Laci Cornelison

Outcomes related to person-centered care in nursing homes have been difficult to ascertain. Much of the extant literature has suffered from differing definitions of what it means to be person-centered, variation in the levels of implementation of person-centered care that an organization has achieved, and small sample sizes. The PEAK program provides a unique opportunity to control for these variables across a large sample of nursing homes throughout the state of Kansas. This presentation will discuss the methodological advantages of evaluating the PEAK program and the findings from an evaluation of resident satisfaction in nursing homes at varying levels of implementation of person-centeredness.

